fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 6d224861c4
commit fd0f4a0365
1166 changed files with 358546 additions and 0 deletions
242
.claude/skills/add-user/SKILL.md
Normal file
@@ -0,0 +1,242 @@
---
|
||||
name: add-user
|
||||
description: |
|
||||
Add a new namespace-owner to the Kubernetes cluster. Use when:
|
||||
(1) "add user", "onboard user", "create user", "new namespace-owner",
|
||||
(2) someone new needs their own namespace and CI access,
|
||||
(3) user asks to set up cluster access for a person.
|
||||
Interactive: asks questions, updates Vault KV, applies stacks.
|
||||
---
|
||||
|
||||
# Add User
|
||||
|
||||
Add a new namespace-owner to the cluster. Two modes: **automated** (preferred) and **manual** (fallback).
|
||||
|
||||
SOPS state encryption access is **automatically provisioned** by the vault stack — per-stack Transit keys, policies, identity groups, and group aliases are all created from the `k8s_users` map. No manual SOPS setup required.
|
||||
|
||||
## Automated Flow (Preferred)
|
||||
|
||||
**Admin creates an Authentik invite → user signs up → provisioning happens automatically.**
|
||||
|
||||
### Steps
|
||||
|
||||
1. **Create Authentik Invitation**
   - Go to [Authentik Admin](https://authentik.viktorbarzin.me/if/admin/#/core/invitations)
   - Create a new invitation
   - Pre-assign the user to the **`kubernetes-namespace-owners`** group
   - Copy the invite link

2. **Send Invite Link to User**
   - The user clicks the link and signs up

3. **Automatic Provisioning (Vault KV + Authentik)**
   - Authentik fires a webhook to `webhook.viktorbarzin.me/authentik/provision`
   - The webhook handler validates the event and triggers the Woodpecker `provision-user` pipeline
   - Pipeline automatically:
     - Adds user to Vault KV (`secret/platform` → `k8s_users`) with convention defaults
     - Creates `sops-<username>` group in Authentik and assigns the user
     - Sends Slack notification with manual apply instructions

4. **Convention Defaults** (applied automatically)
   - Namespace: `username`
   - Quota: CPU 2, Memory 4Gi requests / 8Gi limits, 20 pods
   - Domains: none (user can request later)

5. **Manual Apply** (admin receives Slack notification)
   - The vault stack requires TLS certs (git-crypt) and can't run in CI. Apply manually:
   ```bash
   cd /Users/viktorbarzin/code/infra
   cd stacks/vault && ../../scripts/tg apply --non-interactive && cd ../..
   cd stacks/rbac && ../../scripts/tg apply --non-interactive && cd ../..
   cd stacks/woodpecker && ../../scripts/tg apply --non-interactive && cd ../..
   ```

6. **Post-Provisioning**
   - Send user the onboarding link: `https://k8s-portal.viktorbarzin.me/onboarding?role=namespace-owner`
   - If custom quota/domains needed, update Vault KV manually and re-apply stacks
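The convention defaults from step 4 can be sketched as a small helper that builds the `k8s_users` entry for a new user. This is only an illustration: the function name `default_user_entry` is hypothetical, and the entry shape is assumed to match the one used by the `jq` command in the manual flow, not taken from the actual `provision-user` pipeline code.

```python
import json


def default_user_entry(username: str, email: str) -> dict:
    """Build a k8s_users entry using the convention defaults (hypothetical helper)."""
    return {
        username: {
            "role": "namespace-owner",
            "email": email,
            "namespaces": [username],  # namespace defaults to the username
            "domains": [],             # no domains until the user requests them
            "quota": {
                "cpu_requests": "2",
                "memory_requests": "4Gi",
                "memory_limits": "8Gi",
                "pods": "20",
            },
        }
    }


if __name__ == "__main__":
    # Example values only; print the entry that would be merged into k8s_users.
    print(json.dumps(default_user_entry("alice", "alice@example.com"), indent=2))
```

Custom quota or domains would then be a matter of editing the resulting entry before it is written to Vault KV.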
|
||||
|
||||
### Monitoring the Pipeline
|
||||
|
||||
Watch the pipeline at: `https://ci.viktorbarzin.me` → infra repo → provision-user pipeline
|
||||
|
||||
## Manual Flow (Fallback)
|
||||
|
||||
Use when automated flow isn't available or custom configuration is needed.
|
||||
|
||||
### Step 1: Collect Information
|
||||
|
||||
Ask the user for ALL of the following before proceeding:
|
||||
|
||||
| Field | Question | Default |
|-------|----------|---------|
| `username` | Username (must match Forgejo username for CI) | — |
| `email` | Email address (used for OIDC identity) | — |
| `namespaces` | Namespace name(s) to create | `[username]` |
| `domains` | Subdomain(s) under viktorbarzin.me for their apps | `[]` |
| `cpu_requests` | CPU request quota | `"2"` |
| `memory_requests` | Memory request quota | `"4Gi"` |
| `memory_limits` | Memory limit quota | `"8Gi"` |
| `pods` | Max pods | `"20"` |
|
||||
|
||||
Also confirm:
|
||||
- Has the user been added to the **`kubernetes-namespace-owners`** group in [Authentik](https://authentik.viktorbarzin.me)? (Manual step — admin must do this in the UI)
|
||||
- Has the user been added to the **`sops-USERNAME`** group in Authentik? (Required for terraform state decrypt — the vault stack creates the Vault external group, but the Authentik group must exist and the user must be in it)
|
||||
- Does the user need VPN access? If yes, also add to **`Headscale Users`** group in Authentik.
|
||||
|
||||
**Do NOT proceed until the Authentik group assignments are confirmed.**
|
||||
|
||||
### Step 2: Update Vault KV
|
||||
|
||||
Read the current `k8s_users` JSON from Vault, add the new entry, and write it back.
|
||||
|
||||
```bash
|
||||
# Ensure authenticated
|
||||
vault login -method=oidc
|
||||
|
||||
# Read current value
|
||||
vault kv get -format=json secret/platform | jq -r '.data.data.k8s_users' > /tmp/k8s_users.json
|
||||
|
||||
# Add the new user entry (use jq to merge)
|
||||
jq --arg user "USERNAME" \
   --arg email "EMAIL" \
   --argjson ns '["NAMESPACE"]' \
   --argjson domains '["DOMAIN1"]' \
   --argjson quota '{"cpu_requests":"2","memory_requests":"4Gi","memory_limits":"8Gi","pods":"20"}' \
   '. + {($user): {"role":"namespace-owner","email":$email,"namespaces":$ns,"domains":$domains,"quota":$quota}}' \
   /tmp/k8s_users.json > /tmp/k8s_users_updated.json
|
||||
|
||||
# Write back — must write the entire platform secret, not just k8s_users
|
||||
# First get all current keys
|
||||
vault kv get -format=json secret/platform | jq -r '.data.data' > /tmp/platform_secret.json
|
||||
|
||||
# Update k8s_users key with new JSON (as a string, since complex types are stored as JSON strings)
|
||||
jq --arg users "$(cat /tmp/k8s_users_updated.json)" '.k8s_users = $users' /tmp/platform_secret.json > /tmp/platform_updated.json
|
||||
|
||||
# Write back
|
||||
vault kv put secret/platform @/tmp/platform_updated.json
|
||||
|
||||
# Clean up
|
||||
rm -f /tmp/k8s_users.json /tmp/k8s_users_updated.json /tmp/platform_secret.json /tmp/platform_updated.json
|
||||
```
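The read-modify-write above hinges on one detail: `k8s_users` is stored as a JSON-encoded string inside the `secret/platform` secret, so it has to be decoded, updated, and re-encoded while every other key stays untouched. A minimal sketch of that logic follows. The function name `add_user_to_platform_secret` is hypothetical, and refusing to overwrite an existing user is a choice made for this example (the `jq` merge above would replace the entry silently).

```python
import json


def add_user_to_platform_secret(platform: dict, username: str, entry: dict) -> dict:
    """Return a copy of the platform secret with `username` added to k8s_users.

    Assumes k8s_users is stored as a JSON-encoded string, as described above.
    """
    users = json.loads(platform.get("k8s_users", "{}"))
    if username in users:
        raise ValueError(f"user {username!r} already exists in k8s_users")
    users[username] = entry
    updated = dict(platform)  # shallow copy keeps all other keys of the secret
    updated["k8s_users"] = json.dumps(users)  # re-encode as a JSON string
    return updated
```

The returned dictionary corresponds to the contents of `/tmp/platform_updated.json` in the shell version.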
|
||||
|
||||
**Verify** the write:
|
||||
```bash
|
||||
vault kv get -field=k8s_users secret/platform | jq '.USERNAME'
|
||||
```
|
||||
|
||||
### Step 3: Apply Stacks
|
||||
|
||||
Apply in order. Use the `scripts/tg` wrapper.
|
||||
|
||||
```bash
|
||||
cd /Users/viktorbarzin/code/infra
|
||||
|
||||
# 1. Vault stack — creates namespace, Vault policy, identity entity, deployer role,
|
||||
# SOPS Transit key, SOPS policy, SOPS identity group + alias
|
||||
cd stacks/vault && ../../scripts/tg apply --non-interactive
|
||||
cd ../..
|
||||
|
||||
# 2. RBAC stack — creates RBAC bindings, ResourceQuota, TLS secret
|
||||
cd stacks/rbac && ../../scripts/tg apply --non-interactive
|
||||
cd ../..
|
||||
|
||||
# 3. Woodpecker stack — adds user to Woodpecker admin list
|
||||
cd stacks/woodpecker && ../../scripts/tg apply --non-interactive
|
||||
cd ../..
|
||||
```
|
||||
|
||||
### Step 4: Verify
|
||||
|
||||
```bash
|
||||
# Namespace exists
|
||||
kubectl get namespace USERNAME_NAMESPACE
|
||||
|
||||
# ResourceQuota applied
|
||||
kubectl describe resourcequota -n USERNAME_NAMESPACE
|
||||
|
||||
# Vault policy exists (namespace-owner + SOPS)
|
||||
vault policy read namespace-owner-USERNAME
|
||||
vault policy read sops-user-USERNAME
|
||||
|
||||
# Vault identity entity exists (with both policies)
|
||||
vault read identity/entity/name/USERNAME
|
||||
|
||||
# SOPS group exists
|
||||
vault read identity/group/name/sops-USERNAME
|
||||
|
||||
# K8s deployer role works
|
||||
vault write kubernetes/creds/NAMESPACE-deployer kubernetes_namespace=NAMESPACE
|
||||
|
||||
# SOPS Transit key exists
|
||||
vault read transit/keys/sops-state-NAMESPACE
|
||||
```
|
||||
|
||||
### Step 5: Notify User
|
||||
|
||||
Tell the user to share these onboarding instructions with the new user:
|
||||
- K8s Portal: `https://k8s-portal.viktorbarzin.me/onboarding?role=namespace-owner`
|
||||
- README: `https://github.com/ViktorBarzin/infra#new-user-onboarding`
|
||||
|
||||
**Web dashboard access** (auto-login, no token paste): the `rbac` stack
auto-creates a `dashboard-<user>` SA + token for every namespace-owner
(`dashboard-sa.tf`), and the **k8s-dashboard** stack's token-injector maps the
user's Authentik identity → that token (`dashboard_injector.tf`, auto-derived
from `k8s_users`). The new user just logs into `https://k8s.viktorbarzin.me` and
lands in the dashboard scoped to their namespace (`admin` on their namespace +
read-only on the namespace list & nodes for nav — no cross-tenant resource reads).
|
||||
> **Apply order for a new namespace-owner:** after the vault/rbac/woodpecker
> applies above, ALSO `cd stacks/k8s-dashboard && ../../scripts/tg apply` so the
> injector map picks up the new user. (Manual token fallback:
> `kubectl -n NAMESPACE get secret dashboard-USERNAME-token -o jsonpath='{.data.token}' | base64 -d`.)
> Seamless OIDC SSO is built but blocked — see
> `docs/plans/2026-06-04-k8s-dashboard-sso-design.md` §12.

> **Auto-login works only for the user's `k8s_users` HOME namespace.** The
> dashboard injects the user's `dashboard-<user>` SA token, which the `rbac`
> stack binds to `admin` on their home namespace only. If their workload lives
> in a DIFFERENT / pre-existing namespace (e.g. gheorghe's app is in `novelapp`,
> not his home `vabbit81`), that namespace's stack must ALSO grant their
> **dashboard SA** — `kind: ServiceAccount, name: dashboard-<user>, namespace:
> <home-ns>` — not just their OIDC `User` email (the dashboard uses the SA, and
> apiserver OIDC is blocked). See `stacks/novelapp/main.tf` `novelapp_owner_vabbit81`
> for the pattern (two subjects: User + SA). Best practice: set the user's
> `k8s_users` namespace to where their workload actually runs, so the home-ns
> auto-path covers them with no extra binding.
|
||||
|
||||
The user can decrypt their stack's state with:
|
||||
```bash
|
||||
vault login -method=oidc # authenticates via Authentik SSO
|
||||
scripts/state-sync decrypt NAMESPACE # decrypts only their stack
|
||||
```
|
||||
|
||||
## What Gets Auto-Generated
|
||||
|
||||
| Resource | Stack | Driven by |
|----------|-------|-----------|
| Kubernetes namespace | vault | `namespaces` list |
| Vault policy (`namespace-owner-{user}`) | vault | user key |
| Vault identity entity + OIDC alias | vault | user email |
| K8s deployer Role + Vault K8s role | vault | `namespaces` list |
| **SOPS Transit key** (`sops-state-{ns}`) | vault | `namespaces` list |
| **SOPS Vault policy** (`sops-user-{user}`) | vault | user key + namespaces |
| **SOPS identity group** (`sops-{user}`) | vault | user key |
| **SOPS group alias** (maps Authentik group) | vault | user key |
| RBAC RoleBinding (namespace admin) | rbac | `namespaces` list |
| RBAC ClusterRoleBinding (cluster read-only) | rbac | user role |
| ResourceQuota | rbac | `quota` object |
| TLS secret in namespace | rbac | `namespaces` list |
| Cloudflare DNS records | cloudflared | `domains` list |
| Woodpecker admin access | woodpecker | user key |
|
||||
|
||||
## Checklist (Manual Flow)
|
||||
|
||||
- [ ] Authentik: user added to `kubernetes-namespace-owners` group
|
||||
- [ ] Authentik: user added to `sops-USERNAME` group (for SOPS state decrypt)
|
||||
- [ ] Authentik: user added to `Headscale Users` group (if VPN needed)
|
||||
- [ ] Vault KV: `k8s_users` entry added to `secret/platform`
|
||||
- [ ] Vault stack applied — namespace + policy + identity + deployer role + SOPS Transit key + SOPS policy + SOPS group created
|
||||
- [ ] RBAC stack applied — RBAC + quota + TLS created
|
||||
- [ ] Woodpecker stack applied — admin list updated
|
||||
- [ ] Verification: namespace, quota, policies (namespace-owner + sops-user), deployer role, Transit key all confirmed
|
||||
- [ ] User notified with onboarding link
|
||||
170
.claude/skills/archived/authentik-oidc-kubernetes/SKILL.md
Normal file
@@ -0,0 +1,170 @@
---
|
||||
name: authentik-oidc-kubernetes
|
||||
description: |
|
||||
Configure Authentik as OIDC provider for Kubernetes API server authentication.
|
||||
Use when: (1) setting up OIDC auth for kubectl with Authentik, (2) kube-apiserver
|
||||
rejects OIDC tokens with "oidc: email not verified", (3) JWKS endpoint returns
|
||||
empty {} despite provider being configured, (4) kubelogin fails with "claim not
|
||||
present" for email, (5) redirect_uri mismatch errors during kubelogin browser auth,
|
||||
(6) kube-apiserver static pod manifest changes don't take effect after restart.
|
||||
Covers all gotchas discovered when integrating Authentik 2025.10.x with Kubernetes
|
||||
1.34.x using kubelogin (int128/kubelogin).
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-17
|
||||
---
|
||||
|
||||
# Authentik OIDC for Kubernetes API Authentication
|
||||
|
||||
## Problem
|
||||
Setting up Authentik as an OIDC identity provider for Kubernetes kubectl access
|
||||
involves multiple non-obvious pitfalls that cause silent failures at different
|
||||
stages of the authentication flow.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Setting up multi-user kubectl access with OIDC
|
||||
- Using Authentik as the identity provider and kubelogin (int128/kubelogin) as the kubectl plugin
|
||||
- Any of these errors:
|
||||
- `oidc: email not verified`
|
||||
- `oidc: parse username claims "email": claim not present`
|
||||
- `The request fails due to a missing, invalid, or mismatching redirection URI`
|
||||
- JWKS endpoint (`/application/o/<app>/jwks/`) returns `{}`
|
||||
- `Unauthorized` after successful browser login
|
||||
|
||||
## Solution
|
||||
|
||||
### Gotcha 1: Signing Key Must Be Assigned
|
||||
|
||||
Authentik's OAuth2 provider does NOT assign a signing key by default. Without it,
|
||||
the JWKS endpoint returns `{}` and kube-apiserver can't validate tokens.
|
||||
|
||||
**Fix:** Assign a signing key (e.g., "authentik Self-signed Certificate") to the
|
||||
OAuth2 provider:
|
||||
```python
|
||||
# Via Django shell (kubectl exec into authentik server pod)
|
||||
from authentik.providers.oauth2.models import OAuth2Provider
|
||||
from authentik.crypto.models import CertificateKeyPair
|
||||
|
||||
provider = OAuth2Provider.objects.get(name='kubernetes')
|
||||
cert = CertificateKeyPair.objects.filter(name='authentik Self-signed Certificate').first()
|
||||
provider.signing_key = cert
|
||||
provider.save()
|
||||
```
|
||||
|
||||
Or via API:
|
||||
```bash
|
||||
curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
|
||||
"$AUTHENTIK_URL/api/v3/providers/oauth2/<pk>/" \
|
||||
-d '{"signing_key": "<certificate-keypair-uuid>"}'
|
||||
```
|
||||
|
||||
### Gotcha 2: Default Email Mapping Sets `email_verified: False`
|
||||
|
||||
Authentik's built-in email scope mapping hardcodes `email_verified: False`:
|
||||
```python
|
||||
return {
|
||||
"email": request.user.email,
|
||||
"email_verified": False # <-- This causes kube-apiserver to reject the token
|
||||
}
|
||||
```
|
||||
|
||||
kube-apiserver requires `email_verified: true` by default.
|
||||
|
||||
**Fix:** Create a custom scope mapping with `email_verified: True` and assign it
|
||||
to the provider instead of the default:
|
||||
```python
from authentik.providers.oauth2.models import OAuth2Provider, ScopeMapping

# Create custom mapping
mapping, _ = ScopeMapping.objects.get_or_create(
    name='Kubernetes Email (verified)',
    defaults={
        'scope_name': 'email',
        'expression': 'return {"email": request.user.email, "email_verified": True}'
    }
)

# Replace default email mapping on the provider
provider = OAuth2Provider.objects.get(name='kubernetes')
default_email = ScopeMapping.objects.filter(
    managed='goauthentik.io/providers/oauth2/scope-email'
).first()
if default_email:
    provider.property_mappings.remove(default_email)
provider.property_mappings.add(mapping)
```
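To check whether a token actually carries `email_verified: true`, it can help to decode the ID token's payload and look at the claims directly. The sketch below does this with the standard library only and does not verify the signature, so it is suitable for debugging and nothing else. The helper names `decode_jwt_claims` and `email_claim_ok` are hypothetical, and how the token is obtained is left open.

```python
import base64
import json


def decode_jwt_claims(token: str) -> dict:
    """Decode the payload of a JWT WITHOUT verifying its signature (debugging only)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated parts")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def email_claim_ok(claims: dict) -> bool:
    """True when the token carries an email claim and marks it as verified."""
    return bool(claims.get("email")) and claims.get("email_verified") is True
```

If `email_claim_ok` returns `False` for a freshly issued token, the provider is probably still using the default email scope mapping, or the `email` scope was never requested.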
|
||||
|
||||
### Gotcha 3: kubelogin Needs Extra Scopes
|
||||
|
||||
By default, kubelogin only requests the `openid` scope. The token will lack
|
||||
`email` and `groups` claims, causing:
|
||||
```
|
||||
oidc: parse username claims "email": claim not present
|
||||
```
|
||||
|
||||
**Fix:** Add `--oidc-extra-scope` flags to the kubeconfig exec plugin:
|
||||
```yaml
users:
- name: oidc-user
  user:
    exec:
      command: kubectl
      args:
      - oidc-login
      - get-token
      - --oidc-issuer-url=https://authentik.example.com/application/o/kubernetes/
      - --oidc-client-id=kubernetes
      - --oidc-extra-scope=email  # Required!
      - --oidc-extra-scope=profile
      - --oidc-extra-scope=groups
```
|
||||
|
||||
### Gotcha 4: Redirect URIs Must Use Regex Mode
|
||||
|
||||
kubelogin does not listen on a fixed callback port: it tries 8000, then 18000,
then falls back to a random available port. Strict redirect URI matching such as
`http://localhost:8000/callback` therefore fails whenever kubelogin ends up on a
different port.
|
||||
|
||||
**Fix:** Use regex matching in the Authentik provider:
|
||||
```json
|
||||
{
|
||||
"redirect_uris": [
|
||||
{"matching_mode": "regex", "url": "http://localhost:.*"},
|
||||
{"matching_mode": "regex", "url": "http://127\\.0\\.0\\.1:.*"}
|
||||
]
|
||||
}
|
||||
```
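The effect of these patterns can be checked offline before changing the provider. The sketch below assumes the patterns are applied with full-match semantics, which may not be exactly how Authentik evaluates them. The `STRICTER_PATTERNS` list is a hypothetical alternative added for comparison, not something the original configuration uses.

```python
import re

# Patterns copied from the provider configuration above (JSON escaping removed).
REDIRECT_PATTERNS = [r"http://localhost:.*", r"http://127\.0\.0\.1:.*"]

# Hypothetical tighter variant: a numeric port, optionally followed by a path.
STRICTER_PATTERNS = [r"http://localhost:\d+(/.*)?", r"http://127\.0\.0\.1:\d+(/.*)?"]


def redirect_allowed(uri: str, patterns=REDIRECT_PATTERNS) -> bool:
    """Return True if `uri` fully matches any of the given patterns."""
    return any(re.fullmatch(p, uri) is not None for p in patterns)


if __name__ == "__main__":
    for candidate in ("http://localhost:8000", "http://localhost:18000", "https://localhost:8000"):
        print(candidate, redirect_allowed(candidate))
```

One thing this makes visible: because `.*` accepts anything after the colon, a URI such as `http://localhost:8000@evil.example` also matches the original patterns, while the stricter variant rejects it. Whether that matters depends on the threat model for a public client.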
|
||||
|
||||
### Gotcha 5: Property Mappings API Endpoint Changed
|
||||
|
||||
In Authentik 2025.10.x, scope mappings are at:
|
||||
- `propertymappings/provider/scope/` (new, correct)
|
||||
- NOT `propertymappings/scope/` (old, returns 405 Method Not Allowed on POST)
|
||||
|
||||
### Gotcha 6: Static Pod Manifest Changes Need Full Cycle
|
||||
|
||||
See skill: `kubelet-static-pod-manifest-update` for the full restart procedure.
|
||||
|
||||
## Verification
|
||||
|
||||
After all fixes:
|
||||
```bash
|
||||
# 1. JWKS has a key
|
||||
curl -s https://authentik.example.com/application/o/kubernetes/jwks/ | jq '.keys | length'
|
||||
# Expected: 1 (or more)
|
||||
|
||||
# 2. Test auth
|
||||
KUBECONFIG=/path/to/oidc-kubeconfig kubectl get namespaces
|
||||
# Expected: browser opens, login, namespaces returned
|
||||
|
||||
# 3. Check API server logs for success
|
||||
ssh master "sudo kubectl logs -n kube-system -l component=kube-apiserver --tail=-1 | grep oidc | tail -5"
|
||||
# Expected: no "Unable to authenticate" errors
|
||||
```
|
||||
|
||||
## Notes
|
||||
- The OAuth2 provider should use `client_type: public` (no client secret needed for kubelogin)
|
||||
- Set `sub_mode: user_email` so the OIDC subject matches the RBAC binding
|
||||
- Set `include_claims_in_id_token: true` for the token to contain claims directly
|
||||
- Use `issuer_mode: per_provider` for a clean issuer URL
|
||||
- RBAC ClusterRoleBindings should match on the user's email (the `--oidc-username-claim=email` value)
|
||||
297
.claude/skills/archived/authentik/SKILL.md
Normal file
@@ -0,0 +1,297 @@
---
|
||||
name: authentik
|
||||
description: |
|
||||
Manage the Authentik identity provider via its REST API. Use when:
|
||||
(1) User asks to create, update, or delete users in Authentik,
|
||||
(2) User asks to manage groups or group memberships,
|
||||
(3) User asks to create a new OAuth2/OIDC application or provider,
|
||||
(4) User asks to protect a service with forward auth (Authentik + Traefik),
|
||||
(5) User asks about SSO, single sign-on, authentication, or identity,
|
||||
(6) User asks to manage Authentik flows, stages, or policies,
|
||||
(7) User asks to configure social login (Google, GitHub, Facebook),
|
||||
(8) User asks about OIDC for Kubernetes or who has access to what,
|
||||
(9) User deploys a new service that needs authentication.
|
||||
Authentik v2025.10.3 running in Kubernetes, managed via REST API.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-17
|
||||
---
|
||||
|
||||
# Authentik Identity Provider Management
|
||||
|
||||
## Overview
|
||||
- **URL**: `https://authentik.viktorbarzin.me`
|
||||
- **Admin UI**: `https://authentik.viktorbarzin.me/if/admin/`
|
||||
- **API Base**: `https://authentik.viktorbarzin.me/api/v3/`
|
||||
- **API Docs**: `https://authentik.viktorbarzin.me/api/v3/docs/`
|
||||
- **Helm Chart**: authentik v2025.10.3
|
||||
- **Namespace**: `authentik`
|
||||
|
||||
## API Access
|
||||
|
||||
### Getting the Token
|
||||
The API token is stored in `terraform.tfvars` (git-crypt encrypted):
|
||||
```bash
|
||||
AUTHENTIK_TOKEN=$(grep authentik_api_token terraform.tfvars | cut -d'"' -f2)
|
||||
```
|
||||
|
||||
### Making API Calls
|
||||
```bash
|
||||
# Generic pattern
|
||||
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/<endpoint>/"
|
||||
|
||||
# With JSON body (POST/PATCH/PUT)
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/<endpoint>/" \
|
||||
-d '{"key": "value"}'
|
||||
```
|
||||
|
||||
### Verify Token Works
|
||||
```bash
|
||||
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/users/me/" | python3 -m json.tool
|
||||
```
|
||||
|
||||
## Key API Endpoints
|
||||
|
||||
| Endpoint | Methods | Purpose |
|----------|---------|---------|
| `core/users/` | GET, POST | List/create users |
| `core/users/{id}/` | GET, PATCH, DELETE | Get/update/delete user |
| `core/groups/` | GET, POST | List/create groups |
| `core/groups/{pk}/` | GET, PATCH, DELETE | Get/update/delete group |
| `core/applications/` | GET, POST | List/create applications |
| `core/tokens/` | GET, POST | List/create tokens |
| `core/tokens/{identifier}/view_key/` | GET | View token secret key |
| `providers/all/` | GET | List all providers |
| `providers/oauth2/` | GET, POST | OAuth2/OIDC providers |
| `providers/proxy/` | GET, POST | Proxy providers (forward auth) |
| `flows/instances/` | GET | List flows |
| `stages/all/` | GET | List stages |
| `sources/all/` | GET | List sources (social login) |
| `outposts/instances/` | GET | List outposts |
| `propertymappings/provider/scope/` | GET, POST | OIDC scope mappings |
| `rbac/roles/` | GET | List roles |
|
||||
|
||||
## Common Operations
|
||||
|
||||
### List All Users
|
||||
```bash
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
  "https://authentik.viktorbarzin.me/api/v3/core/users/?page_size=50" | \
  python3 -c "
import json,sys
for u in json.load(sys.stdin)['results']:
    groups=[g['name'] for g in u.get('groups_obj',[])]
    print(f\" {u['username']:<40} {u['name']:<30} groups={groups}\")
"
```
|
||||
|
||||
### Create a New User
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/users/" \
|
||||
-d '{
|
||||
"username": "user@example.com",
|
||||
"name": "Full Name",
|
||||
"email": "user@example.com",
|
||||
"is_active": true,
|
||||
"type": "internal",
|
||||
"path": "users"
|
||||
}'
|
||||
```
|
||||
|
||||
### Add User to Group
|
||||
```bash
|
||||
# First get the group to find current users
|
||||
GROUP_PK="<group-uuid>"
|
||||
CURRENT_USERS=$(curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/groups/$GROUP_PK/" | \
|
||||
python3 -c "import json,sys; print(json.load(sys.stdin)['users'])")
|
||||
|
||||
# Then PATCH with the updated user list (add new user pk)
|
||||
curl -s -X PATCH \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/groups/$GROUP_PK/" \
|
||||
-d '{"users": [<existing_pks>, <new_pk>]}'
|
||||
```
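Because the PATCH replaces the whole membership list, the body has to contain the existing members plus the new one. A minimal sketch of building that body is shown here. The function name `group_patch_body` is hypothetical, and user primary keys are assumed to be integers as returned in the group's `users` field.

```python
import json


def group_patch_body(current_user_pks: list, new_user_pk: int) -> str:
    """Build the JSON body for the group PATCH so that existing members are kept."""
    merged = list(current_user_pks)  # copy so the caller's list is not modified
    if new_user_pk not in merged:    # adding an existing member is a no-op
        merged.append(new_user_pk)
    return json.dumps({"users": merged})
```

The returned string can be passed to `curl -d` in place of the placeholder body above.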
|
||||
|
||||
### Create a New Group
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/groups/" \
|
||||
-d '{
|
||||
"name": "My New Group",
|
||||
"is_superuser": false,
|
||||
"parent": "<parent-group-pk-or-null>"
|
||||
}'
|
||||
```
|
||||
|
||||
### Create OAuth2/OIDC Application (Full Flow)
|
||||
|
||||
**Step 1: Create the OAuth2 Provider**
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/providers/oauth2/" \
|
||||
-d '{
|
||||
"name": "Provider for myapp",
|
||||
"authorization_flow": "<flow-pk>",
|
||||
"invalidation_flow": "<invalidation-flow-pk>",
|
||||
"client_type": "confidential",
|
||||
"client_id": "<generated-or-custom>",
|
||||
"client_secret": "<generated-or-custom>",
"redirect_uris": [{"matching_mode": "strict", "url": "https://myapp.viktorbarzin.me/callback"}],
|
||||
"property_mappings": ["<scope-mapping-pks>"],
|
||||
"signing_key": "<signing-key-pk>"
|
||||
}'
|
||||
```
|
||||
|
||||
**Step 2: Create the Application**
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/applications/" \
|
||||
-d '{
|
||||
"name": "My App",
|
||||
"slug": "myapp",
|
||||
"provider": <provider-pk-from-step-1>,
|
||||
"meta_launch_url": "https://myapp.viktorbarzin.me"
|
||||
}'
|
||||
```
|
||||
|
||||
### List Applications
|
||||
```bash
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
  "https://authentik.viktorbarzin.me/api/v3/core/applications/?page_size=50" | \
  python3 -c "
import json,sys
for a in json.load(sys.stdin)['results']:
    # provider_obj may be null for applications without a provider
    ptype = (a.get('provider_obj') or {}).get('verbose_name','N/A')
    print(f\" {a['name']:<30} slug={a['slug']:<25} provider={ptype}\")
"
```
|
||||
|
||||
### Create a Non-Expiring API Token
|
||||
```bash
|
||||
# Create token
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/tokens/" \
|
||||
-d '{
|
||||
"identifier": "my-token-name",
|
||||
"intent": "api",
|
||||
"expiring": false,
|
||||
"description": "Description here"
|
||||
}'
|
||||
|
||||
# Retrieve the key
|
||||
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/core/tokens/my-token-name/view_key/"
|
||||
```
|
||||
|
||||
## Important Reference UUIDs
|
||||
|
||||
### Authorization Flows
|
||||
| Flow | Slug | Use For |
|------|------|---------|
| Authorize Application (explicit consent) | `default-provider-authorization-explicit-consent` | Apps that should show consent screen |
| Authorize Application (implicit consent) | `default-provider-authorization-implicit-consent` | Internal/trusted apps, auto-redirect |
| Logout | `default-invalidation-flow` | Invalidation/logout flow |
|
||||
|
||||
### Common Property Mappings (OIDC Scopes)
|
||||
These are the standard scope mappings used by most providers:
|
||||
- `60e33a8c-66a2-414f-840c-b13012b4d4bd` — openid
|
||||
- `1f51c659-f13b-4ad4-ba89-70458ef88e9c` — email
|
||||
- `4c0bf430-7f74-4216-b9d7-23703ab544ba` — profile
|
||||
|
||||
### Login Sources
|
||||
| Source | Slug | Matching Mode |
|--------|------|---------------|
| Google | `google` | identifier |
| GitHub | `github` | email_link |
| Facebook | `facebook` | email_link |
|
||||
|
||||
## Protecting a Service with Forward Auth
|
||||
|
||||
To protect a service via Authentik + Traefik forward auth:
|
||||
|
||||
1. In the service's Terraform module, set `protected = true` in the `ingress_factory` call
|
||||
2. This adds the `authentik-forward-auth` Traefik middleware
|
||||
3. Unauthenticated users get redirected to the Authentik login page
|
||||
4. After login, these headers are forwarded to the service:
|
||||
- `X-authentik-username`
|
||||
- `X-authentik-uid`
|
||||
- `X-authentik-email`
|
||||
- `X-authentik-name`
|
||||
- `X-authentik-groups`
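On the service side, the forwarded headers listed above can be turned into a small identity object. The following is a sketch under stated assumptions rather than a reference implementation: the names `AuthentikIdentity` and `identity_from_headers` are hypothetical, and the `|` separator for `X-authentik-groups` is an assumption that should be checked against the running Authentik version.

```python
from dataclasses import dataclass, field
from typing import Mapping, Optional


@dataclass
class AuthentikIdentity:
    username: str
    uid: str = ""
    email: str = ""
    name: str = ""
    groups: list = field(default_factory=list)


def identity_from_headers(headers: Mapping[str, str]) -> Optional[AuthentikIdentity]:
    """Extract the forwarded identity, or None if no username header is present."""
    lowered = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    username = lowered.get("x-authentik-username")
    if not username:
        return None
    raw_groups = lowered.get("x-authentik-groups", "")
    return AuthentikIdentity(
        username=username,
        uid=lowered.get("x-authentik-uid", ""),
        email=lowered.get("x-authentik-email", ""),
        name=lowered.get("x-authentik-name", ""),
        # Assumption: group names are joined with "|" in the header value.
        groups=[g for g in raw_groups.split("|") if g],
    )
```

These headers are only trustworthy when the service cannot be reached except through Traefik, since any client with direct access could set them itself.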
|
||||
|
||||
## Invitation Management
|
||||
|
||||
### Create Invitation
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/stages/invitation/invitations/" \
|
||||
-d '{
|
||||
"name": "invite-slug-name",
|
||||
"single_use": true,
|
||||
"fixed_data": {"group": "Target Group Name"},
|
||||
"flow": "<invitation-enrollment-flow-pk>"
|
||||
}'
|
||||
# Returns PK which is the itoken
|
||||
# Link: https://authentik.viktorbarzin.me/if/flow/invitation-enrollment/?itoken=<pk>
|
||||
```
|
||||
|
||||
### List Invitations
|
||||
```bash
|
||||
curl -s -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/stages/invitation/invitations/?page_size=50"
|
||||
```
|
||||
|
||||
### Delete Invitation
|
||||
```bash
|
||||
curl -s -X DELETE -H "Authorization: Bearer $AUTHENTIK_TOKEN" \
|
||||
"https://authentik.viktorbarzin.me/api/v3/stages/invitation/invitations/<pk>/"
|
||||
```
|
||||
|
||||
### Helper Script
|
||||
Use `.claude/scripts/authentik-invite.sh` for invitation management:
|
||||
```bash
|
||||
./authentik-invite.sh create "Group Name" [--days N]
|
||||
./authentik-invite.sh assign <username> "Group Name"
|
||||
./authentik-invite.sh list
|
||||
```
|
||||
|
||||
### Important Notes
|
||||
- OAuth source `enrollment_flow` is set to `invitation-enrollment` -- new social login users require invitation
|
||||
- Source updates require Django ORM (PATCH not supported on `sources/oauth/<slug>/`)
|
||||
- Invitation `name` field must be a slug (letters, numbers, hyphens, underscores)
|
||||
|
||||
## Gotchas
|
||||
|
||||
1. **API pagination**: All list endpoints return paginated results. Use `?page_size=50` or check `pagination.next` for more pages.
2. **Group user updates**: PATCH to groups replaces the entire user list — always fetch current users first, then append.
3. **Provider property mappings**: Must reference existing scope mapping UUIDs. Query `propertymappings/provider/scope/` to find them.
4. **Signing key for OIDC**: Must assign a signing key to OAuth2 providers or JWKS endpoint returns empty `{}`.
5. **Email verified claim**: Default email scope mapping sets `email_verified: False`. For Kubernetes OIDC, create a custom mapping that returns `True`.
6. **Token identifier uniqueness**: Token identifiers must be unique across the entire instance.
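The pagination gotcha can be handled with a small generator that keeps requesting pages until `pagination.next` is empty. This sketch keeps the HTTP call out of the loop: `fetch_page` is a hypothetical callable that returns the decoded JSON for a given page number, and the loop assumes `pagination.next` holds the next page number and is `0` or missing on the last page.

```python
from typing import Callable, Iterator


def iter_results(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every item from a paginated list endpoint, page by page."""
    page = 1
    while page:
        body = fetch_page(page)
        yield from body["results"]
        # Assumption: a falsy `next` (0 or absent) means there are no more pages.
        page = body.get("pagination", {}).get("next") or 0
```

In practice `fetch_page` would wrap the same `curl` or HTTP client call used elsewhere in this skill, adding `page=<n>` to the query string.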

## Notes
- Authentik is classified as DEFCON Level 1 (Critical); handle with care
- Changes to Authentik configuration (Helm chart, PgBouncer, etc.) must go through Terraform
- API-level changes (users, groups, applications) are fine to make directly via the API
- The embedded outpost auto-discovers providers assigned to it
- See also: `ingress-factory-migration` skill for protecting services
175
.claude/skills/archived/bluestacks-burp-interception/SKILL.md
Normal file
@ -0,0 +1,175 @@
---
name: bluestacks-burp-interception
description: |
  Intercept Android app HTTPS traffic using BlueStacks and Burp Suite on macOS.
  Use when: (1) Need to analyze Android app API calls, (2) App ignores HTTP proxy,
  (3) App uses SSL pinning that blocks interception, (4) Need to install Burp CA
  as system certificate. Covers ADB setup, proxy configuration, Zygisk SSL unpinning,
  and Magisk trustusercerts module for system CA installation.
author: Claude Code
version: 1.0.0
date: 2026-01-24
---

# BlueStacks + Burp Suite HTTPS Traffic Interception

## Problem
You want to intercept HTTPS traffic from an Android app running in BlueStacks to analyze
API calls, but the app either ignores the proxy or uses SSL certificate pinning.

## Context / Trigger Conditions
- Running BlueStacks on macOS with Burp Suite
- App traffic not appearing in Burp Suite
- App crashes or refuses to connect when proxy is set
- Need to bypass SSL pinning for security testing/research

## Prerequisites
- BlueStacks with Magisk (kitsune variant) and root enabled
- Zygisk-SSL-Unpinning module installed
- trustusercerts Magisk module installed
- Android SDK installed (for ADB)
- Burp Suite running on port 8080

## Solution

### Step 1: Connect ADB to BlueStacks

```bash
# ADB location on macOS (Android SDK)
ADB=~/Library/Android/sdk/platform-tools/adb

# Connect to BlueStacks
$ADB connect localhost:5555

# Verify connection
$ADB devices
# Should show: emulator-5554 or localhost:5555
```

Note: BlueStacks runs **arm64-v8a** (not x86 as you might expect).

### Step 2: Set HTTP Proxy

Use your Mac's WiFi IP address (not 10.0.2.2 or localhost):

```bash
# Get Mac WiFi IP
IP=$(ipconfig getifaddr en0)

# Set proxy (Burp default port 8080)
$ADB shell settings put global http_proxy ${IP}:8080

# Verify
$ADB shell settings get global http_proxy

# Disable proxy when done
$ADB shell settings put global http_proxy :0
```

### Step 3: Configure SSL Unpinning for Target App

```bash
# Find app package name
$ADB shell pm list packages | grep <keyword>

# Edit config
$ADB shell "su -c 'cat > /data/local/tmp/zyg.ssl/config.json << EOF
{
    \"targets\": [
        {
            \"pkg_name\" : \"com.example.app\",
            \"enable\": true,
            \"start_safe\": true,
            \"start_delay\": 1000
        }
    ]
}
EOF'"

# Restart the app
$ADB shell am force-stop com.example.app
$ADB shell monkey -p com.example.app -c android.intent.category.LAUNCHER 1

# Verify SSL unpinning is active
$ADB shell "logcat -d | grep -i ZygiskSSL | tail -10"
# Should show: "App detected: com.example.app" and "[*] SSL UNPINNING [#]"
```
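
The heredoc above needs careful quote escaping, and a stray quote is an easy way to end up with invalid `config.json`. As an alternative, the same file can be generated with Python's `json` module so the syntax is always valid. This is a sketch under assumptions: the function name and the multi-package loop are illustrative, and only the four keys shown in Step 3 are used.

```python
import json


def build_unpinning_config(pkg_names: list[str], start_delay_ms: int = 1000) -> str:
    """Return config.json content with one target entry per package name.

    Only the keys shown in Step 3 are emitted; any other keys the module
    may support are not covered here.
    """
    targets = [
        {
            "pkg_name": pkg,
            "enable": True,
            "start_safe": True,
            "start_delay": start_delay_ms,
        }
        for pkg in pkg_names
    ]
    return json.dumps({"targets": targets}, indent=4)


if __name__ == "__main__":
    # Print the config; redirect to a file and copy it to the device as root.
    print(build_unpinning_config(["com.example.app"]))
```

One way to use it is to write the output to a local file, `$ADB push` it to `/sdcard/`, then copy it to `/data/local/tmp/zyg.ssl/config.json` with `su -c cp`, mirroring the certificate workflow in Step 4.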

### Step 4: Install Burp CA as System Certificate

```bash
# Download Burp CA cert
curl -x http://127.0.0.1:8080 http://burp/cert -o /tmp/burp-cert.der

# Convert to PEM
openssl x509 -inform DER -in /tmp/burp-cert.der -out /tmp/burp-cert.pem

# Get hash for Android cert store naming
HASH=$(openssl x509 -inform PEM -subject_hash_old -in /tmp/burp-cert.pem | head -1)
cp /tmp/burp-cert.pem /tmp/${HASH}.0

# Push to device
$ADB push /tmp/${HASH}.0 /sdcard/

# Install via trustusercerts Magisk module
$ADB shell "su -c 'cp /sdcard/${HASH}.0 /data/adb/modules/trustusercerts/system/etc/security/cacerts/'"
$ADB shell "su -c 'chmod 644 /data/adb/modules/trustusercerts/system/etc/security/cacerts/${HASH}.0'"

# Reboot required for Magisk overlay
$ADB shell "su -c 'reboot'"

# After reboot, verify cert is in system store
$ADB shell "su -c 'ls /system/etc/security/cacerts/${HASH}.0'"
```

### Step 5: Test Interception

1. Re-enable proxy after reboot: `$ADB shell settings put global http_proxy ${IP}:8080`
2. Launch target app
3. Check Burp Suite → Proxy → HTTP history for requests

## Verification

- Proxy set: `adb shell settings get global http_proxy` returns `<ip>:8080`
- SSL unpinning active: `logcat | grep ZygiskSSL` shows "SSL UNPINNING"
- Burp CA installed: `ls /system/etc/security/cacerts/<hash>.0` exists
- Traffic visible in Burp Suite HTTP history

## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| No traffic in Burp | Proxy not set | Check `settings get global http_proxy` |
| App shows SSL error | Cert not installed | Verify cert in system store, reboot |
| SSL unpinning not working | Config not loaded | Force-stop app, check config.json syntax |
| ADB connection refused | BlueStacks ADB disabled | Enable in BlueStacks Settings → Advanced |
| Wrong cert hash | Using wrong openssl flag | Use `subject_hash_old` not `subject_hash` |

## Notes

- BlueStacks runs arm64-v8a, so Zygisk modules need arm64 support
- The trustusercerts module copies certs at boot via Magisk overlay
- System partition is read-only; use Magisk modules instead of direct mounting
- Burp cert hash is typically `9a5ba575` but verify for your instance
- Some apps may use additional protections (root detection, Frida detection)

## Quick Reference

```bash
# Set proxy
adb shell settings put global http_proxy <ip>:8080

# Disable proxy
adb shell settings put global http_proxy :0

# Check SSL unpinning logs
adb shell "logcat -d | grep -i ZygiskSSL"

# Force restart app
adb shell am force-stop <package> && adb shell monkey -p <package> -c android.intent.category.LAUNCHER 1
```

## References
- [Zygisk-SSL-Unpinning](https://github.com/m0szy/Zygisk-SSL-Unpinning)
- [MagiskTrustUserCerts](https://github.com/NVISOsecurity/MagiskTrustUserCerts)
- [Burp Suite Documentation](https://portswigger.net/burp/documentation)
@ -0,0 +1,189 @@
---
name: clickhouse-k8s-nfs-system-log-bloat
description: |
  Fix for ClickHouse consuming excessive CPU (500m-1000m+) on Kubernetes when running on
  NFS storage, caused by unbounded system log table growth triggering continuous background
  merges. Use when: (1) ClickHouse burns ~1 CPU core with no active user queries,
  (2) system.merges shows constant merge activity on system.metric_log or system.trace_log,
  (3) system log tables (metric_log, trace_log, text_log, asynchronous_metric_log) have
  grown to gigabytes while actual user data is tiny, (4) ClickHouse crashes with exit code
  76 (loadOutdatedDataParts SIGSEGV), (5) attempting to mount custom config.d XML via
  Kubernetes ConfigMap causes exit code 36 (BAD_ARGUMENTS) crashes. Also covers why
  ClickHouse's MergeTree engine performs poorly on NFS and the CronJob workaround for
  system log truncation.
author: Claude Code
version: 1.0.0
date: 2026-03-01
---

# ClickHouse on Kubernetes/NFS: System Log Bloat & CPU Overhead

## Problem

ClickHouse deployed on Kubernetes with NFS storage consumes ~1 CPU core continuously,
even when actual user queries are negligible. The CPU is consumed by background merge
operations on system log tables that grow unboundedly with no default TTL.

## Context / Trigger Conditions

- ClickHouse pod using 500m-1000m+ CPU with no active user queries
- `SELECT * FROM system.processes` shows only diagnostic queries
- `SELECT * FROM system.merges` shows constant merge activity on `system.metric_log`
- System log tables have grown to gigabytes:
  - `system.trace_log`: 5+ GiB, 200M+ rows
  - `system.text_log`: 3+ GiB, 90M+ rows
  - `system.metric_log`: 1+ GiB with 80-100+ active parts (healthy is <20)
  - `system.asynchronous_metric_log`: 500+ MiB, 1B+ rows
- Actual user data (e.g., `clickhouse.events`) is only kilobytes
- ClickHouse crashes periodically with exit code 76 (`loadOutdatedDataParts` SIGSEGV)
- Data directory is on NFS (e.g., `/mnt/main/clickhouse`)

## Root Cause

Two compounding issues:

1. **No TTL on system log tables**: ClickHouse system tables (`metric_log`, `trace_log`,
   `text_log`, `asynchronous_metric_log`, `query_log`, `part_log`) have no default
   retention policy and grow indefinitely.

2. **NFS amplifies merge overhead**: ClickHouse's MergeTree engine relies on background
   merge operations that involve heavy sequential I/O. NFS latency makes merges 10-100x
   slower than local disk, creating a feedback loop:
   - Slow merges → parts accumulate faster than they can be merged
   - More parts → more merge operations spawned
   - More merges → more CPU for decompression/recompression while waiting on NFS I/O

## Solution

### Immediate Fix: Truncate System Tables

```bash
CH_POD=$(kubectl get pod -n <namespace> -l app=clickhouse -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.metric_log"
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.trace_log"
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.text_log"
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log"
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.query_log"
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query "TRUNCATE TABLE IF EXISTS system.part_log"
```

This can take 30-60+ seconds per table on NFS due to part cleanup I/O.

### Permanent Fix: CronJob for Periodic Truncation

Add a Kubernetes CronJob that truncates system tables via the ClickHouse HTTP API:

```hcl
resource "kubernetes_cron_job_v1" "clickhouse_truncate_logs" {
  metadata {
    name      = "clickhouse-truncate-logs"
    namespace = "<namespace>"
  }
  spec {
    schedule                      = "0 */6 * * *"
    successful_jobs_history_limit = 1
    failed_jobs_history_limit     = 1
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            restart_policy = "OnFailure"
            container {
              name  = "truncate"
              image = "curlimages/curl:8.12.1"
              command = ["sh", "-c", join(" && ", [
                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.metric_log'",
                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.trace_log'",
                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.text_log'",
                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.asynchronous_metric_log'",
                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.query_log'",
                "curl -s 'http://clickhouse.<ns>.svc.cluster.local:8123/?user=default&password=<pw>' -d 'TRUNCATE TABLE IF EXISTS system.part_log'",
                "echo 'System logs truncated'"
              ])]
            }
          }
        }
      }
    }
  }
}
```
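
The same truncation loop can be sketched in Python, which makes the table list a single source of truth and surfaces HTTP errors. This is an illustration only, not the deployed job: the function names are hypothetical, and `base_url`, `user` and `password` stand in for the `<ns>` and `<pw>` placeholders above. It uses the same HTTP interface as the CronJob, with the statement in the POST body.

```python
import urllib.parse
import urllib.request

# System log tables named in the Root Cause section.
SYSTEM_LOG_TABLES = [
    "metric_log",
    "trace_log",
    "text_log",
    "asynchronous_metric_log",
    "query_log",
    "part_log",
]


def truncate_statements(tables: list[str] = SYSTEM_LOG_TABLES) -> list[str]:
    """Build one TRUNCATE statement per system log table."""
    return [f"TRUNCATE TABLE IF EXISTS system.{table}" for table in tables]


def truncate_all(base_url: str, user: str, password: str, timeout: float = 120.0) -> None:
    """POST each statement to the ClickHouse HTTP interface.

    urlopen raises urllib.error.HTTPError on a non-2xx response, so a failed
    truncation stops the loop instead of passing silently.
    """
    query = urllib.parse.urlencode({"user": user, "password": password})
    for statement in truncate_statements():
        request = urllib.request.Request(
            f"{base_url}/?{query}", data=statement.encode(), method="POST"
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            response.read()
```

Note that `curl -s` in the CronJob exits 0 even when ClickHouse returns an HTTP error; adding `-f` (or an approach like the one above) is one way to make failures visible in the Job status.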

### What Does NOT Work: Config.d XML Mount

**DO NOT** attempt to mount custom XML config files into `/etc/clickhouse-server/config.d/`
via Kubernetes ConfigMap. Both approaches crash ClickHouse with exit code 36 (BAD_ARGUMENTS):

- **Full directory mount** (`mount_path = "/etc/clickhouse-server/config.d"`): Replaces
  the entire directory, deleting the built-in `docker_related_config.xml` that the
  entrypoint expects. Even if you include it in your ConfigMap, ClickHouse still crashes.

- **sub_path mount** (`sub_path = "custom.xml"`): Also crashes with exit code 36, even
  with minimal valid XML containing only `<background_pool_size>4</background_pool_size>`.

- Both `remove="1"` (to disable tables) and `<ttl>` (to set retention) config overrides
  crash with exit code 36.

This appears to be an issue with the `clickhouse/clickhouse-server:25.4.2` Docker image
and how it preprocesses config at startup. The CronJob approach bypasses this entirely.

## Verification

After truncation, verify:

```bash
# CPU should drop from ~900m to ~100m within minutes
kubectl top pod -n <namespace> -l app=clickhouse

# No active merges
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query \
  "SELECT count() FROM system.merges"

# System tables should be small
kubectl exec -n <namespace> $CH_POD -- clickhouse-client --query \
  "SELECT database, table, formatReadableSize(sum(bytes_on_disk)) as size, sum(rows) as rows \
   FROM system.parts WHERE active GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC \
   FORMAT Pretty"
```

## Diagnostic Commands

```bash
# Check what's consuming CPU (merges vs queries)
kubectl exec -n <ns> $CH_POD -- clickhouse-client --query \
  "SELECT * FROM system.merges FORMAT Pretty"

kubectl exec -n <ns> $CH_POD -- clickhouse-client --query \
  "SELECT query_id, elapsed, query FROM system.processes WHERE is_initial_query FORMAT Pretty"

# Check background pool config
kubectl exec -n <ns> $CH_POD -- clickhouse-client --query \
  "SELECT name, value FROM system.server_settings \
   WHERE name IN ('background_pool_size', 'background_merges_mutations_concurrency_ratio') \
   FORMAT Pretty"

# Default is background_pool_size=16, concurrency_ratio=2 → up to 32 concurrent merges
```

## Notes

- **Exit code 76**: ClickHouse crashes in `loadOutdatedDataParts()` when there are hundreds
  of outdated parts on NFS. The truncation CronJob prevents this by keeping tables small.

- **Exit code 36**: `BAD_ARGUMENTS` in ClickHouse. Triggered by config.d XML mounts in
  Kubernetes. Root cause unclear but reproducible across mount methods.

- **Default thread pools**: ClickHouse defaults to `background_pool_size=16` and
  `background_schedule_pool_size=512`, spawning 700+ threads even for a single-table
  workload. This overhead is unavoidable without config file changes.

- **NFS is fundamentally unsuitable** for ClickHouse's MergeTree engine. If data
  persistence is not critical (e.g., analytics data is small), consider `emptyDir` or
  local PV storage instead.

## See Also

- `k8s-nfs-mount-troubleshooting`: NFS mount failures and permission issues
- `k8s-limitrange-oom-silent-kill`: LimitRange defaults causing OOM in ClickHouse containers
145
.claude/skills/archived/coturn-k8s-without-hostnetwork/SKILL.md
Normal file
@ -0,0 +1,145 @@
---
name: coturn-k8s-without-hostnetwork
description: |
  Deploy coturn (TURN/STUN server) on Kubernetes without hostNetwork by using a
  narrow relay port range and MetalLB LoadBalancer service. Use when: (1) deploying
  a WebRTC relay server on k8s, (2) want coturn to run on any node (not pinned),
  (3) avoiding hostNetwork for better pod scheduling and multi-replica support,
  (4) need TURN for NAT traversal in WebRTC apps (video streaming, conferencing).
  Covers relay port range sizing, MetalLB IP sharing, ephemeral TURN credentials
  via HMAC-SHA1, and pfSense port forwarding.
author: Claude Code
version: 1.0.0
date: 2026-02-21
---

# coturn on Kubernetes Without hostNetwork

## Problem
TURN servers traditionally require hostNetwork because they relay media over a wide
UDP port range (49152-65535). This pins the server to a single node, prevents rolling
updates, and wastes cluster flexibility.

## Context / Trigger Conditions
- Deploying a TURN/STUN server for WebRTC applications on Kubernetes
- Want the TURN pod to be schedulable on any node
- Need to avoid hostNetwork for better availability and scheduling

## Solution

### Key insight: Narrow the relay port range
A home lab with ~20 concurrent WebRTC viewers needs ~40 relay ports (2 per viewer).
Use roughly 100 ports (49152-49252, which is 101 ports inclusive) instead of the
default ~16K. This makes it practical to expose via a K8s LoadBalancer service.
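
The sizing arithmetic can be checked with a few lines of Python. This is a sketch with hypothetical helper names; the only inputs taken from the text are the 2-ports-per-viewer rule and the 49152-49252 range.

```python
def relay_ports_needed(viewers: int, ports_per_viewer: int = 2) -> int:
    """Relay ports required for a given number of concurrent viewers."""
    return viewers * ports_per_viewer


def port_range_size(min_port: int, max_port: int) -> int:
    """Number of ports in an inclusive min-port..max-port range."""
    return max_port - min_port + 1


def max_viewers(min_port: int, max_port: int, ports_per_viewer: int = 2) -> int:
    """Concurrent viewers that fit in the range."""
    return port_range_size(min_port, max_port) // ports_per_viewer


if __name__ == "__main__":
    print(relay_ports_needed(20))       # ports for ~20 viewers
    print(port_range_size(49152, 49252))
    print(max_viewers(49152, 49252))
```

Because coturn's `min-port` and `max-port` are both inclusive, the Service must expose every port in that inclusive range, which is why the Terraform `range()` call below ends at 49253.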

### Terraform module structure

```hcl
locals {
  turn_port = 3478
  min_port  = 49152
  max_port  = 49252 # 101 ports inclusive; enough for ~50 concurrent streams
}

resource "kubernetes_deployment" "coturn" {
  spec {
    # No hostNetwork, no nodeSelector: runs anywhere
    template {
      spec {
        container {
          image = "coturn/coturn:latest"
          args  = ["-c", "/etc/turnserver/turnserver.conf"]
          port {
            container_port = 3478
            protocol       = "UDP"
          }
        }
      }
    }
  }
}

resource "kubernetes_service" "coturn" {
  metadata {
    annotations = {
      # Share an existing MetalLB IP to avoid consuming a new one
      "metallb.universe.tf/loadBalancerIPs" = "10.0.20.200"
      "metallb.universe.tf/allow-shared-ip" = "shared"
    }
  }
  spec {
    type = "LoadBalancer"
    # Signaling port
    port {
      name     = "turn-udp"
      port     = 3478
      protocol = "UDP"
    }
    # Relay ports: the dynamic block generates 101 port definitions
    # (49152-49252; the end of range() is exclusive)
    dynamic "port" {
      for_each = range(49152, 49253)
      content {
        name        = "relay-${port.value}"
        port        = port.value
        target_port = port.value
        protocol    = "UDP"
      }
    }
  }
}
```

### coturn config (turnserver.conf)

```
listening-port=3478
fingerprint
lt-cred-mech
use-auth-secret
static-auth-secret=YOUR_SECRET_HERE
realm=yourdomain.com
listening-ip=0.0.0.0
min-port=49152
max-port=49252
no-multicast-peers
no-cli
```

### MetalLB IP sharing
To reuse an existing MetalLB IP (e.g., the WireGuard/Shadowsocks shared IP):
1. Add `metallb.universe.tf/allow-shared-ip: shared` to the coturn service
2. The same annotation must exist on all other services sharing that IP
3. **Port conflicts are not allowed**: verify no other service uses 3478 or 49152-49252
4. After changing the IP annotation, **delete and recreate** the service, because MetalLB won't reassign IPs on annotation changes alone

### Ephemeral TURN credentials
coturn's `use-auth-secret` mode generates time-limited credentials via HMAC-SHA1:

```javascript
const crypto = require('crypto');
const TURN_SECRET = 'your-shared-secret';

function getTurnCredentials(name = 'user', ttl = 86400) {
  const timestamp = Math.floor(Date.now() / 1000) + ttl;
  const username = `${timestamp}:${name}`;
  const credential = crypto.createHmac('sha1', TURN_SECRET)
    .update(username).digest('base64');
  return { username, credential };
}
```
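
For backends written in Python, the same scheme can be sketched with the standard library, together with the check a server would perform. The function names are hypothetical, and the validation logic (recompute the HMAC, reject expired timestamps) is an assumption based on the username format above rather than a copy of coturn's internals.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def _sign(secret: str, username: str) -> str:
    """Base64-encoded HMAC-SHA1 of the username, keyed with the shared secret."""
    digest = hmac.new(secret.encode(), username.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


def make_turn_credentials(secret: str, name: str = "user", ttl: int = 86400,
                          now: Optional[int] = None) -> tuple[str, str]:
    """Return (username, credential); the username embeds the expiry timestamp."""
    current = int(time.time()) if now is None else now
    username = f"{current + ttl}:{name}"
    return username, _sign(secret, username)


def credentials_valid(secret: str, username: str, credential: str,
                      now: Optional[int] = None) -> bool:
    """Assumed server-side check: signature matches and expiry has not passed."""
    current = int(time.time()) if now is None else now
    try:
        expiry = int(username.split(":", 1)[0])
    except ValueError:
        return False
    return current <= expiry and hmac.compare_digest(_sign(secret, username), credential)
```

The `now` parameter exists only to make the functions deterministic in tests; in normal use it is left unset and the system clock is used.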

## Verification

```bash
# STUN binding request (raw UDP probe)
echo -ne '\x00\x01\x00\x00\x21\x12\xa4\x42\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' \
  | nc -u -w2 <METALLB_IP> 3478 | xxd | head -3
# Response starting with 0101 = successful STUN binding response
```
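
The 20 bytes sent by the probe are a STUN header: message type `0x0001` (Binding Request), length 0, the magic cookie `0x2112A442`, and a 12-byte transaction ID. A Python sketch of building the request and recognising a success response is shown below; the helper names are hypothetical and the response check looks only at the header.

```python
import os
import struct
from typing import Optional

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101


def build_binding_request(transaction_id: Optional[bytes] = None) -> bytes:
    """Build a 20-byte STUN Binding Request with no attributes."""
    txid = os.urandom(12) if transaction_id is None else transaction_id
    if len(txid) != 12:
        raise ValueError("transaction ID must be exactly 12 bytes")
    # Header: type (2 bytes), length (2), magic cookie (4), transaction ID (12)
    return struct.pack("!HHI12s", BINDING_REQUEST, 0, MAGIC_COOKIE, txid)


def is_binding_success(response: bytes) -> bool:
    """True if the response header is a Binding Success with the magic cookie."""
    if len(response) < 20:
        return False
    msg_type, _length, cookie = struct.unpack("!HHI", response[:8])
    return msg_type == BINDING_SUCCESS and cookie == MAGIC_COOKIE
```

Sending the request over a UDP socket to `<METALLB_IP>:3478` and passing the reply to `is_binding_success` gives the same signal as looking for `0101` in the `xxd` output.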

## Notes
- The 49152-49252 relay range (101 ports) supports ~50 concurrent streams (2 ports per stream)
- If you need more, increase `max_port` and add more ports to the service
- coturn auto-detects the pod IP, so there is no need to set `relay-ip` or `external-ip` explicitly
- For public access, add NAT port forwards on pfSense for UDP 3478 + 49152-49252
- See also: `pfsense-nat-rule-creation` skill for adding the port forwards
@ -0,0 +1,99 @@
---
name: crowdsec-agent-registration-failure
description: |
  Fix CrowdSec agent pods stuck in CrashLoopBackOff after LAPI restart due to stale
  machine registrations. Use when: (1) CrowdSec agent init container fails with
  "user already exist" error during cscli lapi register, (2) agent pods show hundreds
  of init container restarts, (3) LAPI was restarted or redeployed but agents kept
  running with old credentials, (4) cscli machines list shows stale entries for
  current agent pod names. Covers deleting stale registrations to allow re-registration.
author: Claude Code
version: 1.0.0
date: 2026-02-15
---

# CrowdSec Agent Registration Failure

## Problem
After a CrowdSec LAPI restart or redeployment, agent DaemonSet pods lose their
credentials but LAPI retains the old machine registrations. When agents try to
re-register with the same pod name, the `wait-for-lapi-and-register` init container
fails with `user already exist`, causing CrashLoopBackOff with hundreds of restarts.

## Context / Trigger Conditions
- Agent init container logs show: `Error: cscli lapi register: api client register: api register ... user 'crowdsec-agent-xxxxx': user already exist`
- Agent pods show status `CrashLoopBackOff` or `Init:CrashLoopBackOff` with many restarts
- `kubectl describe pod` shows `BackOff restarting failed container wait-for-lapi-and-register`
- LAPI pods were recently restarted or redeployed
- `cscli machines list` on LAPI shows entries matching the stuck agent pod names

## Solution

### Step 1: Identify stuck agents
```bash
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec
```
Note the pod names that are in CrashLoopBackOff (e.g., `crowdsec-agent-jr5q7`).

### Step 2: Confirm the init container error
```bash
kubectl --kubeconfig $(pwd)/config logs -n crowdsec <agent-pod> -c wait-for-lapi-and-register --tail=5
```
Should show `user already exist` error.

### Step 3: Find a running LAPI pod
```bash
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec | grep lapi
```

### Step 4: Delete stale machine registrations from LAPI
```bash
kubectl --kubeconfig $(pwd)/config exec -n crowdsec <lapi-pod> -- cscli machines delete <agent-pod-name>
```
Repeat for each stuck agent.
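
When several agents are stuck, Steps 1 and 4 can be combined by parsing the `kubectl get pods` output and generating one delete command per stuck agent. The sketch below only builds the commands and does not run them; the function names are hypothetical, and the parsing assumes the default column layout (NAME, READY, STATUS, ...).

```python
def stuck_agents(get_pods_output: str) -> list[str]:
    """Return agent pod names whose STATUS column mentions CrashLoopBackOff.

    Matches both "CrashLoopBackOff" and "Init:CrashLoopBackOff".
    """
    names = []
    for line in get_pods_output.splitlines():
        fields = line.split()
        if (
            len(fields) >= 3
            and fields[0].startswith("crowdsec-agent-")
            and "CrashLoopBackOff" in fields[2]
        ):
            names.append(fields[0])
    return names


def delete_commands(lapi_pod: str, agents: list[str], namespace: str = "crowdsec") -> list[str]:
    """Build one `cscli machines delete` command per stuck agent."""
    return [
        f"kubectl exec -n {namespace} {lapi_pod} -- cscli machines delete {agent}"
        for agent in agents
    ]
```

Reviewing the generated commands before running them is a cheap safeguard against deleting the registration of a healthy agent.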

### Step 5: Wait for agents to recover
The agents are in CrashLoopBackOff with exponential backoff (up to 5 minutes). They'll
automatically retry registration and succeed after the stale entry is deleted. This can
take up to 5 minutes per agent depending on where they are in the backoff cycle.

## Verification
```bash
# All agents should show Running status
kubectl --kubeconfig $(pwd)/config get pods -n crowdsec | grep agent
# DaemonSet should show all pods READY
kubectl --kubeconfig $(pwd)/config get ds -n crowdsec
```

## Example
```bash
# Identify stuck agents
$ kubectl get pods -n crowdsec | grep agent
crowdsec-agent-jr5q7   0/1   CrashLoopBackOff   485   3d
crowdsec-agent-jw76q   1/1   Running            8     3d
crowdsec-agent-mtgxh   0/1   CrashLoopBackOff   483   3d
crowdsec-agent-pfw2l   0/1   CrashLoopBackOff   481   3d

# Delete stale registrations
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-jr5q7
level=info msg="machine 'crowdsec-agent-jr5q7' deleted successfully"
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-mtgxh
$ kubectl exec -n crowdsec crowdsec-lapi-xxx -- cscli machines delete crowdsec-agent-pfw2l

# Wait ~5 minutes, then verify
$ kubectl get pods -n crowdsec | grep agent
crowdsec-agent-jr5q7   1/1   Running   1   3d
crowdsec-agent-jw76q   1/1   Running   8   3d
crowdsec-agent-mtgxh   1/1   Running   1   3d
crowdsec-agent-pfw2l   1/1   Running   1   3d
```

## Notes
- This is a known limitation of the CrowdSec Helm chart: the init container registration
  script is not idempotent (it doesn't handle "already exists" by deleting and re-registering).
- The `cscli machines list` output will show many historical stale entries from past
  DaemonSet rollouts. These are harmless but can be cleaned up if desired.
- This issue also causes the CrowdSec blocklist import CronJob to fail, since it selects
  agent pods alphabetically and may pick a non-running one. Fixing the agents also fixes
  the blocklist import.
- See also: `k8s-nfs-mount-troubleshooting` for other common pod startup failures.
310
.claude/skills/archived/fastapi-svelte-gpu-webui/SKILL.md
Normal file
@ -0,0 +1,310 @@
---
name: fastapi-svelte-gpu-webui
description: |
  Pattern for building web UIs for GPU-based CLI tools. Use when:
  (1) Wrapping a command-line tool with a web interface, (2) Building job queue
  systems for long-running GPU tasks, (3) Creating file upload/download workflows,
  (4) Need real-time progress updates via WebSocket, (5) Deploying to Kubernetes
  with GPU scheduling. Covers FastAPI backend, Svelte 5 frontend, NFS storage,
  and Terraform deployment.
author: Claude Code
version: 1.0.0
date: 2025-01-31
---

# FastAPI + Svelte GPU WebUI Pattern

## Problem
Many powerful tools are command-line only, making them inaccessible to non-technical
users. Building a web UI requires handling file uploads, job queuing, progress tracking,
and GPU resource scheduling.

## Context / Trigger Conditions
- You have a CLI tool that does heavy processing (ML inference, media conversion, etc.)
- Want to add a web interface for easier access
- Need to track long-running job progress
- Deploying to Kubernetes with GPU nodes
- Files need to persist across pod restarts (NFS storage)

## Solution Overview

### Directory Structure
```
project-web/
├── backend/
│   ├── main.py              # FastAPI app
│   ├── api/
│   │   ├── __init__.py
│   │   └── routes.py        # REST endpoints
│   ├── services/
│   │   ├── __init__.py
│   │   └── converter.py     # CLI wrapper + job manager
│   ├── models/
│   │   ├── __init__.py
│   │   └── schemas.py       # Pydantic models
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── App.svelte
│   │   ├── lib/
│   │   │   ├── FileUpload.svelte
│   │   │   ├── JobsList.svelte
│   │   │   └── ProgressBar.svelte
│   │   └── stores/
│   │       └── jobs.js
│   ├── package.json
│   └── vite.config.js
├── Dockerfile
└── README.md
```

### Backend: Job Manager Pattern
```python
# services/converter.py
import asyncio
import re
import uuid
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Callable, Optional

# Matches the first percentage in a line of CLI output, e.g. "42%" or "42.5 %"
PROGRESS_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%")


@dataclass
class Job:
    id: str
    filename: str
    status: str  # pending, processing, completed, failed
    progress: float
    created_at: datetime
    output_file: Optional[str] = None
    error: Optional[str] = None


class JobManager:
    def __init__(self, storage_path: str = "/mnt"):
        self.storage_path = Path(storage_path)
        self.jobs: dict[str, Job] = {}
        self.progress_callbacks: dict[str, list[Callable]] = {}

    def create_job(self, filename: str, **options) -> Job:
        job_id = str(uuid.uuid4())
        job = Job(
            id=job_id,
            filename=filename,
            status="pending",
            progress=0.0,
            created_at=datetime.now(),
            **options
        )
        self.jobs[job_id] = job
        return job

    def get_job(self, job_id: str) -> Optional[Job]:
        return self.jobs.get(job_id)

    def get_all_jobs(self) -> list[Job]:
        return list(self.jobs.values())

    def update_progress(self, job_id: str, progress: float) -> None:
        # Record progress and notify listeners (callback signature is an assumption)
        self.jobs[job_id].progress = progress
        for callback in self.progress_callbacks.get(job_id, []):
            callback(job_id, progress)

    async def run_conversion(self, job_id: str):
        job = self.jobs[job_id]
        job.status = "processing"

        input_path = self.storage_path / "uploads" / job.filename
        output_dir = self.storage_path / "outputs" / job_id
        output_dir.mkdir(parents=True, exist_ok=True)

        # Build command for CLI tool
        cmd = [
            "/path/to/cli-tool",
            str(input_path),
            "-o", str(output_dir),
            # Add other options...
        ]

        # Run with output capture for progress parsing
        process = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )

        # Parse output for progress updates
        async def read_output(stream):
            while True:
                line = await stream.readline()
                if not line:
                    break
                line_str = line.decode().strip()
                # Parse progress from CLI output
                match = PROGRESS_RE.search(line_str)
                if match:
                    # Extract and update progress
                    self.update_progress(job_id, float(match.group(1)))

        await asyncio.gather(
            read_output(process.stdout),
            read_output(process.stderr)
        )

        returncode = await process.wait()

        if returncode == 0:
            output_files = list(output_dir.glob("*.m4b"))
            if output_files:
                job.output_file = output_files[0].name
                job.status = "completed"
            else:
                # Assumption: a zero exit code without an output file is a failure
                job.status = "failed"
                job.error = "No output file produced"
        else:
            job.status = "failed"
            job.error = f"Exit code {returncode}"


job_manager = JobManager()
```

### Backend: API Routes
```python
# api/routes.py
from fastapi import APIRouter, UploadFile, File, HTTPException
from fastapi.responses import FileResponse
from pathlib import Path
import shutil
import asyncio

# job_manager and the JobCreate request model are assumed to be imported from
# the job manager module; the import path depends on your project layout.

router = APIRouter(prefix="/api")

@router.post("/upload")
async def upload_file(file: UploadFile = File(...)):
    upload_dir = Path("/mnt/uploads")
    upload_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the final path component so a crafted filename cannot escape
    # upload_dir (see the `python-filename-sanitization` skill).
    safe_name = Path(file.filename).name
    file_path = upload_dir / safe_name

    with file_path.open("wb") as buffer:
        shutil.copyfileobj(file.file, buffer)

    return {"filename": safe_name, "size": file_path.stat().st_size}

@router.post("/jobs")
async def create_job(request: JobCreate):
    # Pass any further conversion options as keyword arguments
    job = job_manager.create_job(filename=request.filename)
    # Fire-and-forget; keep a reference to the task if it must not be
    # garbage-collected or needs to be cancelled later.
    asyncio.create_task(job_manager.run_conversion(job.id))
    return job

@router.get("/jobs")
async def list_jobs():
    return job_manager.get_all_jobs()

@router.get("/jobs/{job_id}/download")
async def download_job(job_id: str):
    job = job_manager.get_job(job_id)
    if not job or job.status != "completed":
        raise HTTPException(404)
    output_path = Path("/mnt/outputs") / job_id / job.output_file
    return FileResponse(output_path, filename=job.output_file)
```
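
The upload route builds a filesystem path from a client-supplied filename, which is the situation the `python-filename-sanitization` skill in See Also is about. The sketch below shows one way such a check might look using only the standard library; `safe_upload_path` is a hypothetical helper, not code from either skill.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath


def safe_upload_path(upload_dir: Path, client_filename: str) -> Path:
    """Map a client-supplied filename to a path directly inside upload_dir."""
    # Strip directory parts in both POSIX ("a/b") and Windows ("a\\b") notation.
    name = PureWindowsPath(PurePosixPath(client_filename).name).name
    if name in ("", ".", ".."):
        raise ValueError(f"unusable filename: {client_filename!r}")
    return upload_dir / name
```

Handling both separator styles matters because the browser, not the server, decides what the filename string looks like.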

### Frontend: Svelte 5 Components
```svelte
<!-- FileUpload.svelte -->
<script>
  let { onUpload } = $props();
  let dragOver = $state(false);
  let uploading = $state(false);

  async function handleUpload(file) {
    if (!file) return;
    uploading = true;
    try {
      const formData = new FormData();
      formData.append('file', file);

      const response = await fetch('/api/upload', {
        method: 'POST',
        body: formData
      });

      if (response.ok) {
        const data = await response.json();
        onUpload(data.filename);
      }
    } finally {
      // Reset the flag even if the request throws
      uploading = false;
    }
  }
</script>

<div class="dropzone"
     class:dragover={dragOver}
     ondragover={(e) => { e.preventDefault(); dragOver = true; }}
     ondragleave={() => dragOver = false}
     ondrop={(e) => { e.preventDefault(); dragOver = false; handleUpload(e.dataTransfer.files[0]); }}>
  Drop file here
</div>
```

### Dockerfile
```dockerfile
FROM python:3.12-slim

# Install Node for frontend build
RUN apt-get update && apt-get install -y nodejs npm

# Build frontend
COPY frontend/ /app/frontend/
WORKDIR /app/frontend
RUN npm install && npm run build

# Install backend
COPY backend/ /app/backend/
WORKDIR /app/backend
RUN pip install -r requirements.txt

# Serve static files from FastAPI
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

### Terraform Deployment (GPU)
```hcl
resource "kubernetes_deployment" "myapp" {
  spec {
    template {
      spec {
        node_selector = { "gpu" : "true" }

        toleration {
          key      = "nvidia.com/gpu"
          operator = "Equal"
          value    = "true"
          effect   = "NoSchedule"
        }

        container {
          image = "myregistry/myapp@sha256:..."
          name  = "myapp"

          resources {
            limits = { "nvidia.com/gpu" = "1" }
          }

          volume_mount {
            name       = "data"
            mount_path = "/mnt"
          }
        }

        volume {
          name = "data"
          nfs {
            server = "10.0.10.15"
            path   = "/mnt/main/myapp"
          }
        }
      }
    }
  }
}
```

## Verification
1. Upload a file via the UI
2. Start a conversion job
3. Watch progress update in real-time
4. Download the completed file
5. Verify files persist across pod restarts

## Notes
- Use image digest for reliable deployments (see `k8s-docker-registry-cache-bypass` skill)
- NFS storage persists across pod restarts
- GPU node taints require matching tolerations
- Consider adding job persistence (database) for production use
- WebSocket can provide smoother progress updates than polling

## See Also
- `k8s-docker-registry-cache-bypass` - Fixing image cache issues
- `k8s-gpu-no-nvidia-devices` - GPU device troubleshooting
- `python-filename-sanitization` - Secure file handling

@ -0,0 +1,105 @@
---
name: grafana-stale-datasource-cleanup
description: |
  Fix Grafana datasource errors when a Helm chart creates a datasource that conflicts
  with provisioned ones, or when stale datasources persist in the MySQL database.
  Use when: (1) Grafana shows "dial tcp: lookup <service> no such host" for a datasource,
  (2) Grafana API returns "datasources:delete permissions needed" when trying to remove
  a datasource, (3) provisioned datasource exists but Grafana uses a stale one from
  the database, (4) Helm chart auto-creates a datasource pointing to a disabled gateway
  service (e.g., loki-gateway). Requires direct MySQL access to fix when Grafana RBAC
  blocks API operations.
author: Claude Code
version: 1.0.0
date: 2026-02-13
---

# Grafana Stale Datasource Cleanup

## Problem
Grafana uses a stale or incorrect datasource from its MySQL database instead of
the correctly provisioned one. Common when Helm charts auto-create datasources
that point to services you've disabled (e.g., Loki gateway).

## Context / Trigger Conditions
- Grafana shows error: `dial tcp: lookup loki-gateway on 10.96.0.10:53: no such host`
- A provisioned datasource (via ConfigMap sidecar) is correct but Grafana uses a
  different one stored in MySQL
- Grafana API returns `"permissions needed: datasources:delete"` or
  `"permissions needed: datasources:write"` even with admin credentials
- Dashboard references a datasource UID that points to a wrong URL

## Solution

### Step 1: Identify the stale datasource

List all datasources via API (this usually works even with RBAC):
```bash
kubectl exec -n monitoring deploy/grafana -c grafana -- \
  sh -c 'curl -s "http://localhost:3000/api/datasources" \
  -u "admin:$GF_SECURITY_ADMIN_PASSWORD"' | python3 -c \
  "import sys,json; [print(d['uid'], d['name'], d['url']) for d in json.load(sys.stdin)]"
```
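
When the list is long, the stale entry can be picked out by URL host instead of by eye. This is a rough sketch, assuming the API response is a JSON array whose objects carry the `uid`, `name`, and `url` fields used by the one-liner above; `find_stale_datasources` and the sample data are hypothetical.

```python
import json
from urllib.parse import urlparse


def find_stale_datasources(api_json: str, dead_hosts: set[str]) -> list[dict]:
    """Return datasources whose URL host is one of the known-dead service names."""
    stale = []
    for ds in json.loads(api_json):
        host = urlparse(ds.get("url", "")).hostname or ""
        # Compare the first DNS label so "loki-gateway.monitoring.svc" also matches.
        if host.split(".")[0] in dead_hosts:
            stale.append({"uid": ds["uid"], "name": ds["name"], "url": ds["url"]})
    return stale


# Made-up sample resembling the situation described in this skill.
sample = json.dumps([
    {"uid": "abc123", "name": "Loki", "url": "http://loki-gateway.monitoring.svc"},
    {"uid": "def456", "name": "Loki (provisioned)", "url": "http://loki.monitoring.svc:3100"},
])
stale = find_stale_datasources(sample, {"loki-gateway"})
print(stale)
```

The UID of each returned entry is what goes into `<STALE_UID>` in the following steps.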

### Step 2: Try API deletion first

```bash
kubectl exec -n monitoring deploy/grafana -c grafana -- \
  sh -c 'curl -s -X DELETE "http://localhost:3000/api/datasources/uid/<STALE_UID>" \
  -u "admin:$GF_SECURITY_ADMIN_PASSWORD"'
```

If this returns a permissions error, proceed to Step 3.

### Step 3: Delete directly from MySQL

When Grafana RBAC blocks API operations, go through MySQL:

```bash
# Find the Grafana MySQL password
kubectl exec -n monitoring deploy/grafana -c grafana -- \
  sh -c 'echo $GF_DATABASE_PASSWORD'

# Find the stale datasource
kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
  -e "SELECT id, uid, name, url FROM data_source;"

# Delete it
kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
  -e "DELETE FROM data_source WHERE uid='<STALE_UID>';"
```

### Step 4: Fix dashboards referencing the old UID

Dashboards store datasource UIDs in their JSON. Update via MySQL:
```bash
kubectl exec -n dbaas deploy/mysql -- mysql -u grafana -p"<PASSWORD>" grafana \
  -e "UPDATE dashboard SET data = REPLACE(data, '<OLD_UID>', '<NEW_UID>') WHERE title LIKE '%Dashboard Name%';"
```
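
The SQL `REPLACE` above substitutes every occurrence of the old UID string in the dashboard JSON. For comparison, here is a sketch of a structural rewrite that only touches datasource references. It assumes dashboards reference datasources as objects of the form `{"type": ..., "uid": ...}` under a `datasource` key, which may not hold for older dashboards that store a plain name; `swap_datasource_uid` is a hypothetical helper.

```python
import json


def swap_datasource_uid(dashboard_json: str, old_uid: str, new_uid: str) -> str:
    """Rewrite datasource references from old_uid to new_uid in dashboard JSON."""

    def walk(node):
        if isinstance(node, dict):
            ds = node.get("datasource")
            # Only rewrite {"datasource": {"uid": old_uid}} references.
            if isinstance(ds, dict) and ds.get("uid") == old_uid:
                ds["uid"] = new_uid
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    data = json.loads(dashboard_json)
    walk(data)
    return json.dumps(data)
```

Unlike a plain string replace, this leaves the dashboard's own `uid` and any panel text untouched even if they happen to contain the same characters.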

### Step 5: Refresh Grafana

Hard-refresh browser (Cmd+Shift+R). If datasource still doesn't appear:
```bash
kubectl rollout restart deploy -n monitoring grafana
```

## Verification
```bash
# Verify only correct datasources remain
kubectl exec -n monitoring deploy/grafana -c grafana -- \
  sh -c 'curl -s "http://localhost:3000/api/datasources" \
  -u "admin:$GF_SECURITY_ADMIN_PASSWORD"' | python3 -m json.tool
```

## Notes
- Grafana's sidecar auto-discovers ConfigMaps with label `grafana_datasource: "1"`
  and provisions datasources from them. These are file-provisioned and show as
  "provisioned" in the UI.
- Helm charts (e.g., Loki) may auto-create their own datasource in the Grafana
  database pointing to services like `loki-gateway`. If you disable the gateway,
  this datasource becomes stale.
- Grafana dashboards in this repo are stored in MySQL (not file-provisioned),
  so dashboard JSON files in the repo are reference copies only.
- The `GF_SECURITY_ADMIN_PASSWORD` env var is set by the Grafana Helm chart.
- See also: `loki-helm-deployment-pitfalls` for related Loki deployment issues.
253
.claude/skills/archived/helm-release-troubleshooting/SKILL.md
Normal file
@ -0,0 +1,253 @@
---
name: helm-release-troubleshooting
description: |
  Troubleshoot and fix Helm release issues managed by Terraform. Use when:
  (1) Terraform applies successfully but K8s resources don't reflect new Helm values,
  (2) New ports/volumes/containers from Helm chart values don't appear in deployed resources,
  (3) helm upgrade --reuse-values doesn't re-render templates for structural changes,
  (4) Terraform thinks Helm release is up-to-date but actual K8s resources are stale,
  (5) terraform apply fails with "another operation (install/upgrade/rollback) is in progress",
  (6) helm history shows status "pending-upgrade" or "pending-rollback",
  (7) a Helm upgrade was interrupted by network timeout, etcd timeout, or VPN drop,
  (8) helm upgrade fails with "an error occurred while finding last successful release".
  Covers force re-rendering via state removal/reimport and stuck release recovery via
  secret cleanup.
author: Claude Code
version: 1.0.0
date: 2026-02-22
---

# Helm Release Troubleshooting

## Force Re-render

### Problem
After changing Helm chart values in a Terraform `helm_release` resource, Terraform applies
successfully but the actual Kubernetes resources (Services, Deployments, etc.) don't reflect
the new values. For example, adding a new port in Helm values doesn't result in that port
appearing in the Service spec.

### Context / Trigger Conditions
- Terraform `helm_release` applies with "1 changed" but `kubectl get svc -o yaml` shows
  the old configuration
- Structural changes to Helm values (new ports, new containers, new volumes) are not
  reflected in deployed resources
- The Helm chart templates need to be fully re-rendered, not just patched
- Common with Traefik, ingress-nginx, and other charts where template logic conditionally
  includes resources based on values

### Root Cause
Terraform's `helm_release` resource uses `helm upgrade` under the hood. When values are
changed, Helm may use `--reuse-values` behavior where it merges new values into existing
ones rather than doing a full template re-render. For structural changes (like enabling
HTTP/3 which adds a new UDP port to the Service template), the templates may not be
re-rendered with the new conditional branches active.

Additionally, Terraform may see the stored Helm release state as matching the desired state
even though the actual Kubernetes resources don't reflect it, creating a state drift that
Terraform doesn't detect.

### Solution

#### Step 1: Verify the Discrepancy

Confirm that K8s resources don't match Helm values:
```bash
# Check the actual resource
kubectl get svc <service-name> -n <namespace> -o yaml

# Check what Helm thinks is deployed
helm get values <release-name> -n <namespace>
helm get manifest <release-name> -n <namespace> | grep -A10 "<expected-config>"
```

#### Step 2: Remove Helm Release from Terraform State

```bash
terraform state rm 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
```

**IMPORTANT**: This only removes from Terraform state. The actual Helm release and K8s
resources remain untouched in the cluster.

#### Step 3: Import the Helm Release Back

```bash
terraform import 'module.kubernetes_cluster.module.<service>.helm_release.<name>' '<namespace>/<release-name>'
```

For Helm releases, the import ID format is `namespace/release-name`.

#### Step 4: Force Apply with Terraform

After reimporting, run terraform apply. Terraform should now detect the drift between
the desired Helm values and the actual release state:

```bash
terraform apply -target=module.kubernetes_cluster.module.<service>
```

If Terraform still shows "no changes", you may need to taint the resource. Tainting makes
Terraform destroy and recreate the release (a Helm uninstall followed by a fresh install),
so expect downtime. On Terraform 0.15.2 and later, `terraform apply -replace=<address>` is
the non-deprecated equivalent of `terraform taint`:
```bash
terraform taint 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
terraform apply -target=module.kubernetes_cluster.module.<service>
```

#### Step 5: Manual Helm Force Upgrade (Last Resort)

If Terraform still doesn't fix it, use Helm directly as a one-time fix, then reimport:

```bash
# Get the current values file
helm get values <release-name> -n <namespace> -o yaml > /tmp/values.yaml

# Edit /tmp/values.yaml to include the correct values, or use --set flags

# Force upgrade (re-renders all templates)
helm upgrade --force <release-name> <chart> -n <namespace> -f /tmp/values.yaml

# Then reimport into Terraform
terraform state rm 'module.kubernetes_cluster.module.<service>.helm_release.<name>'
terraform import 'module.kubernetes_cluster.module.<service>.helm_release.<name>' '<namespace>/<release-name>'
terraform apply -target=module.kubernetes_cluster.module.<service>
```

**WARNING**: Direct Helm operations bypass Terraform. Always reimport into Terraform state
afterward, and use `terraform apply` to verify Terraform is back in sync.

### Verification

```bash
# Check the K8s resources now match expected configuration
kubectl get svc <service-name> -n <namespace> -o yaml
kubectl get deployment <deployment-name> -n <namespace> -o yaml

# Verify Terraform is in sync
terraform plan -target=module.kubernetes_cluster.module.<service>
# Should show "No changes" or minimal expected drift
```

### Example: Traefik HTTP/3 UDP Port Not Appearing

**Problem**: Added `http3.enabled=true` to Traefik Helm values. Terraform applied
successfully, but the Traefik Service only had TCP port 443, missing the expected
UDP port 443 (`websecure-http3`).

**Fix**:
```bash
# 1. Remove from state
terraform state rm 'module.kubernetes_cluster.module.traefik.helm_release.traefik'

# 2. Reimport
terraform import 'module.kubernetes_cluster.module.traefik.helm_release.traefik' 'traefik/traefik'

# 3. Apply (Terraform now detects the drift)
terraform apply -target=module.kubernetes_cluster.module.traefik

# 4. Verify
kubectl get svc traefik -n traefik -o yaml | grep -A3 "websecure-http3"
# Should show: port: 443, protocol: UDP
```
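
The final `grep` check can also be done structurally on `kubectl get svc traefik -n traefik -o json` output, which avoids depending on how the YAML happens to be laid out. A small sketch under that assumption; `has_port` and the sample manifest are hypothetical, though the `spec.ports[].name/port/protocol` fields are standard Service fields.

```python
import json


def has_port(service_json: str, name: str, port: int, protocol: str) -> bool:
    """Check a Service manifest (JSON) for a port entry matching all three fields."""
    service = json.loads(service_json)
    for entry in service.get("spec", {}).get("ports", []):
        # Kubernetes defaults protocol to TCP when the field is omitted.
        if (entry.get("name") == name
                and entry.get("port") == port
                and entry.get("protocol", "TCP") == protocol):
            return True
    return False


# Made-up manifest showing the state before the fix: TCP 443 only.
before_fix = json.dumps({"spec": {"ports": [
    {"name": "websecure", "port": 443, "protocol": "TCP"},
]}})
print(has_port(before_fix, "websecure-http3", 443, "UDP"))
```

Running the same check after the reimport and apply should flip the result once the UDP port has been rendered into the Service.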

### Notes

- This issue is more common with structural Helm value changes (new ports, new sidecars,
  conditional template blocks) than with simple value changes (image tags, replica counts)
- The `helm upgrade --force` flag forces resource updates through a replacement strategy
  rather than a patch, which can cause brief downtime. Use with caution on production
  ingress controllers.
- Always verify with `terraform plan` after fixing to ensure Terraform state is consistent

---

## Stuck Release Recovery

### Problem
Helm releases can get stuck in `pending-upgrade`, `pending-rollback`, or `pending-install`
states when an upgrade is interrupted (network drop, etcd timeout, resource exhaustion).
Subsequent upgrades or terraform applies fail because Helm thinks an operation is in progress.

### Context / Trigger Conditions
- `terraform apply` fails with: `another operation (install/upgrade/rollback) is in progress`
- `helm history <release> -n <namespace>` shows `pending-upgrade`, `pending-rollback`, or `pending-install`
- A previous Helm upgrade was interrupted by network timeout, VPN drop, or etcd timeout
- `helm upgrade` fails with: `an error occurred while finding last successful release`

### Solution

#### Step 1: Identify the stuck release
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -5
```

Look for revisions with status `pending-upgrade`, `pending-rollback`, or `pending-install`.

#### Step 2: Delete the stuck Helm release secrets
Each Helm revision is stored as a Kubernetes secret named `sh.helm.release.v1.<release>.v<revision>`.
Delete all stuck revisions:

```bash
# Delete specific stuck revision (e.g., revision 5)
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v5 -n <namespace>

# If multiple stuck revisions exist, delete all of them
kubectl --kubeconfig $(pwd)/config delete secret sh.helm.release.v1.<release>.v6 -n <namespace>
```
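
Steps 1 and 2 amount to mapping stuck revisions to secret names. The sketch below does that from the JSON form of the history (`helm history <release> -n <namespace> -o json`), assuming each entry carries `revision` and `status` fields; `stuck_release_secrets` and the sample data are hypothetical.

```python
import json

# The three pending states named in this section.
STUCK_STATUSES = {"pending-install", "pending-upgrade", "pending-rollback"}


def stuck_release_secrets(release: str, history_json: str) -> list[str]:
    """List the secret names backing revisions stuck in a pending state."""
    names = []
    for entry in json.loads(history_json):
        if entry.get("status") in STUCK_STATUSES:
            names.append(f"sh.helm.release.v1.{release}.v{entry['revision']}")
    return names


# Made-up history: a failed upgrade followed by an interrupted rollback.
history = json.dumps([
    {"revision": 4, "status": "deployed"},
    {"revision": 5, "status": "failed"},
    {"revision": 6, "status": "pending-rollback"},
])
print(stuck_release_secrets("nextcloud", history))
```

This deliberately lists only the pending revisions; whether to also remove a preceding `failed` revision is a judgment call that is better made by a person looking at the history.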

#### Step 3: Verify the release is clean
```bash
helm --kubeconfig $(pwd)/config history <release> -n <namespace> | tail -3
```

The latest revision should now show `deployed` status.

#### Step 4: Retry the upgrade
```bash
terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config" -auto-approve
```

### Important Notes

- **Never patch the secret labels** (e.g., changing `status: pending-rollback` to `status: failed`).
  This changes the label but not the encoded release data inside the secret, leaving Helm in an
  inconsistent state. Always delete the stuck secrets entirely.
- If the failed upgrade partially applied changes to the cluster (e.g., modified a Deployment),
  the next successful upgrade will reconcile the state.
- When VPN/network is unstable, prefer direct `helm upgrade --reuse-values --set key=value`
  over `terraform apply`, since Helm upgrades are faster than the full Terraform refresh cycle.

### Verification
After deleting stuck secrets and re-applying:
- `helm history` shows the new revision as `deployed`
- `terraform apply` completes without errors

### Example
```bash
# Helm history shows stuck state
$ helm history nextcloud -n nextcloud | tail -3
4   deployed          nextcloud-8.8.1   Upgrade complete
5   failed            nextcloud-8.8.1   Upgrade failed: etcd timeout
6   pending-rollback  nextcloud-8.8.1   Rollback to 4

# Fix: delete stuck revisions
$ kubectl delete secret sh.helm.release.v1.nextcloud.v5 sh.helm.release.v1.nextcloud.v6 -n nextcloud

# Verify clean state
$ helm history nextcloud -n nextcloud | tail -1
4   deployed          nextcloud-8.8.1   Upgrade complete

# Re-apply
$ terraform apply -target=module.kubernetes_cluster.module.nextcloud -auto-approve
```

---

## See Also

- `terraform-state-identity-mismatch` - For Terraform provider identity errors
- `traefik-http3-quic` - For enabling HTTP/3 on Traefik (common trigger for force re-render)

## References

- [Terraform helm_release Resource](https://registry.terraform.io/providers/hashicorp/helm/latest/docs/resources/release)
- [Helm Upgrade Documentation](https://helm.sh/docs/helm/helm_upgrade/)
- [Helm --force Flag](https://helm.sh/docs/helm/helm_upgrade/#options)
157
.claude/skills/archived/ingress-factory-migration/SKILL.md
Normal file
@ -0,0 +1,157 @@
---
name: ingress-factory-migration
description: |
  Migrate raw kubernetes_ingress_v1 resources to the centralized ingress_factory module.
  Use when: (1) a service defines a raw kubernetes_ingress_v1 with hand-rolled Traefik
  middleware annotations, (2) adding a new service that needs standard ingress with
  rate limiting, CrowdSec, CSP headers, rybbit analytics, or authentik auth,
  (3) refactoring existing ingresses for consistency. Covers single-path, multi-path,
  split UI/API, full_host overrides, custom rate limits, and extra middleware injection.
author: Claude Code
version: 1.0.0
date: 2026-02-10
---

# Ingress Factory Migration

## Problem
Services define raw `kubernetes_ingress_v1` resources with hand-rolled Traefik middleware
chains. This creates inconsistency - middleware chains are copy-pasted per service, making
it easy to miss security middleware (CrowdSec, rate limiting) or analytics (rybbit). The
`ingress_factory` module at `modules/kubernetes/ingress_factory/main.tf` provides a single
point of control.

## Context / Trigger Conditions
- Service has a raw `kubernetes_ingress_v1` resource instead of using `module "ingress"`
- Service has a manually defined `kubernetes_manifest` for rybbit analytics middleware
- New service needs standard ingress configuration
- Middleware chain needs to be updated across many services

## Solution

### Standard single-path ingress
Replace the raw resource with:
```hcl
module "ingress" {
  source          = "../ingress_factory"
  namespace       = kubernetes_namespace.<service>.metadata[0].name
  name            = "<service-name>"      # becomes the ingress name AND default hostname
  host            = "<subdomain>"         # optional: override hostname (if different from name)
  service_name    = "<k8s-service-name>"  # optional: defaults to name
  port            = 80                    # optional: defaults to 80
  tls_secret_name = var.tls_secret_name
  protected       = false                 # set true for authentik forward auth
}
```

### Multi-path / split UI+API
Use two module calls with different names but same host:
```hcl
module "ingress" {
  source          = "../ingress_factory"
  namespace       = kubernetes_namespace.<service>.metadata[0].name
  name            = "<service>"
  host            = "<subdomain>"
  service_name    = "<ui-service>"
  tls_secret_name = var.tls_secret_name
  rybbit_site_id  = "<id>"  # optional: adds rybbit analytics
}

module "ingress-api" {
  source          = "../ingress_factory"
  namespace       = kubernetes_namespace.<service>.metadata[0].name
  name            = "<service>-api"
  host            = "<subdomain>"  # same host as UI
  service_name    = "<api-service>"
  ingress_path    = ["/api"]
  tls_secret_name = var.tls_secret_name
  # No rybbit_site_id - API returns JSON, not HTML
}
```

### Full host override (for root domain like viktorbarzin.me)
```hcl
module "ingress" {
  source          = "../ingress_factory"
  namespace       = kubernetes_namespace.<service>.metadata[0].name
  name            = "<service>"
  service_name    = "<k8s-service>"
  full_host       = "viktorbarzin.me"  # bypasses name.root_domain construction
  tls_secret_name = var.tls_secret_name
}
```

### Custom rate limiting (e.g., immich)
```hcl
module "ingress" {
  source                  = "../ingress_factory"
  namespace               = kubernetes_namespace.<service>.metadata[0].name
  name                    = "<service>"
  skip_default_rate_limit = true
  extra_middlewares       = ["traefik-<custom>-rate-limit@kubernetescrd"]
  tls_secret_name         = var.tls_secret_name
}
```

### Key variables reference
| Variable | Default | Purpose |
|----------|---------|---------|
| `name` | required | Ingress resource name + default hostname |
| `host` | null | Override hostname prefix (name used if null) |
| `full_host` | null | Override entire hostname (bypasses root_domain) |
| `service_name` | null | K8s service name (name used if null) |
| `port` | 80 | Backend service port |
| `ingress_path` | ["/"] | URL paths to match |
| `protected` | false | Adds authentik forward auth middleware |
| `rybbit_site_id` | null | Adds rybbit analytics script injection |
| `skip_default_rate_limit` | false | Omits default rate limiter |
| `extra_middlewares` | [] | Additional middleware references to append |
| `extra_annotations` | {} | Additional ingress annotations |
| `allow_local_access_only` | false | Restricts to LAN/VPN |
| `exclude_crowdsec` | false | Skips CrowdSec middleware |
| `custom_content_security_policy` | null | Custom CSP header |
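
The defaulting rules in the table (`full_host` wins over `host`, which wins over `name`; `service_name` falls back to `name`) can be summarized in a few lines. This is an illustrative Python model of the rules as the table states them, not the module's actual HCL; `resolve_ingress` is a hypothetical name, and the `<prefix>.<root_domain>` join is an assumption based on the "bypasses name.root_domain construction" comment.

```python
from typing import Optional


def resolve_ingress(name: str, root_domain: str,
                    host: Optional[str] = None,
                    full_host: Optional[str] = None,
                    service_name: Optional[str] = None) -> dict:
    """Model the hostname and service-name defaults described in the table."""
    if full_host is not None:
        hostname = full_host  # bypasses root_domain entirely
    else:
        hostname = f"{host or name}.{root_domain}"
    return {"hostname": hostname, "service_name": service_name or name}


print(resolve_ingress("headscale-ui", "viktorbarzin.me",
                      host="headscale", service_name="headscale"))
```

The headscale call mirrors the case where the ingress name differs from the K8s service name, so both overrides are needed.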

### After migration, delete:
1. The raw `kubernetes_ingress_v1` resource
2. Any manually defined `kubernetes_manifest "rybbit_analytics"` (the factory creates this automatically when `rybbit_site_id` is set)

## Gotchas

### Duplicate module names
If the service directory has multiple `.tf` files (e.g., `main.tf` and `frame.tf`), check
for existing `module "ingress"` blocks. Module names must be unique within a directory.
Use a descriptive name like `module "ingress-immich"` instead.

### Terraform target module names with hyphens
Module names in `terraform state list` may use hyphens (e.g., `module.real-estate-crawler`).
When using `-target`, you must match the exact name including hyphens:
```bash
# Wrong - underscores:
terraform apply -target=module.kubernetes_cluster.module.real_estate_crawler

# Correct - hyphens (quote to prevent shell interpretation):
terraform apply '-target=module.kubernetes_cluster.module.real-estate-crawler'
```

### Service name defaults
The factory defaults `service_name` to `name`. If the K8s service has a different name
than the ingress, you must explicitly set `service_name`. Common case: headscale has one
K8s service named `headscale` with multiple ports, so the UI ingress needs
`service_name = "headscale"` even though `name = "headscale-ui"`.

### Servarr subdirectory source path
Services under `servarr/` need `../../ingress_factory` as the source path instead of
`../ingress_factory`.

## Verification
1. `terraform validate` - check for syntax errors
2. `terraform plan -target=module.kubernetes_cluster.module.<service>` - verify old ingress destroyed, new created
3. `kubectl get ingress -n <namespace>` - verify ingress exists with correct host/paths
4. Browse the service URL to confirm accessibility

## Notes
- Services using special protocols (gRPC, mTLS, WebSocket with custom headers) should NOT
  be migrated - keep raw `kubernetes_ingress_v1` for those
- The factory automatically includes: rate-limit, CSP headers, CrowdSec, and entrypoint=websecure
- When `rybbit_site_id` is set, the factory creates a `kubernetes_manifest` for the
  rewrite-body middleware that injects the analytics script into HTML responses

@ -0,0 +1,80 @@
---
name: iterative-plan-review-with-subagents
description: |
  Design pattern for reviewing implementation plans using parallel subagent reviewers
  with iterative refinement. Use when: (1) designing a complex infrastructure change
  that needs security + implementation review, (2) creating a migration plan with
  multiple phases, (3) any plan where missing a critical issue could cause data loss
  or security exposure. Spawns 2 reviewer agents (security + implementation), collects
  CRITICAL/IMPORTANT/NIT findings, fixes all CRITICALs, re-runs until zero CRITICALs.
  Typically converges in 2-3 iterations.
author: Claude Code
version: 1.0.0
date: 2026-03-07
---

# Iterative Plan Review with Subagents

## Problem
Complex infrastructure plans have blind spots: security issues, implementation
incompatibilities, race conditions, format mismatches. A single reviewer misses things.
Multiple reviewers with different expertise catch more.

## Context / Trigger Conditions
- Writing a migration plan (e.g., secrets management, storage migration)
- Designing a multi-phase infrastructure change
- Any plan where a missed issue = downtime, data loss, or security exposure
- User explicitly asks for plan review

## Solution

### 1. Write the plan as a markdown document
Save to `docs/plans/YYYY-MM-DD-<topic>.md`

### 2. Spawn 2 reviewer agents in parallel
```
Agent 1: Security reviewer
  - Focus: secret exposure, access control, key management, CI pipeline security
  - Classify each finding: CRITICAL / IMPORTANT / NIT

Agent 2: Implementation reviewer
  - Focus: format compatibility, race conditions, ordering, tool behavior
  - Classify each finding: CRITICAL / IMPORTANT / NIT
```

Key: give each reviewer specific focus areas and the actual source code to check against.

### 3. Consolidate and fix CRITICALs
- Merge findings from both reviewers
- Deduplicate (both often find the same issue)
- Fix ALL CRITICALs in the plan document
- Note IMPORTANTs for implementation phase

### 4. Re-run reviewers on the updated plan
- Same 2 agents, but tell them which CRITICALs were fixed
- Ask them to VERIFY fixes are correct AND find new issues
- Repeat until zero CRITICALs

### 5. Typical convergence
- v1: 5-6 CRITICALs (format issues, race conditions, missing steps)
- v2: 2-3 CRITICALs (fixes introduced new issues, missed edge cases)
- v3: 0 CRITICALs, only IMPORTANTs remaining
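
The consolidate-and-repeat logic in steps 3 and 4 can be sketched as a small data model. This is only an illustration of the bookkeeping, assuming findings are reduced to a severity plus a one-line summary; `Finding`, `consolidate`, and `needs_another_round` are hypothetical names, and real deduplication would need fuzzier matching than exact equality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    severity: str  # "CRITICAL", "IMPORTANT", or "NIT"
    summary: str


def consolidate(*reviews: list[Finding]) -> dict[str, list[Finding]]:
    """Merge findings from several reviewers, dropping exact duplicates."""
    merged: dict[str, list[Finding]] = {"CRITICAL": [], "IMPORTANT": [], "NIT": []}
    seen: set[Finding] = set()
    for review in reviews:
        for finding in review:
            if finding not in seen:
                seen.add(finding)
                merged[finding.severity].append(finding)
    return merged


def needs_another_round(merged: dict[str, list[Finding]]) -> bool:
    """The review loop stops only when no CRITICAL findings remain."""
    return len(merged["CRITICAL"]) > 0


# Made-up findings: both reviewers report the same CRITICAL.
security = [Finding("CRITICAL", "git add . commits secrets"),
            Finding("NIT", "rename key file")]
implementation = [Finding("CRITICAL", "git add . commits secrets"),
                  Finding("IMPORTANT", "document rollback")]
merged = consolidate(security, implementation)
print(len(merged["CRITICAL"]), needs_another_round(merged))
```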

## Example Findings from Real Usage (SOPS migration)

| Iteration | CRITICALs Found | Examples |
|-----------|----------------|---------|
| v1 | 6 | YAML≠HCL format, `git add .` commits secrets, no branch protection, parallel race condition |
| v2 | 3 | `SOPS_AGE_KEY_FILE` misunderstanding, `renew-tls.yml` not updated, plan leaks in PR logs |
| v3 | 0 | All verified fixed. 6 IMPORTANTs noted for implementation. |

## Verification
- Zero CRITICALs from both reviewers on the final iteration
- IMPORTANTs documented as implementation notes (not blockers)

## Notes
- Use `sonnet` model for reviewers (fast, thorough enough for review)
- Give reviewers actual source code paths to read, not just the plan
- Tell v2+ reviewers what was fixed so they verify, not re-discover
- The final review should say "ONLY report CRITICALs" to avoid noise
- This pattern cost ~$3-5 in API calls but caught issues that would have caused hours of debugging
244
.claude/skills/archived/k8s-container-image-caching/SKILL.md
Normal file
|
|
@ -0,0 +1,244 @@
|
|||
---
|
||||
name: k8s-container-image-caching
description: |
  Set up and troubleshoot container image pull-through caches in Kubernetes. Use when:
  (1) ImagePullBackOff for non-Docker-Hub images routed through a wildcard mirror,
  (2) containerd has deprecated `registry.mirrors."*"` catching all image pulls,
  (3) need to add pull-through cache for a new upstream registry,
  (4) `mirrors` cannot be set when `config_path` is provided error in containerd,
  (5) containerd 1.6.x vs 1.7.x config_path compatibility issues,
  (6) kubectl shows correct image tag but container runs old code,
  (7) local registry mirror caches stale images,
  (8) imagePullPolicy: Always doesn't force fresh pulls,
  (9) containerd config has mirror that intercepts pulls serving stale images.
  Covers multi-registry pull-through cache setup (Docker Registry v2) and cache bypass
  via image digest pinning.
author: Claude Code
version: 1.0.0
date: 2026-02-22
|
||||
---
|
||||
|
||||
# Kubernetes Container Image Caching
|
||||
|
||||
## Pull-Through Cache Setup
|
||||
|
||||
### Problem
|
||||
|
||||
Docker Registry v2 can only proxy **one upstream registry per instance**. A common
|
||||
misconfiguration is using a containerd wildcard mirror (`registry.mirrors."*"`) pointing
|
||||
to a single Docker Hub proxy, which breaks pulls from ghcr.io, quay.io, registry.k8s.io,
|
||||
and other registries -- they get routed to the Docker Hub proxy which can't serve them,
|
||||
causing `ImagePullBackOff`.
|
||||
|
||||
### Context / Trigger Conditions
|
||||
|
||||
- `ImagePullBackOff` for images from ghcr.io, quay.io, registry.k8s.io, or other non-Docker-Hub registries
|
||||
- Containerd config has deprecated `[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]`
|
||||
- Error: `failed to load plugin io.containerd.grpc.v1.cri: invalid plugin config: mirrors cannot be set when config_path is provided`
|
||||
- Need to migrate from deprecated wildcard mirrors to modern `config_path` approach
|
||||
|
||||
### Solution
|
||||
|
||||
#### 1. Run one Registry v2 container per upstream
|
||||
|
||||
Each upstream needs its own Docker Registry v2 instance on a different port:
|
||||
|
||||
| Port | Registry | Container Name |
|------|----------|----------------|
| 5000 | docker.io | registry |
| 5010 | ghcr.io | registry-ghcr |
| 5020 | quay.io | registry-quay |
| 5030 | registry.k8s.io | registry-k8s |
| 5040 | reg.kyverno.io | registry-kyverno |
|
||||
|
||||
Config for non-Docker-Hub proxies (no auth needed -- they're public):
|
||||
|
||||
```yaml
|
||||
version: 0.1
storage:
  cache:
    blobdescriptor: inmemory
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: :5000
proxy:
  remoteurl: https://ghcr.io  # change per registry
|
||||
```
|
||||
|
||||
```bash
|
||||
docker run -p 5010:5000 -d --restart always --name registry-ghcr \
|
||||
-v /etc/docker-registry/ghcr/config.yml:/etc/docker/registry/config.yml registry:2
|
||||
```
|
||||
|
||||
#### 2. Replace deprecated wildcard mirror with `config_path`
|
||||
|
||||
Instead of:
|
||||
```toml
|
||||
# DEPRECATED - breaks non-Docker-Hub registries
|
||||
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]
|
||||
endpoint = ["http://10.0.20.10:5000"]
|
||||
```
|
||||
|
||||
Use the modern `config_path` approach:
|
||||
```toml
|
||||
[plugins."io.containerd.grpc.v1.cri".registry]
|
||||
config_path = "/etc/containerd/certs.d"
|
||||
```
|
||||
|
||||
Then create per-registry `hosts.toml` files:
|
||||
```bash
|
||||
mkdir -p /etc/containerd/certs.d/docker.io
|
||||
cat > /etc/containerd/certs.d/docker.io/hosts.toml <<'EOF'
|
||||
server = "https://registry-1.docker.io"
|
||||
|
||||
[host."http://10.0.20.10:5000"]
|
||||
capabilities = ["pull", "resolve"]
|
||||
EOF
|
||||
```
|
||||
|
||||
Registries without a `hosts.toml` entry **fall through to direct pull** (no breakage).
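
Writing one `hosts.toml` per registry by hand gets repetitive, so the step above can be sketched as a small generator. This is a minimal sketch, not a tested rollout script: `write_hosts_toml` and `generate_all` are hypothetical helper names, the registry-to-port mapping is copied from the table in step 1, and the upstream `server` URLs other than Docker Hub's are assumptions (each registry's own hostname over HTTPS).

```bash
#!/usr/bin/env bash
# Write a single hosts.toml: <certs_dir>/<registry>/hosts.toml
write_hosts_toml() {
  local certs_dir=$1 registry=$2 upstream=$3 mirror=$4
  mkdir -p "$certs_dir/$registry"
  cat > "$certs_dir/$registry/hosts.toml" <<EOF
server = "$upstream"

[host."$mirror"]
  capabilities = ["pull", "resolve"]
EOF
}

# Generate hosts.toml for every mirrored registry.
# Columns: registry, assumed upstream URL, local proxy port.
generate_all() {
  local certs_dir=$1 mirror_host=$2
  local registry upstream port
  while read -r registry upstream port; do
    write_hosts_toml "$certs_dir" "$registry" "$upstream" "http://$mirror_host:$port"
  done <<'EOF'
docker.io https://registry-1.docker.io 5000
ghcr.io https://ghcr.io 5010
quay.io https://quay.io 5020
registry.k8s.io https://registry.k8s.io 5030
reg.kyverno.io https://reg.kyverno.io 5040
EOF
}
```

On a node this would be invoked as `generate_all /etc/containerd/certs.d 10.0.20.10`; pointing it at a scratch directory first makes it easy to review the generated files before touching containerd.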
|
||||
|
||||
#### 3. Critical: `config_path` and `mirrors` cannot coexist
|
||||
|
||||
Containerd will **refuse to start the CRI plugin** if both `config_path` and any
|
||||
`mirrors` entries exist in `config.toml`. You must remove ALL `mirrors` entries
|
||||
(including the `[plugins."...registry.mirrors"]` parent section) before setting
|
||||
`config_path`.
|
||||
|
||||
This is especially dangerous on containerd 1.6.x (used on older nodes like k8s-master)
|
||||
where the config format is slightly different. If unsure, either:
|
||||
- Don't use config_path on that node (skip the pull-through cache)
|
||||
- Remove the entire `mirrors` section first, then add `config_path`
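
A pre-flight check before restarting containerd can catch this conflict. The sketch below is a plain grep over `config.toml`, not a TOML parser: it assumes a conflict exists only when `config_path` has a non-empty value and some uncommented `registry.mirrors` table header is present. `check_registry_config` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Print "conflict" (and return 1) if the file sets a non-empty config_path
# while still declaring registry.mirrors tables; print "ok" otherwise.
check_registry_config() {
  local cfg=$1
  local has_path=0 has_mirrors=0
  # A non-empty config_path value; config_path = "" is treated as unset.
  grep -Eq '^[[:space:]]*config_path[[:space:]]*=[[:space:]]*"[^"]+"' "$cfg" && has_path=1
  # Any uncommented [plugins....registry.mirrors...] table header.
  grep -Eq '^[[:space:]]*\[plugins\..*registry\.mirrors' "$cfg" && has_mirrors=1
  if [ "$has_path" -eq 1 ] && [ "$has_mirrors" -eq 1 ]; then
    echo "conflict"
    return 1
  fi
  echo "ok"
}
```

Running `check_registry_config /etc/containerd/config.toml` before `systemctl restart containerd` avoids discovering the problem only after the CRI plugin has failed to load.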
|
||||
|
||||
#### 4. Static IP for registry VM
|
||||
|
||||
If the registry VM uses DHCP and gets the wrong IP, all mirrors break. Use static IP
|
||||
via cloud-init `ipconfig0 = "ip=10.0.20.10/24,gw=10.0.20.1"` instead of DHCP.
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Test each proxy responds
for port in 5000 5010 5020 5030 5040; do
  curl -s http://10.0.20.10:$port/v2/_catalog
done
|
||||
|
||||
# Test containerd can pull through cache
|
||||
crictl pull ghcr.io/some/image:tag
|
||||
|
||||
# Check containerd logs for mirror usage
|
||||
journalctl -u containerd --since "5 minutes ago" | grep -i "mirror\|registry"
|
||||
```
|
||||
|
||||
### Notes
|
||||
|
||||
- **Fallback behavior**: If the local mirror is unreachable, containerd falls through to
|
||||
direct pull from the upstream `server` URL. This provides graceful degradation.
|
||||
- **GC crontabs**: Add weekly garbage collection for each registry container, staggered
|
||||
to avoid I/O spikes.
|
||||
- **Hourly restart**: Registry v2 has known memory leak issues; hourly restart mitigates.
|
||||
- **Cache is ephemeral**: VM recreation clears the cache. Images re-cache on demand.
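
The staggered GC schedule mentioned in these notes can be generated rather than typed out. This is a sketch under assumptions: the 03:00 Sunday start and 30-minute spacing are arbitrary choices, `gc_cron_lines` is a hypothetical helper, and the `registry garbage-collect` invocation assumes the stock `registry:2` image with its config at `/etc/docker/registry/config.yml` (the path used in the `docker run` example above).

```bash
#!/usr/bin/env bash
# Print one weekly crontab line per registry container, 30 minutes apart,
# starting Sunday 03:00 (cron weekday 0).
gc_cron_lines() {
  local minute=0 hour=3 name
  for name in "$@"; do
    printf '%d %d * * 0 docker exec %s registry garbage-collect /etc/docker/registry/config.yml\n' \
      "$minute" "$hour" "$name"
    minute=$(( minute + 30 ))
    if [ "$minute" -ge 60 ]; then
      minute=0
      hour=$(( hour + 1 ))
    fi
  done
}

gc_cron_lines registry registry-ghcr registry-quay registry-k8s registry-kyverno
```

The output can be reviewed and then appended to the registry VM's crontab; each container gets its own slot so the collections do not overlap.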
|
||||
|
||||
---
|
||||
|
||||
## Cache Bypass / Stale Image Fix
|
||||
|
||||
### Problem
|
||||
Kubernetes pods keep running an old image after a new version is pushed under the same
tag (e.g., `:latest`). This happens when a local registry mirror has cached the tag and
keeps serving the stale version. `imagePullPolicy: Always` does not help, because the pull
is still resolved through the mirror.
|
||||
|
||||
### Context / Trigger Conditions
|
||||
- Pod is running but application code is outdated
|
||||
- `docker push` succeeded with new layers
|
||||
- `kubectl describe pod` shows correct image tag
|
||||
- Cluster has a local registry mirror configured (e.g., in containerd config)
|
||||
- `imagePullPolicy: Always` doesn't fix the issue
|
||||
- Nodes configured with registry mirrors at `/etc/containerd/certs.d/` or similar
|
||||
|
||||
### Solution
|
||||
|
||||
#### 1. Get the image digest after pushing
|
||||
```bash
|
||||
docker push viktorbarzin/myimage:latest
|
||||
# Output includes: latest: digest: sha256:abc123... size: 856
|
||||
```
|
||||
|
||||
#### 2. Use digest instead of tag in deployment
|
||||
```hcl
|
||||
# Terraform
container {
  # Use digest to bypass local registry cache
  image             = "docker.io/viktorbarzin/myimage@sha256:abc123..."
  image_pull_policy = "Always"
  name              = "myimage"
}
|
||||
```
|
||||
|
||||
```yaml
|
||||
# Kubernetes YAML
containers:
  - name: myimage
    image: docker.io/viktorbarzin/myimage@sha256:abc123...
    imagePullPolicy: Always
|
||||
```
|
||||
|
||||
#### 3. Apply and restart
|
||||
```bash
|
||||
terraform apply -target=module.kubernetes_cluster.module.myservice
|
||||
kubectl rollout restart deployment/myservice -n mynamespace
|
||||
```
|
||||
|
||||
### Why This Works
|
||||
- A mirror can serve a stale tag-to-manifest mapping; a digest is content-addressed, so it always identifies exactly one manifest
- When you specify a digest, the node must fetch that exact manifest
- If the mirror has not cached that digest, it has to be pulled from upstream
- Even if cached, the digest guarantees the exact image version
|
||||
|
||||
### Verification
|
||||
```bash
|
||||
# Check the pod is using the new image
|
||||
kubectl get pod -n mynamespace -o jsonpath='{.items[*].spec.containers[*].image}'
|
||||
|
||||
# Verify application behavior reflects new code
|
||||
kubectl exec -n mynamespace deploy/myservice -- <verification-command>
|
||||
```
|
||||
|
||||
### Example
|
||||
|
||||
Before (problematic):
|
||||
```hcl
|
||||
image = "docker.io/viktorbarzin/audiblez-web:latest"
|
||||
```
|
||||
|
||||
After (fixed):
|
||||
```hcl
|
||||
image = "docker.io/viktorbarzin/audiblez-web@sha256:4d0e2c839555e2229bc91a0b1273569bac88529e8b3c3cadad3c3cf9d865fa29"
|
||||
```
|
||||
|
||||
### Notes
|
||||
- You must update the digest each time you push a new image
|
||||
- Consider automating digest extraction in CI/CD pipelines
|
||||
- This is a workaround; ideally fix the registry mirror configuration
|
||||
- To find your registry mirror config: `cat /etc/containerd/config.toml` on nodes
|
||||
- Common mirror locations: `/etc/containerd/certs.d/docker.io/hosts.toml`
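
Digest extraction is easy to automate because `docker push` prints the digest on its final line, as shown in step 1. The sketch below parses that output; `extract_digest` and `pinned_ref` are hypothetical helper names, and the approach assumes the first `sha256:` value with 64 hex characters in the output is the pushed manifest digest.

```bash
#!/usr/bin/env bash
# Read `docker push` output on stdin and print the first full sha256 digest.
extract_digest() {
  grep -oE 'sha256:[0-9a-f]{64}' | head -n 1
}

# Build a digest-pinned image reference from a repository and a digest.
pinned_ref() {
  local repo=$1 digest=$2
  printf '%s@%s\n' "$repo" "$digest"
}
```

In a pipeline this might look like `digest=$(docker push viktorbarzin/myimage:latest | extract_digest)` followed by `pinned_ref docker.io/viktorbarzin/myimage "$digest"`, with the result written into the Terraform variable or manifest that holds the image reference.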
|
||||
|
||||
### Diagnosing Registry Mirror Issues
|
||||
```bash
|
||||
# On a k8s node, check containerd config
|
||||
cat /etc/containerd/config.toml | grep -A5 mirrors
|
||||
|
||||
# Check if mirror is intercepting
|
||||
crictl --debug pull docker.io/library/alpine:latest 2>&1 | grep -i mirror
|
||||
|
||||
# List cached images on node
|
||||
crictl images | grep myimage
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- [Kubernetes imagePullPolicy documentation](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy)
|
||||
- [containerd registry configuration](https://github.com/containerd/containerd/blob/main/docs/hosts.md)
|
||||
186
.claude/skills/archived/k8s-gpu-no-nvidia-devices/SKILL.md
Normal file
|
|
@ -0,0 +1,186 @@
|
|||
---
|
||||
name: k8s-gpu-no-nvidia-devices
description: |
  Fix for Kubernetes GPU pods showing "CUDA not supported" or no /dev/nvidia* devices
  despite nvidia.com/gpu resource allocation. Use when: (1) container runs but torch.cuda.is_available()
  returns False, (2) ls /dev/nvidia* shows "no matches found", (3) nvidia-smi fails inside pod
  but works on host, (4) PyTorch/TensorFlow falls back to CPU despite GPU allocation.
  Covers NVIDIA device plugin, time-slicing, and container runtime issues.
author: Claude Code
version: 1.1.0
date: 2026-03-01
|
||||
---
|
||||
|
||||
# Kubernetes GPU Pod - No NVIDIA Devices Found
|
||||
|
||||
## Problem
|
||||
|
||||
A Kubernetes pod requests GPU resources (`nvidia.com/gpu: 1`) and schedules on a GPU node,
|
||||
but inside the container there are no NVIDIA devices visible. The application falls back
|
||||
to CPU with messages like "CUDA not supported by the Torch installed!" despite running
|
||||
in a CUDA-enabled container image.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
|
||||
- Pod shows `Running` status and is on a node with `gpu=true` label
|
||||
- `kubectl describe pod` shows GPU limit/request is satisfied
|
||||
- Inside container: `ls /dev/nvidia*` returns "no matches found"
|
||||
- Inside container: `nvidia-smi` fails or command not found
|
||||
- Application logs show: "CUDA not supported", "Switching to CPU", "torch.cuda.is_available() = False"
|
||||
- On the host node: `nvidia-smi` works fine
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Verify GPU Availability
|
||||
|
||||
Check if other pods are consuming the GPU:
|
||||
|
||||
```bash
|
||||
# List all pods using GPU resources
|
||||
kubectl get pods -A -o json | jq -r '.items[] | select(any(.spec.containers[]; .resources.limits."nvidia.com/gpu" != null)) | "\(.metadata.namespace)/\(.metadata.name)"'
|
||||
|
||||
# Check NVIDIA device plugin pods
|
||||
kubectl get pods -n nvidia -l app=nvidia-device-plugin
|
||||
kubectl logs -n nvidia -l app=nvidia-device-plugin --tail=50
|
||||
```
|
||||
|
||||
### Step 2: Free GPU Resources
|
||||
|
||||
If another workload is using the GPU, unload it:
|
||||
|
||||
```bash
|
||||
# For Ollama specifically
|
||||
kubectl exec -n ollama deployment/ollama -- ollama stop <model_name>
|
||||
|
||||
# Or scale down the conflicting deployment
|
||||
kubectl scale deployment/<name> -n <namespace> --replicas=0
|
||||
```
|
||||
|
||||
### Step 3: Restart the Affected Pod
|
||||
|
||||
After freeing GPU resources, restart the pod to get fresh device allocation:
|
||||
|
||||
```bash
|
||||
kubectl rollout restart deployment/<name> -n <namespace>
|
||||
|
||||
# Or delete the pod directly
|
||||
kubectl delete pod <pod-name> -n <namespace>
|
||||
```
|
||||
|
||||
### Step 4: Verify GPU Access
|
||||
|
||||
```bash
|
||||
# Check devices are now visible (quote the glob so it expands inside the container, not in your local shell)
kubectl exec -n <namespace> deployment/<name> -- sh -c 'ls -la /dev/nvidia*'
|
||||
|
||||
# Test nvidia-smi
|
||||
kubectl exec -n <namespace> deployment/<name> -- nvidia-smi
|
||||
|
||||
# Test PyTorch CUDA
|
||||
kubectl exec -n <namespace> deployment/<name> -- python3 -c "import torch; print('CUDA:', torch.cuda.is_available())"
|
||||
```
|
||||
|
||||
## Verification
|
||||
|
||||
After restart, you should see:
|
||||
|
||||
```
|
||||
/dev/nvidia0
|
||||
/dev/nvidiactl
|
||||
/dev/nvidia-uvm
|
||||
/dev/nvidia-uvm-tools
|
||||
```
|
||||
|
||||
And `nvidia-smi` should show the GPU with your container process.
|
||||
|
||||
## Example
|
||||
|
||||
```bash
|
||||
# Problem: ebook2audiobook shows "CUDA not supported"
|
||||
$ kubectl exec -n ebook2audiobook deployment/ebook2audiobook -- ls /dev/nvidia*
|
||||
zsh:1: no matches found: /dev/nvidia*
|
||||
|
||||
# Solution: Unload Ollama model holding the GPU
|
||||
$ kubectl exec -n ollama deployment/ollama -- ollama ps
|
||||
NAME SIZE PROCESSOR
|
||||
qwen2.5:14b 10 GB 33%/67% CPU/GPU
|
||||
|
||||
$ kubectl exec -n ollama deployment/ollama -- ollama stop qwen2.5:14b
|
||||
|
||||
# Restart the affected pod
|
||||
$ kubectl rollout restart deployment/ebook2audiobook -n ebook2audiobook
|
||||
|
||||
# Verify
|
||||
$ kubectl exec -n ebook2audiobook deployment/ebook2audiobook -- nvidia-smi
|
||||
# Should now show the Tesla T4 GPU
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- **GPU Time-Slicing**: If using NVIDIA GPU time-slicing (configured in GPU Operator),
|
||||
multiple pods can share a GPU. However, device injection still requires proper timing.
|
||||
|
||||
- **Pod Scheduling Order**: Pods that start while GPU is fully allocated may not get
|
||||
devices injected even after GPU becomes available - a restart is required.
|
||||
|
||||
- **Container Runtime**: The NVIDIA Container Toolkit must be properly configured.
|
||||
Issues can arise from:
|
||||
- cgroup driver mismatch (systemd vs cgroupfs)
|
||||
- Container updates causing device loss
|
||||
- SELinux blocking device access
|
||||
|
||||
- **Image Compatibility**: The container image must have CUDA libraries matching the
|
||||
driver version. Check with `nvidia-smi` on host for driver version.
|
||||
|
||||
- **This Cluster**: Uses NVIDIA GPU Operator with time-slicing (20 replicas per GPU).
|
||||
GPU node is `k8s-node1` with Tesla T4.
|
||||
|
||||
## See Also
|
||||
|
||||
- Check GPU Operator status: `kubectl get pods -n nvidia`
|
||||
- View time-slicing config: `kubectl get configmap -n nvidia time-slicing-config -o yaml`
|
||||
|
||||
## Automatic GPU Recovery via Liveness Probe
|
||||
|
||||
To prevent GPU loss from requiring manual intervention, add a liveness probe that checks
|
||||
both GPU availability and application health. Example for Frigate (but applicable to any
|
||||
GPU workload):
|
||||
|
||||
```hcl
|
||||
# Restart pod if GPU becomes unavailable or app hangs
liveness_probe {
  exec {
    command = ["sh", "-c", "nvidia-smi > /dev/null 2>&1 && curl -sf http://localhost:<port>/health > /dev/null"]
  }
  initial_delay_seconds = 120
  period_seconds        = 60
  timeout_seconds       = 10
  failure_threshold     = 3
}
# Allow time for GPU model loading at startup
startup_probe {
  http_get {
    path = "/health"
    port = <port>
  }
  period_seconds    = 10
  failure_threshold = 30 # up to 5 minutes
}
|
||||
```
|
||||
|
||||
The liveness probe checks:
|
||||
- `nvidia-smi` — fails if GPU devices are no longer accessible (CUDA context corruption, device plugin issues)
|
||||
- `curl` health endpoint — fails if the application process is hung
|
||||
|
||||
If either fails 3 times in a row (3 minutes), Kubernetes automatically restarts the pod,
|
||||
which re-acquires the GPU device through the NVIDIA device plugin.
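
The "3 minutes" and "up to 5 minutes" figures both come from multiplying `period_seconds` by `failure_threshold`. A small sketch of that arithmetic, useful when tuning the values for a slower-loading model (`probe_window_seconds` is a hypothetical helper, and the result is approximate since it ignores per-check timeouts):

```bash
#!/usr/bin/env bash
# Seconds of consecutive failures a probe tolerates before Kubernetes acts:
# period_seconds * failure_threshold.
probe_window_seconds() {
  local period_seconds=$1 failure_threshold=$2
  echo $(( period_seconds * failure_threshold ))
}

probe_window_seconds 60 3    # liveness probe above: 180 seconds
probe_window_seconds 10 30   # startup probe above: 300 seconds
```

If model loading can exceed the startup window, raising `failure_threshold` on the startup probe extends it without slowing down detection once the pod is running.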
|
||||
|
||||
**Important**: Always pair with a `startup_probe` when using GPU workloads — model loading
|
||||
(TensorRT, ONNX, PyTorch) can take several minutes and would trip a liveness probe
|
||||
configured with a short `initial_delay_seconds`.
|
||||
|
||||
## References
|
||||
|
||||
- [NVIDIA Container Toolkit Troubleshooting](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/troubleshooting.html)
|
||||
- [Kubernetes GPU Device Plugin](https://github.com/NVIDIA/k8s-device-plugin)
|
||||
- [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html)
|
||||
113
.claude/skills/archived/k8s-hpa-scaling-storm/SKILL.md
Normal file
|
|
@ -0,0 +1,113 @@
|
|||
---
|
||||
name: k8s-hpa-scaling-storm
description: |
  Fix and prevent HPA (HorizontalPodAutoscaler) scaling storms where pods scale to
  maxReplicas uncontrollably. Use when: (1) HPA shows memory or CPU utilization at
  200%+ causing rapid scale-up, (2) dozens or hundreds of pods created by HPA in minutes,
  (3) cluster becomes unstable due to resource exhaustion from too many pods,
  (4) etcd timeouts or API server crashes from pod churn, (5) adding resource requests
  to a deployment that previously had none causes HPA to miscalculate utilization.
  Covers emergency response and prevention patterns.
author: Claude Code
version: 1.0.0
date: 2026-02-15
|
||||
---
|
||||
|
||||
# Kubernetes HPA Scaling Storm
|
||||
|
||||
## Problem
|
||||
When an HPA is configured with a memory or CPU utilization target but the underlying
|
||||
deployment has insufficient resource requests, the HPA calculates artificially high
|
||||
utilization percentages (e.g., 220% of a 256Mi request when actual usage is 570Mi).
|
||||
This causes the HPA to scale pods to maxReplicas (often 100) within minutes, exhausting
|
||||
cluster resources and potentially crashing etcd and the API server.
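
The runaway can be reproduced on paper. The HPA's documented scaling rule is `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`. Because every new pod idles at the same memory usage, utilization does not fall as replicas are added, so each evaluation multiplies the count again. A minimal sketch of that arithmetic with the figures from this incident follows; `desired_replicas` is a hypothetical helper, and a real HPA also applies a tolerance band and rate-limits scale-up through its behavior policies, so the actual climb is somewhat slower.

```bash
#!/usr/bin/env bash
# One HPA evaluation step, capped at maxReplicas.
# Args: current replicas, observed utilization %, target utilization %, maxReplicas.
desired_replicas() {
  local current=$1 util=$2 target=$3 max=$4
  # Integer ceiling of current * util / target.
  local desired=$(( (current * util + target - 1) / target ))
  if [ "$desired" -gt "$max" ]; then
    desired=$max
  fi
  echo "$desired"
}

# Utilization stays at 220% against a 50% target however many pods exist.
replicas=2
for step in 1 2 3; do
  replicas=$(desired_replicas "$replicas" 220 50 100)
  echo "evaluation $step: $replicas replicas"
done
```

Starting from 2 replicas, the three evaluations give 9, 40, and then the cap of 100, which is why the storm reaches maxReplicas within minutes.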
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- `kubectl get hpa` shows `<unknown>/70%` or very high percentages (200%+)
|
||||
- Pod count for a deployment rapidly increases to maxReplicas
|
||||
- etcd timeout errors in `kubectl` or `terraform apply`
|
||||
- API server becomes unreachable (`connection refused` or `network is unreachable`)
|
||||
- Adding resource requests to a Helm chart that previously had none
|
||||
- Memory-based HPA targets with real usage far exceeding requests
|
||||
|
||||
## Solution
|
||||
|
||||
### Emergency Response (stop the storm)
|
||||
|
||||
**Step 1: Delete the HPA immediately**
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config delete hpa <hpa-name> -n <namespace>
|
||||
```
|
||||
|
||||
**Step 2: Scale the deployment down**
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config scale deployment <name> -n <namespace> --replicas=2
|
||||
```
|
||||
|
||||
**Step 3: Wait for pods to terminate and cluster to stabilize**
|
||||
```bash
|
||||
# Watch pod count decrease (re-run until it reaches the expected count)
kubectl --kubeconfig $(pwd)/config get pods -n <namespace> -l <label> --no-headers | wc -l
|
||||
```
|
||||
|
||||
If the API server is unresponsive, wait 3-5 minutes for it to self-recover. The kubelet
|
||||
will restart static pods (etcd, kube-apiserver) automatically.
|
||||
|
||||
### Prevention
|
||||
|
||||
**Rule 1: Set resource requests to match actual usage**
|
||||
Before enabling HPA, check actual resource consumption:
|
||||
```bash
|
||||
kubectl top pods -n <namespace> -l <label>
|
||||
```
|
||||
Set requests to the baseline (idle) usage, not the minimum possible value.
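
To sanity-check a candidate request before re-enabling the HPA, compare the observed usage from `kubectl top` against it. The sketch below uses the incident's numbers (570Mi usage, 256Mi request, 50% memory target); `utilization_pct` and `min_request_for_target` are hypothetical helpers and all sizes are plain integers in Mi.

```bash
#!/usr/bin/env bash
# Utilization as the HPA sees it: usage / request, in percent (rounded down).
utilization_pct() {
  local usage_mi=$1 request_mi=$2
  echo $(( usage_mi * 100 / request_mi ))
}

# Smallest request (Mi) that keeps utilization at or below the target percent.
min_request_for_target() {
  local usage_mi=$1 target_pct=$2
  echo $(( (usage_mi * 100 + target_pct - 1) / target_pct ))
}

utilization_pct 570 256          # original request: 222
utilization_pct 570 1024         # candidate 1Gi request: 55
min_request_for_target 570 50    # request needed for a 50% target: 1140
```

Note that with these numbers a 1Gi request still reads about 55% against a 50% target, so it may be worth pairing the larger request with a higher memory target or dropping the memory metric.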
|
||||
|
||||
**Rule 2: Set reasonable maxReplicas**
|
||||
Never use maxReplicas > 10 unless you've verified the cluster can handle it.
A chart default of 100 is almost never appropriate for a home/small cluster.
|
||||
|
||||
**Rule 3: Prefer CPU-only HPA targets**
|
||||
Memory-based scaling is problematic because:
|
||||
- Memory usage grows over time and rarely decreases
|
||||
- Memory-based scaling creates pods that never scale down
|
||||
- CPU is more responsive to load changes
|
||||
|
||||
**Rule 4: Disable the HPA before changing resource requests**
|
||||
If adding resource requests to a deployment managed by HPA, temporarily disable
|
||||
the HPA first, set the requests, verify utilization is reasonable, then re-enable.
|
||||
|
||||
## Cascade Effects
|
||||
A scaling storm can cause:
|
||||
1. etcd storage exhaustion (too many pod objects)
|
||||
2. API server OOM or connection limits
|
||||
3. VPN/network connectivity loss (if VPN runs in the cluster)
|
||||
4. Kyverno webhook failures (admission controller overwhelmed)
|
||||
5. Other pods evicted or unable to schedule
|
||||
|
||||
## Verification
|
||||
- `kubectl get hpa -n <namespace>` shows reasonable utilization (< 100%)
|
||||
- Pod count is stable at expected replicas
|
||||
- `kubectl get nodes` responds promptly
|
||||
- No etcd timeout errors
|
||||
|
||||
## Example
|
||||
```bash
|
||||
# Observed: HPA scaling Collabora to 100 pods
|
||||
$ kubectl get hpa -n nextcloud
|
||||
NAME                  TARGETS                         MINPODS   MAXPODS   REPLICAS
nextcloud-collabora   cpu: 0%/70%, memory: 220%/50%   2         100       83
|
||||
|
||||
# Emergency fix
|
||||
$ kubectl delete hpa nextcloud-collabora -n nextcloud
|
||||
$ kubectl scale deployment nextcloud-collabora -n nextcloud --replicas=2
|
||||
|
||||
# Root cause: 256Mi memory request, actual usage 570Mi
|
||||
# Fix: increase request to 1Gi or disable memory target
|
||||
```
|
||||
|
||||
## Notes
|
||||
- If the HPA is managed by a Helm chart, deleting it via kubectl is only temporary: the next
  Helm upgrade will recreate it. You must also update the Helm values.
|
||||
- In this project, Collabora was ultimately disabled in favor of OnlyOffice to avoid
|
||||
the HPA issue entirely.
|
||||
- See also: `helm-stuck-release-recovery` for fixing Helm releases broken by the storm.
|
||||
235
.claude/skills/archived/k8s-nfs-mount-troubleshooting/SKILL.md
Normal file
|
|
@ -0,0 +1,235 @@
|
|||
---
|
||||
name: k8s-nfs-mount-troubleshooting
description: |
  Debug Kubernetes NFS volume mount failures. Use when: (1) Pod stuck in ContainerCreating
  for extended time, (2) kubectl describe shows "MountVolume.SetUp failed" with NFS errors,
  (3) Error message shows "Protocol not supported" or "mount.nfs: access denied",
  (4) NFS volume defined in pod spec but container won't start, (5) Container starts but
  gets "Permission denied" writing to NFS volume (non-root container UID mismatch),
  (6) CronJob or init container fails silently when writing to NFS, (7) Pod shows Running
  1/1 but service is unresponsive after a node reboot — stale NFS mount causes frozen
  processes with zero listening sockets. Common root causes are missing NFS export on the
  server, UID mismatch for non-root containers, and stale mounts after node reboots.
author: Claude Code
version: 1.2.0
date: 2026-02-28
|
||||
---
|
||||
|
||||
# Kubernetes NFS Mount Troubleshooting
|
||||
|
||||
## Problem
|
||||
Pods with NFS volumes get stuck in `ContainerCreating` state indefinitely. The error
messages from `kubectl describe pod` can be misleading, showing protocol or permission
errors when the actual issue is that the NFS export doesn't exist.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Pod status shows `ContainerCreating` for more than 1-2 minutes
|
||||
- `kubectl describe pod` shows events like:
|
||||
- `MountVolume.SetUp failed for volume "data" : mount failed: exit status 32`
|
||||
- `mount.nfs: Protocol not supported`
|
||||
- `mount.nfs: access denied by server`
|
||||
- Pod spec includes an NFS volume mount
|
||||
- Other pods on the same node work fine
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Identify the NFS path
|
||||
```bash
|
||||
kubectl describe pod -n <namespace> <pod-name> | grep -A5 "Volumes:"
|
||||
```
|
||||
Look for the NFS server and path (e.g., `10.0.10.15:/mnt/main/myservice`)
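
When the pod has many volumes, the `FailedMount` event is often the quickest place to read the server and path, since its `Mounting arguments:` line contains both. The sketch below extracts them from `kubectl describe` output; `nfs_source_from_event` is a hypothetical helper, and it assumes an IPv4 or hostname server and an export path beginning with `/`.

```bash
#!/usr/bin/env bash
# Read `kubectl describe pod` output on stdin and print "<server> <export-path>"
# for the first NFS mount attempt found in the events.
nfs_source_from_event() {
  sed -nE 's|.*Mounting arguments: .* ([^ :]+):(/[^ ]+) .*|\1 \2|p' | head -n 1
}
```

Usage: `kubectl describe pod -n <namespace> <pod-name> | nfs_source_from_event`, then check that path on the server as described in the next step.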
|
||||
|
||||
### Step 2: Verify the export exists on NFS server
|
||||
SSH to the NFS server and check:
|
||||
```bash
|
||||
ssh root@<nfs-server> "ls -la /mnt/main/myservice"
|
||||
```
|
||||
|
||||
### Step 3: If directory doesn't exist, create it
|
||||
```bash
|
||||
ssh root@<nfs-server> "mkdir -p /mnt/main/myservice && chmod 777 /mnt/main/myservice"
|
||||
```
|
||||
|
||||
### Step 4: Add to NFS exports (TrueNAS specific)
|
||||
For TrueNAS, add the path to the NFS share configuration:
|
||||
1. Add directory to `scripts/nfs_directories.txt`
|
||||
2. Run `scripts/nfs_exports.sh` to update the share via API
|
||||
|
||||
### Step 5: Restart the pod
|
||||
```bash
|
||||
kubectl delete pod -n <namespace> -l app=<app-label>
|
||||
```
|
||||
The deployment will create a new pod that should now mount successfully.
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
kubectl get pods -n <namespace>
|
||||
# Should show 1/1 Running instead of 0/1 ContainerCreating
|
||||
|
||||
kubectl exec -n <namespace> <pod-name> -- ls -la /app/data
|
||||
# Should show the mounted directory contents
|
||||
```
|
||||
|
||||
## Example
|
||||
**Symptom:**
|
||||
```
|
||||
Events:
|
||||
Warning FailedMount 55s (x13 over 11m) kubelet MountVolume.SetUp failed for volume "data" : mount failed: exit status 32
|
||||
Mounting command: mount
|
||||
Mounting arguments: -t nfs 10.0.10.15:/mnt/main/resume /var/lib/kubelet/pods/.../data
|
||||
Output: mount.nfs: Protocol not supported
|
||||
```
|
||||
|
||||
**Root Cause:** The directory `/mnt/main/resume` didn't exist on the TrueNAS server.
|
||||
|
||||
**Fix:**
|
||||
```bash
|
||||
ssh root@10.0.10.15 'mkdir -p /mnt/main/resume && chmod 777 /mnt/main/resume'
|
||||
# Then add to NFS exports and restart pod
|
||||
```
|
||||
|
||||
## Notes
|
||||
- The "Protocol not supported" error is misleading - it often means the export path doesn't exist
|
||||
- Always check the NFS server first before investigating protocol/firewall issues
|
||||
- For TrueNAS, the NFS share must be updated via API/UI after creating new directories
|
||||
- NFSv3 vs NFSv4 issues are rare in modern setups; missing paths are more common
|
||||
- Check that the NFS client packages are installed on Kubernetes nodes if this is a new cluster
|
||||
|
||||
## Variant: Non-Root Container UID Permission Denied
|
||||
|
||||
### Problem
|
||||
Container starts and mounts NFS successfully, but gets "Permission denied" when
|
||||
writing files. The pod appears healthy but operations fail silently.
|
||||
|
||||
### Trigger Conditions
|
||||
- Container logs show "Permission denied" or "client returned ERROR on write"
|
||||
- Pod is Running (not stuck in ContainerCreating)
|
||||
- NFS directory exists and is mounted, but owned by root (uid 0)
|
||||
- Container image runs as a non-root user (e.g., `curlimages/curl` runs as uid 101)
|
||||
- CronJobs or init containers that write to NFS fail with no obvious error
|
||||
|
||||
### Common Non-Root Container UIDs
|
||||
| Image | UID | User |
|-------|-----|------|
| `curlimages/curl` | 101 | curl_user |
| `nginx` (unprivileged) | 101 | nginx |
| `node` | 1000 | node |
| `python` (slim) | 0 | root (safe) |
| `grafana/grafana` | 472 | grafana |
|
||||
|
||||
### Solution
|
||||
Fix permissions on the NFS server:
|
||||
```bash
|
||||
# Option 1: World-writable (simplest, suitable for non-sensitive data)
|
||||
ssh root@10.0.10.15 "chmod -R 777 /mnt/main/<service>/<subdir>"
|
||||
|
||||
# Option 2: Match container UID (more secure)
|
||||
ssh root@10.0.10.15 "chown -R <uid>:<gid> /mnt/main/<service>/<subdir>"
|
||||
|
||||
# Option 3: Run the container as root via securityContext in the pod spec
# (YAML for the manifest, not a shell command):
#   spec:
#     securityContext:
#       runAsUser: 0
|
||||
```
|
||||
|
||||
### Debugging
|
||||
```bash
|
||||
# Check what UID the container runs as
|
||||
kubectl exec -n <namespace> <pod> -- id
|
||||
|
||||
# Test write access from inside container
|
||||
kubectl exec -n <namespace> <pod> -- sh -c 'echo test > /path/to/nfs/testfile'
|
||||
|
||||
# Check NFS directory ownership on server
|
||||
ssh root@10.0.10.15 "ls -la /mnt/main/<service>/"
|
||||
```
|
||||
|
||||
## Variant: Stale NFS Mounts After Node Reboot (Ghost Running Pods)
|
||||
|
||||
### Problem
|
||||
After a node reboot (e.g., from kured rolling kernel updates), pods are rescheduled and
|
||||
show `Running 1/1` status, but the application process is frozen/hung. The service is
|
||||
completely unresponsive despite appearing healthy to Kubernetes.
|
||||
|
||||
### Trigger Conditions
|
||||
- Node was recently rebooted (check `kubectl get nodes` for age, or kured logs)
|
||||
- Pod shows `Running 1/1` with 0 restarts (looks perfectly healthy)
|
||||
- Service is unresponsive — Uptime Kuma or curl shows timeout/connection refused
|
||||
- `kubectl exec <pod> -- ss -tlnp` shows **zero listening sockets** (the process started but is hung)
|
||||
- Pod uses NFS volumes (inline `nfs {}` or PVC backed by NFS)
|
||||
- Multiple pods across different namespaces all exhibit the same symptom simultaneously
|
||||
- `kubectl describe pod` shows no warnings or errors — everything looks normal
|
||||
|
||||
### Root Cause
|
||||
When a node reboots, the NFS client mounts go stale. If the pod is rescheduled to the
|
||||
same or different node before NFS fully recovers, the application process starts but
|
||||
immediately hangs when it tries to access the NFS-mounted filesystem. The process is
|
||||
stuck in an uninterruptible I/O wait (D state) but Kubernetes sees the container as
|
||||
running because the PID exists and liveness probes (if any) may not exercise the NFS path.
|
||||
|
||||
### Solution
|
||||
Force-delete the affected pods to trigger a clean reschedule with fresh NFS mounts:
|
||||
|
||||
```bash
|
||||
# Identify hung pods — Running but no listening sockets
|
||||
kubectl exec -n <namespace> <pod> -- ss -tlnp 2>/dev/null
|
||||
# If output is empty or shows no expected ports, the pod is hung
|
||||
|
||||
# Force-delete to skip graceful shutdown (hung process won't respond to SIGTERM)
|
||||
kubectl delete pod -n <namespace> <pod> --force --grace-period=0
|
||||
|
||||
# The deployment controller creates a new pod with fresh NFS mounts
|
||||
kubectl get pods -n <namespace> -w
|
||||
```
|
||||
|
||||
For bulk remediation after a cluster-wide event:
|
||||
```bash
# Find all pods with NFS volumes that might be hung
# Check each service's expected port; if ss -tlnp shows nothing, force-delete
for ns in calibre stirling-pdf send speedtest n8n paperless-ngx; do
  # Only the first pod in each namespace is checked
  pod=$(kubectl get pod -n "$ns" -o name | head -1)
  [ -z "$pod" ] && continue
  # ss prints a header line, so a count of 1 or less means no listening sockets
  sockets=$(kubectl exec -n "$ns" "$pod" -- ss -tlnp 2>/dev/null | wc -l)
  if [ "$sockets" -le 1 ]; then
    echo "HUNG: $ns/$pod (no listening sockets)"
    kubectl delete "$pod" -n "$ns" --force --grace-period=0
  fi
done
```
|
||||
|
||||
### Verification
|
||||
```bash
|
||||
# New pod should have listening sockets
|
||||
kubectl exec -n <namespace> <new-pod> -- ss -tlnp
|
||||
# Should show the application's expected port (e.g., *:8080)
|
||||
|
||||
# Service should respond
|
||||
kubectl exec -n <namespace> <new-pod> -- curl -sI http://localhost:<port>/
|
||||
# Should return HTTP response
|
||||
```
|
||||
|
||||
### Key Diagnostic Insight
|
||||
The critical signal is **Running 1/1 but zero listening sockets**. Normal healthy pods
|
||||
always have at least one listening socket for their application port. If `ss -tlnp`
|
||||
returns nothing, the process is hung on a stale NFS mount, not crashed — that's why
|
||||
Kubernetes thinks it's fine.
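
The zero-listening-sockets signal can also be checked in a script. The following is a minimal Python sketch, assuming the usual iproute2 layout in which every data row of `ss -tlnp` starts with the state `LISTEN`; the function names `count_listening_sockets` and `looks_hung` are hypothetical helpers, not part of any tool used here.

```python
def count_listening_sockets(ss_output: str) -> int:
    """Count LISTEN rows in `ss -tlnp` output, ignoring the header line."""
    count = 0
    for line in ss_output.splitlines():
        fields = line.split()
        # Data rows begin with the socket state; the header begins with "State".
        if fields and fields[0] == "LISTEN":
            count += 1
    return count


def looks_hung(ss_output: str) -> bool:
    """A Running pod with zero listening sockets matches the hang signature."""
    return count_listening_sockets(ss_output) == 0
```

Pass it the captured stdout of `kubectl exec -n <namespace> <pod> -- ss -tlnp`. Empty output (for example, when the exec itself fails) also counts as hung, which mirrors the `wc -l` check in the bulk loop.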
|
||||
|
||||
### Prevention
|
||||
- Add **liveness probes** that hit the application's HTTP endpoint (not just TCP connect):
|
||||
```hcl
liveness_probe {
  http_get {
    path = "/"
    port = 8080
  }
  initial_delay_seconds = 60
  period_seconds        = 30
  timeout_seconds       = 5
}
```
|
||||
- This ensures Kubernetes detects hung pods and restarts them automatically.
|
||||
|
||||
## See Also
|
||||
- **nfsv4-idmapd-uid-mapping** — All UIDs show as 65534 (nobody) inside containers. Different from permission denied; the UIDs are wrong, not the permissions.
|
||||
- TrueNAS NFS configuration documentation
|
||||
- Kubernetes NFS volume documentation
|
||||
- k8s-limitrange-oom-silent-kill (for OOM issues often confused with NFS hangs)
|
||||
|
|
@ -0,0 +1,109 @@
|
|||
---
|
||||
name: kubelet-static-pod-manifest-update
|
||||
description: |
|
||||
Force kubelet to pick up changes to static pod manifests in /etc/kubernetes/manifests/.
|
||||
Use when: (1) edited kube-apiserver.yaml but the running process still has old flags,
|
||||
(2) kubelet restart doesn't pick up manifest changes, (3) touching the manifest file
|
||||
doesn't trigger pod recreation, (4) killing the API server process results in the
|
||||
same old args on restart, (5) the pod's config.hash annotation doesn't match the
|
||||
file's hash. Requires a full cycle: remove manifest, stop kubelet, remove containers,
|
||||
re-add manifest, start kubelet.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-17
|
||||
---
|
||||
|
||||
# Kubelet Static Pod Manifest Update
|
||||
|
||||
## Problem
|
||||
After editing a static pod manifest (e.g., `/etc/kubernetes/manifests/kube-apiserver.yaml`
|
||||
to add OIDC or audit flags), kubelet continues running the pod with the old configuration.
|
||||
Standard approaches like `touch`, `systemctl restart kubelet`, or `kubectl delete pod`
|
||||
do not force kubelet to reconcile the new manifest.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Edited `/etc/kubernetes/manifests/kube-apiserver.yaml` (or other static pod manifests)
|
||||
- The running process (`ps aux | grep kube-apiserver`) shows old flags
|
||||
- `kubectl get pod -n kube-system kube-apiserver-* -o jsonpath='{.metadata.annotations.kubernetes\.io/config\.hash}'` returns a stale hash
|
||||
- Any of these actions failed to apply the changes:
|
||||
- `touch /etc/kubernetes/manifests/kube-apiserver.yaml`
|
||||
- `systemctl restart kubelet`
|
||||
- `kubectl delete pod kube-apiserver-*`
|
||||
- Killing the API server process directly
|
||||
|
||||
## Root Cause
|
||||
Kubelet maintains an internal cache of static pod specs keyed by a hash of the manifest.
|
||||
When the manifest changes, kubelet should detect the new hash and recreate the pod.
|
||||
However, in practice (observed on Kubernetes 1.34.x), kubelet can get stuck with the
|
||||
old hash if:
|
||||
- The pod's mirror object in the API server still exists with the old hash
|
||||
- Kubelet's internal pod cache wasn't cleared between restarts
|
||||
- The container runtime (containerd) still has the old container running
|
||||
|
||||
## Solution
|
||||
|
||||
Full restart cycle on the master node:
|
||||
|
||||
```bash
|
||||
# 1. Back up the edited manifest (this copy is restored in step 6)
|
||||
sudo cp /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/kube-apiserver.yaml.bak
|
||||
|
||||
# 2. Remove the manifest (kubelet will stop the pod)
|
||||
sudo rm /etc/kubernetes/manifests/kube-apiserver.yaml
|
||||
|
||||
# 3. Stop kubelet
|
||||
sudo systemctl stop kubelet
|
||||
|
||||
# 4. Wait for the API server container to stop
|
||||
sleep 5
|
||||
|
||||
# 5. Force-remove any remaining API server containers
|
||||
sudo crictl rm -f $(sudo crictl ps -aq --name kube-apiserver 2>/dev/null) 2>/dev/null
|
||||
|
||||
# 6. Re-add the manifest (with your changes)
|
||||
sudo cp /tmp/kube-apiserver.yaml.bak /etc/kubernetes/manifests/kube-apiserver.yaml
|
||||
|
||||
# 7. Start kubelet
|
||||
sudo systemctl start kubelet
|
||||
|
||||
# 8. Wait for API server to come up (30-60 seconds)
|
||||
sleep 45
|
||||
|
||||
# 9. Verify new flags are active
|
||||
sudo cat /proc/$(pgrep -f 'kube-apiserver --' | head -1)/cmdline | tr '\0' '\n' | grep 'your-new-flag'
|
||||
```
|
||||
|
||||
**Critical:** The order matters. Removing the manifest BEFORE stopping kubelet ensures
kubelet processes the removal. Clearing the containers then ensures no stale state remains.
Finally, re-adding the manifest before starting kubelet means kubelet builds the pod from
the new spec on startup.
|
||||
|
||||
## What Does NOT Work
|
||||
|
||||
| Approach | Why it fails |
|----------|-------------|
| `touch manifest.yaml` | Kubelet may not detect mtime-only changes |
| `systemctl restart kubelet` | Kubelet reuses cached pod spec if hash matches |
| `kubectl delete pod` | Deletes mirror pod but kubelet recreates from cached spec |
| `kill <apiserver-pid>` | Container runtime restarts the same container with old args |
| Moving manifest away and back without stopping kubelet | Kubelet may cache the old spec in memory |
|
||||
|
||||
## Verification
|
||||
|
||||
```bash
|
||||
# Check the running process has new flags
|
||||
ps aux | grep kube-apiserver | grep -v grep | grep 'your-new-flag'
|
||||
|
||||
# Check the config hash changed
|
||||
kubectl get pod -n kube-system kube-apiserver-$(hostname) \
|
||||
-o jsonpath='{.metadata.annotations.kubernetes\.io/config\.hash}'
|
||||
|
||||
# Check API server logs for successful startup
|
||||
kubectl logs -n kube-system kube-apiserver-$(hostname) | tail -5
|
||||
```
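
The `/proc/<pid>/cmdline` check from step 9 can also be scripted. This is a small Python sketch under the assumption that the file holds NUL-separated arguments (standard Linux procfs behavior); `parse_cmdline` and `has_flag` are hypothetical helper names, and the flag values in the usage comment are placeholders.

```python
def parse_cmdline(raw: bytes) -> list[str]:
    """Split the NUL-separated contents of /proc/<pid>/cmdline into arguments."""
    return [arg.decode() for arg in raw.split(b"\0") if arg]


def has_flag(args: list[str], flag: str) -> bool:
    """True if `flag` appears either bare (`--flag`) or as `--flag=value`."""
    return any(a == flag or a.startswith(flag + "=") for a in args)


# Usage on the master node (run as root; <pid> comes from pgrep as in step 9):
#   with open("/proc/<pid>/cmdline", "rb") as f:
#       print(has_flag(parse_cmdline(f.read()), "--oidc-issuer-url"))
```

Matching on `--flag` or `--flag=` avoids false positives from flags that merely share a prefix, which a plain `grep` would accept.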
|
||||
|
||||
## Notes
|
||||
- This applies to ALL static pods, not just kube-apiserver (etcd, controller-manager, scheduler)
|
||||
- The cluster will be briefly unavailable during the restart (30-60 seconds)
|
||||
- On single-master clusters, kubectl commands will fail during the restart — use `sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf` from the master
|
||||
- Always validate the YAML before removing the manifest: `python3 -c "import yaml; yaml.safe_load(open('/etc/kubernetes/manifests/kube-apiserver.yaml'))"`
|
||||
- See also: `authentik-oidc-kubernetes` skill for the full OIDC setup context
|
||||
143
.claude/skills/archived/local-llm-gpu-selection/SKILL.md
Normal file
|
|
@ -0,0 +1,143 @@
|
|||
---
|
||||
name: local-llm-gpu-selection
|
||||
description: |
|
||||
Guide for selecting GPUs and hardware for local LLM inference on Dell R730 and
|
||||
comparing to Apple Silicon alternatives. Use when: (1) user asks about running
|
||||
local models (Ollama, llama.cpp), (2) user asks which GPU to buy for LLMs,
|
||||
(3) user wants to compare local models to Claude for coding, (4) user asks about
|
||||
quantized model selection, (5) user asks about Mac Mini/Studio vs GPU server for
|
||||
LLMs. Covers VRAM requirements, memory bandwidth as key metric, R730 GPU compatibility,
|
||||
multi-GPU considerations, and realistic quality comparisons to Claude models.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2025-06-11
|
||||
---
|
||||
|
||||
# Local LLM GPU Selection & Performance Guide
|
||||
|
||||
## Problem
|
||||
Choosing the right hardware for local LLM inference requires understanding the
|
||||
relationship between VRAM capacity, memory bandwidth, GPU compatibility with
|
||||
server chassis, and realistic model quality expectations.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- User asks about running quantized models locally (Ollama, llama.cpp)
|
||||
- User wants to know which GPU fits their server (Dell R730 or similar 2U)
|
||||
- User asks about Apple Silicon (Mac Mini/Studio) vs datacenter GPUs for LLMs
|
||||
- User wants to compare local model quality to Claude (Opus/Sonnet/Haiku) for coding
|
||||
|
||||
## Key Principle: Memory Bandwidth Is Everything
|
||||
|
||||
LLM token generation is **memory-bandwidth bound**, not compute bound. The formula:
|
||||
```
|
||||
approx tokens/sec = memory_bandwidth_GB_s / model_size_GB
|
||||
```
|
||||
This is why Apple Silicon (high bandwidth unified memory) competes with datacenter GPUs
|
||||
despite having less raw compute.
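
As a worked example of the formula, here is a rough Python sketch. The function name is hypothetical, and the result is an upper-bound estimate only: real throughput is lower once compute, KV cache reads, and framework overhead are included.

```python
def approx_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Estimate generation speed as memory bandwidth divided by model size."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Tesla T4 (320 GB/s) with a ~5 GB 8B Q4_K_M model
print(approx_tokens_per_sec(320, 5))   # 64.0
# Mac Mini M4 Pro (273 GB/s) with a ~20 GB 32B Q4_K_M model
print(approx_tokens_per_sec(273, 20))
```

The bandwidth and size figures are the ones quoted elsewhere in this guide; substitute your own hardware and model sizes.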
|
||||
|
||||
## VRAM Requirements by Model Size
|
||||
|
||||
| Model Size | Quant | VRAM Needed | Examples |
|------------|-------|-------------|----------|
| 7-8B | Q4_K_M | ~5 GB | Llama 3.1 8B, Mistral 7B |
| 7-8B | Q8_0 | ~8 GB | |
| 13-14B | Q4_K_M | ~8 GB | Qwen 2.5 Coder 14B |
| 22-24B | Q4_K_M | ~13-14 GB | Mistral Small, Codestral |
| 32B | Q4_K_M | ~20 GB | Qwen 2.5 Coder 32B |
| 32B | Q8_0 | ~34 GB | |
| 70B | Q4_K_M | ~40 GB | Llama 3.1 70B |
| 70B | Q8_0 | ~70 GB | |
|
||||
|
||||
Add ~1-2 GB overhead for KV cache and context. Longer conversations use more.
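
A quick fit check combining the table with the overhead rule might look like the sketch below. The helper name and the default 2 GB overhead (the upper end of the range above) are assumptions for illustration; long contexts can need more.

```python
def fits_in_vram(model_vram_gb: float, gpu_vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if the model plus KV-cache/context overhead fits in GPU memory."""
    return model_vram_gb + overhead_gb <= gpu_vram_gb


# 32B Q4_K_M (~20 GB) on a 24 GB card vs. a 16 GB card
print(fits_in_vram(20, 24))  # True
print(fits_in_vram(20, 16))  # False
```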
|
||||
|
||||
## Dell R730 GPU Compatibility
|
||||
|
||||
### Constraints
|
||||
- **2U chassis**: Full-height cards fit, but limited to dual-slot width
|
||||
- **PCIe 3.0 x16 slots**: 2-3 usable slots depending on riser configuration
|
||||
- **Power**: Needs Dell GPU power cable (P/N 0D4J0T) for GPUs >75W TDP
|
||||
- **PSU**: Check wattage headroom (dual 750W or 1100W typical)
|
||||
|
||||
### Compatible GPUs
|
||||
|
||||
**No external power needed (<=75W):**
|
||||
- Tesla T4: 16 GB, 320 GB/s, 70W — best drop-in option
|
||||
- Tesla P4: 8 GB, 192 GB/s, 75W — too little VRAM for modern LLMs
|
||||
- NVIDIA L4: 24 GB, 300 GB/s, 72W — T4 successor, Ada Lovelace, expensive
|
||||
- NVIDIA A2: 16 GB, 200 GB/s, 60W — worse than T4 in every way, avoid
|
||||
|
||||
**Requires power cable (>75W):**
|
||||
- Tesla P40: 24 GB, 346 GB/s, 250W — best value per GB
|
||||
- Tesla V100 PCIe: 32 GB, 900 GB/s, 250W — excellent bandwidth
|
||||
- Tesla P100 PCIe: 16 GB, 732 GB/s, 250W — same VRAM as T4, not worth it
|
||||
|
||||
**Won't fit or impractical:**
- RTX 3090/4090: Too thick (3-slot), too long
- A100: Fits physically but very expensive
- Any consumer RTX: Generally too large for 2U
|
||||
|
||||
### Multi-GPU Considerations
|
||||
- Ollama splits model layers across GPUs automatically
|
||||
- PCIe 3.0 cross-GPU transfer adds ~30-40% latency penalty
|
||||
- Mismatched GPUs (e.g., T4 + P40) work but the slower card bottlenecks
|
||||
- R730 PCIe 3.0 limits host-link bandwidth for newer GPUs (a PCIe 4.0 card such as the L4 gets half its rated link bandwidth; on-card memory bandwidth is unaffected)
|
||||
|
||||
## Apple Silicon Comparison
|
||||
|
||||
Apple Silicon unified memory means ALL system RAM = VRAM with no bus penalty.
|
||||
|
||||
| Device | Memory | Bandwidth | Advantage |
|--------|--------|-----------|-----------|
| Mac Mini M4 Pro 48 GB | 48 GB | 273 GB/s | Silent, 25W, no PCIe penalty |
| Mac Studio M4 Max 128 GB | 128 GB | 546 GB/s | Run 100B+ models |
| Mac Studio M3 Ultra 256 GB | 256 GB | 819 GB/s | Run anything |
|
||||
|
||||
A Mac Mini M4 Pro 48GB often matches or beats a T4+L4 multi-GPU setup for
|
||||
LLM inference due to zero cross-GPU overhead and high unified bandwidth.
|
||||
|
||||
## Best Coding Models (for Ollama)
|
||||
|
||||
For coding tasks specifically, prefer dedicated coding models:
|
||||
1. **Qwen 2.5 Coder 32B** — best open-source coding model in this size class
|
||||
2. **Codestral 22B** — Mistral's dedicated coding model
|
||||
3. **DeepSeek Coder V2** — good quality, efficient
|
||||
4. **Llama 3.1 70B** — strong general purpose but needs ~40 GB
|
||||
|
||||
## Realistic Quality Comparison to Claude
|
||||
|
||||
For Claude Code-style agentic coding workflows:
|
||||
|
||||
| Capability | Opus/Sonnet | Haiku | Qwen 2.5 Coder 32B | 70B General |
|-----------|-------------|-------|---------------------|-------------|
| Single function gen | Excellent | Good | Good | Decent |
| Multi-file refactoring | Excellent | Decent | Weak | Weak |
| Tool use / agentic loops | Excellent | Good | Poor | Poor |
| Long context (large codebases) | Excellent | Good | Weak | Weak |
|
||||
|
||||
Local models work for simple completions and code questions. They struggle badly
|
||||
with Claude Code's complex multi-step tool-use workflows, long context windows,
|
||||
and self-correction capabilities.
|
||||
|
||||
## Quantization Quality Guide
|
||||
|
||||
From best to worst quality (and largest to smallest):
|
||||
- FP16: Full precision, baseline quality
|
||||
- Q8_0: Near-lossless, ~50% size reduction
|
||||
- Q6_K: Minimal quality loss
|
||||
- Q5_K_M: Good balance
|
||||
- Q4_K_M: **Recommended default** — best quality/size tradeoff
|
||||
- Q3_K_M: Noticeable degradation on complex reasoning
|
||||
- Q2_K: Significant quality loss, emergency only
|
||||
|
||||
## Verification
|
||||
- Check GPU compatibility: `lspci | grep -i nvidia` on the host
|
||||
- Check available VRAM: `nvidia-smi` inside the GPU VM
|
||||
- Check model fit: Ollama shows VRAM usage during `ollama run`
|
||||
- Check inference speed: Count tokens/sec in Ollama output
|
||||
|
||||
## Notes
|
||||
- GPU prices fluctuate significantly in the used market; check current prices
|
||||
- The T4 is PCIe 3.0 only; newer GPUs in PCIe 3.0 slots run at reduced bandwidth
|
||||
- Power consumption matters for 24/7 homelab use (electricity cost)
|
||||
- For Claude Code specifically, API-based Claude models remain significantly
|
||||
superior to any local model for agentic coding workflows
|
||||
143
.claude/skills/archived/loki-helm-deployment-pitfalls/SKILL.md
Normal file
|
|
@ -0,0 +1,143 @@
|
|||
---
|
||||
name: loki-helm-deployment-pitfalls
|
||||
description: |
|
||||
Fix common Loki Helm chart deployment failures on Kubernetes with Terraform.
|
||||
Use when: (1) Loki pod fails with "mkdir: read-only file system" for compactor
|
||||
or ruler paths, (2) Helm chart fails with "Helm test requires the Loki Canary
|
||||
to be enabled", (3) Helm install fails with "cannot re-use a name that is still
|
||||
in use" after a failed atomic deploy, (4) PV stuck in Released state after failed
|
||||
Helm install, (5) "entry too far behind" errors flooding Loki logs after initial
|
||||
Alloy deployment. Covers single-binary mode with filesystem storage on NFS.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-13
|
||||
---
|
||||
|
||||
# Loki Helm Chart Deployment Pitfalls
|
||||
|
||||
## Problem
|
||||
Deploying the Grafana Loki Helm chart in single-binary mode with Terraform hits
|
||||
multiple non-obvious failures that aren't documented together.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Deploying Loki via `helm_release` in Terraform
|
||||
- Using `deploymentMode: SingleBinary` with filesystem storage on NFS
|
||||
- First-time deployment or redeployment after failures
|
||||
|
||||
## Pitfall 1: Read-Only Root Filesystem
|
||||
|
||||
**Error:** `mkdir /loki/compactor: read-only file system`
|
||||
|
||||
**Cause:** The Loki Helm chart runs containers with a read-only root filesystem
|
||||
for security. The compactor `working_directory` and ruler `rule_path` default to
|
||||
paths under `/loki/` which is on the read-only root FS.
|
||||
|
||||
**Fix:** Use paths under `/var/loki/` — the Helm chart mounts the persistence
|
||||
volume there:
|
||||
```yaml
|
||||
compactor:
|
||||
working_directory: /var/loki/compactor # NOT /loki/compactor
|
||||
ruler:
|
||||
rule_path: /var/loki/scratch # NOT /loki/scratch
|
||||
```
|
||||
|
||||
## Pitfall 2: Canary Required
|
||||
|
||||
**Error:** `Helm test requires the Loki Canary to be enabled`
|
||||
|
||||
**Cause:** The Loki Helm chart's validation template fails when `lokiCanary.enabled`
is false while `test.enabled` is still true (the default). The Helm test depends on
the canary, so the canary cannot be disabled on its own.

**Fix:** Leave `lokiCanary` enabled (default), or disable it together with `test`. You can
disable `gateway`, `chunksCache`, and `resultsCache` to reduce resource usage:
|
||||
```yaml
|
||||
gateway:
|
||||
enabled: false
|
||||
chunksCache:
|
||||
enabled: false
|
||||
resultsCache:
|
||||
enabled: false
|
||||
# Do NOT set lokiCanary.enabled: false unless test.enabled is also false
|
||||
```
|
||||
|
||||
## Pitfall 3: Stale Helm Release After Failed Atomic Deploy
|
||||
|
||||
**Error:** `cannot re-use a name that is still in use`
|
||||
|
||||
**Cause:** When `atomic = true` and the deploy fails, Helm rolls back but
|
||||
sometimes leaves a stale release secret in Kubernetes. Terraform then can't
|
||||
create a new release with the same name.
|
||||
|
||||
**Fix:** Delete the stale Helm secret:
|
||||
```bash
|
||||
kubectl delete secret -n monitoring sh.helm.release.v1.loki.v1
|
||||
```
|
||||
Also consider removing `atomic = true` for initial deployments and adding it
|
||||
back after the first successful install. Use a longer `timeout` (600s+) for
|
||||
first deploy since image pulls take time.
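
The secret name in the delete command encodes the release name and revision: Helm 3 names release secrets `sh.helm.release.v1.<release>.v<revision>`, so the stale secret is not always `.v1`. A small Python sketch (the helper name is hypothetical) that builds the name for a given revision:

```python
def helm_release_secret_name(release: str, revision: int) -> str:
    """Build the Kubernetes secret name Helm 3 uses for a release revision."""
    if revision < 1:
        raise ValueError("Helm revisions start at 1")
    return f"sh.helm.release.v1.{release}.v{revision}"


print(helm_release_secret_name("loki", 1))  # sh.helm.release.v1.loki.v1
```

To see which revisions actually exist before deleting anything, list the secrets first, for example with `kubectl get secret -n monitoring -l owner=helm,name=loki`.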
|
||||
|
||||
## Pitfall 4: PV Stuck in Released State
|
||||
|
||||
**Symptom:** PV shows `Released` status, PVC can't bind, Loki pod stuck in Pending.
|
||||
|
||||
**Cause:** After a failed Helm deploy, the PVC is deleted but the PV retains a
|
||||
`claimRef` to the old PVC. New PVCs can't bind to a `Released` PV.
|
||||
|
||||
**Fix:** Clear the stale claimRef:
|
||||
```bash
|
||||
kubectl patch pv loki --type json -p '[{"op": "remove", "path": "/spec/claimRef"}]'
|
||||
```
|
||||
The PV will transition from `Released` to `Available` and can be bound again.
|
||||
|
||||
## Pitfall 5: "Entry Too Far Behind" Log Spam
|
||||
|
||||
**Error:** `entry too far behind, entry timestamp is: ... oldest acceptable timestamp is: ...`
|
||||
|
||||
**Cause:** Alloy reads all historical log files from the Kubernetes API on first
|
||||
startup. Old entries are rejected by Loki's ingester because they're behind the
|
||||
newest entry for that stream.
|
||||
|
||||
**Fix:** This is harmless and self-resolving — Alloy catches up to present time
|
||||
and errors stop. To clear immediately:
|
||||
```bash
|
||||
kubectl rollout restart ds -n monitoring alloy
|
||||
```
|
||||
After restart, Alloy tails from approximately "now" for each container.
|
||||
|
||||
## Pitfall 6: Alertmanager Service Name
|
||||
|
||||
**Symptom:** Loki ruler alerts never fire despite correct LogQL rules.
|
||||
|
||||
**Cause:** The Prometheus Helm chart names the Alertmanager service
|
||||
`prometheus-alertmanager`, not `alertmanager`. Using the wrong name causes
|
||||
silent alert delivery failures.
|
||||
|
||||
**Fix:**
|
||||
```yaml
|
||||
ruler:
|
||||
alertmanager_url: http://prometheus-alertmanager.monitoring.svc.cluster.local:9093
|
||||
```
|
||||
Verify the actual service name: `kubectl get svc -n monitoring | grep alertmanager`
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
# Loki pod running
|
||||
kubectl get pods -n monitoring -l app.kubernetes.io/name=loki
|
||||
|
||||
# Loki receiving logs
|
||||
kubectl port-forward -n monitoring svc/loki 3100:3100 &
|
||||
curl -s 'http://localhost:3100/loki/api/v1/labels'
|
||||
# Should return JSON with namespace, pod, container labels
|
||||
|
||||
# PV bound
|
||||
kubectl get pv loki
|
||||
# STATUS should be "Bound"
|
||||
```
|
||||
|
||||
## Notes
|
||||
- Always check PV status before retrying a failed deploy
|
||||
- The Loki Helm chart creates many components by default (gateway, canary,
|
||||
memcached caches) — disable what you don't need for single-binary mode
|
||||
- WAL directory can be on tmpfs (emptyDir with `medium: Memory`) for
|
||||
disk-friendly setups, but data is lost on pod crash
|
||||
- See also: `helm-release-force-rerender` for Helm values not updating resources
|
||||
|
|
@ -0,0 +1,148 @@
|
|||
---
|
||||
name: music-assistant-librespot-wrong-account
|
||||
description: |
|
||||
Fix for Music Assistant Spotify playback failing with "librespot does not support free
|
||||
accounts" even when the Spotify account has Premium. Use when: (1) Songs load for 1-2
|
||||
seconds then auto-pause, (2) Music Assistant logs show "librespot does not support free
|
||||
accounts" followed by FFmpeg "Invalid data found when processing input" exit code 183,
|
||||
(3) Spotify provider shows "Successfully logged in" but streaming fails. Root cause is
|
||||
stale librespot credential cache pointing to a different (free-tier) Spotify account.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-21
|
||||
---
|
||||
|
||||
# Music Assistant Librespot Wrong Account / Stale Credentials
|
||||
|
||||
## Problem
|
||||
Music Assistant (MASS) Spotify playback fails immediately — songs appear to load for 1-2
|
||||
seconds then auto-pause. Every track is marked "unplayable". The error log shows librespot
|
||||
rejecting the account as "free" despite the configured Spotify account having Premium.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Music Assistant addon on Home Assistant (tested with v2.7.8, addon `d5369777_music_assistant`)
|
||||
- Symptoms: Song starts loading, pauses after 1-2 seconds, skipped as "unplayable"
|
||||
- Log pattern (all three appear together on every play attempt):
|
||||
```
|
||||
WARNING [music_assistant.spotify] [librespot] librespot does not support "free" accounts.
|
||||
WARNING [music_assistant.audio.media_stream] Error opening input: Invalid data found when processing input
|
||||
ERROR [music_assistant.streams] AudioError while streaming queue item ... FFMpeg exited with code 183
|
||||
```
|
||||
- OAuth login succeeds: `Successfully logged in to Spotify as <Name>`
|
||||
- But librespot streaming fails with the "free" account error
|
||||
|
||||
## Root Cause
|
||||
Music Assistant uses **two separate auth mechanisms** for Spotify:
|
||||
1. **OAuth (PKCE flow)** — for browsing, search, metadata. Uses access tokens refreshed via
|
||||
the Spotify Web API. This is what produces the "Successfully logged in" message.
|
||||
2. **Librespot** — for actual audio streaming. Uses cached credentials stored in
|
||||
`/data/.cache/spotify--<id>/credentials.json` inside the addon container.
|
||||
|
||||
The librespot credential cache can become stale or point to a **different Spotify account**
|
||||
(e.g., if another family member logged in, or credentials were cached from before a Premium
|
||||
upgrade). Librespot uses these cached credentials to connect to Spotify's internal API, which
|
||||
returns a `ProductInfo` XML packet containing the account `type`. If the cached account is
|
||||
"free", librespot calls `exit(1)`, killing the audio pipeline before FFmpeg receives any data.
|
||||
|
||||
## How Librespot Determines Account Type
|
||||
Librespot reads the `type` field from Spotify's `ProductInfo` server packet
|
||||
(`librespot-org/librespot`, `core/src/session.rs`):
|
||||
```rust
fn check_catalogue(attributes: &UserAttributes) {
    if let Some(account_type) = attributes.get("type") {
        if account_type != "premium" {
            error!("librespot does not support {account_type:?} accounts.");
            exit(1);
        }
    }
}
```
|
||||
The check is an exact string match against `"premium"`.
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Verify the Problem
|
||||
Check Music Assistant addon logs for the "free accounts" error:
|
||||
```bash
# Via HA API (from a machine with the HA token)
python3 -c "
import os, requests
url = os.environ.get('HOME_ASSISTANT_SOFIA_URL', '').rstrip('/')
token = os.environ.get('HOME_ASSISTANT_SOFIA_TOKEN', '')
headers = {'Authorization': f'Bearer {token}'}
r = requests.get(f'{url}/api/hassio/addons/d5369777_music_assistant/logs', headers=headers)
# Print only the log lines relevant to the librespot account check
for line in r.text.split('\n'):
    if 'free' in line.lower() or 'librespot' in line.lower():
        print(line)
"
```
|
||||
|
||||
### Step 2: Identify the Music Assistant Container
|
||||
From the SSH addon (ha-sofia: `ssh vbarzin@192.168.1.8`):
|
||||
```bash
|
||||
sudo curl -s --unix-socket /run/docker.sock http://localhost/containers/json | \
|
||||
python3 -c "import sys,json; [print(c['Names'][0], c['Id'][:12]) for c in json.load(sys.stdin) if 'music' in c['Names'][0].lower()]"
|
||||
```
|
||||
|
||||
### Step 3: Check Cached Credentials
|
||||
Exec into the container to read the librespot cache:
|
||||
```bash
|
||||
# Create exec
|
||||
EXEC_ID=$(sudo curl -s --unix-socket /run/docker.sock \
|
||||
"http://localhost/containers/<CONTAINER_ID>/exec" \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{"Cmd":["cat","/data/.cache/spotify--5s3mSP8y/credentials.json"],"AttachStdout":true,"AttachStderr":true}' | python3 -c "import sys,json; print(json.load(sys.stdin)['Id'])")
|
||||
|
||||
# Run exec
|
||||
sudo curl -s --unix-socket /run/docker.sock \
|
||||
"http://localhost/exec/$EXEC_ID/start" \
|
||||
-H 'Content-Type: application/json' -d '{"Detach":false}'
|
||||
```
|
||||
Check the `username` field — if it doesn't match the expected Premium account, that's the problem.
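
The comparison itself can be sketched in Python once the file contents have been retrieved. This assumes only that `credentials.json` is a JSON object with a `username` key, as described above; the function names and the sample values in the usage comment are hypothetical.

```python
import json
from typing import Optional


def cached_username(credentials_json: str) -> Optional[str]:
    """Return the `username` stored in a librespot credentials.json payload."""
    data = json.loads(credentials_json)
    return data.get("username")


def cache_matches_account(credentials_json: str, expected_username: str) -> bool:
    """True if the cached credentials belong to the expected Spotify user ID."""
    return cached_username(credentials_json) == expected_username


# Example with placeholder values:
#   cache_matches_account('{"username": "someone-else"}', "my-premium-id")  -> False
```

Keep in mind that `username` is Spotify's internal user ID, so compare it against the ID of the Premium account rather than its display name or email.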
|
||||
|
||||
### Step 4: Clear the Cache
|
||||
```bash
|
||||
# Create exec to delete cache
|
||||
EXEC_ID=$(sudo curl -s --unix-socket /run/docker.sock \
|
||||
"http://localhost/containers/<CONTAINER_ID>/exec" \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{"Cmd":["rm","-rf","/data/.cache/spotify--5s3mSP8y"],"AttachStdout":true,"AttachStderr":true}' | python3 -c "import sys,json; print(json.load(sys.stdin)['Id'])")
|
||||
|
||||
# Run exec
|
||||
sudo curl -s --unix-socket /run/docker.sock \
|
||||
"http://localhost/exec/$EXEC_ID/start" \
|
||||
-H 'Content-Type: application/json' -d '{"Detach":false}'
|
||||
```
|
||||
|
||||
### Step 5: Restart Music Assistant
|
||||
```bash
|
||||
sudo curl -s --unix-socket /run/docker.sock \
|
||||
"http://localhost/containers/<CONTAINER_ID>/restart" -X POST
|
||||
```
|
||||
|
||||
### Step 6: Verify
|
||||
After restart, check logs for:
|
||||
- `Successfully logged in to Spotify as <Name>` (OAuth OK)
|
||||
- No "free accounts" error when playing a track
|
||||
- Optionally re-check `/data/.cache/spotify--5s3mSP8y/credentials.json` to confirm the
|
||||
`username` now matches the Premium account
|
||||
|
||||
## Verification
|
||||
1. Play any Spotify track through Music Assistant
|
||||
2. The track should stream without pausing after 1-2 seconds
|
||||
3. Logs should show `Start Queue Flow stream` without subsequent `AudioError`
|
||||
|
||||
## Notes
|
||||
- The cache directory name `spotify--5s3mSP8y` is an internal Music Assistant provider ID
|
||||
and may differ across installations. Use `find /data -name credentials.json` to locate it.
|
||||
- The `username` field in the credentials cache is Spotify's internal user ID (numeric for
|
||||
newer accounts, text for older ones), not necessarily the display name or email.
|
||||
- Spotify Family plan **owners** have account type `"premium"`. Family plan **members** also
|
||||
report as `"premium"` when their membership is active.
|
||||
- If the problem recurs, it may indicate that Music Assistant's Spotify provider re-caches
|
||||
the wrong credentials — check if multiple Spotify accounts are configured or if another
|
||||
user logged in via the Music Assistant UI.
|
||||
- The SSH addon on HA OS needs `sudo` for Docker socket access (`/run/docker.sock` is owned
|
||||
by `root:messagebus`).
|
||||
- The HA long-lived token typically does NOT have Supervisor API access (hassio endpoints
|
||||
return 401), so addon management must go through the Docker socket from the SSH addon.
|
||||
128
.claude/skills/archived/nextcloud-calendar/SKILL.md
Normal file
|
|
@ -0,0 +1,128 @@
|
|||
---
|
||||
name: nextcloud-calendar
|
||||
description: |
|
||||
Create, list, and query calendar events in Nextcloud via CalDAV. Use when:
|
||||
(1) User asks to create a calendar event, (2) User asks what's on their calendar,
|
||||
(3) User says "add to calendar" or "schedule", (4) User asks about upcoming events.
|
||||
Always use Nextcloud calendar unless user specifies otherwise.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2025-01-25
|
||||
---
|
||||
|
||||
# Nextcloud Calendar Management
|
||||
|
||||
## Problem
|
||||
Need to create, query, or manage calendar events in the user's Nextcloud calendar.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- User asks to create/add a calendar event
|
||||
- User asks "what's on my calendar?" or similar
|
||||
- User mentions scheduling something
|
||||
- User says "remind me" with a date (create calendar event)
|
||||
- Default calendar is always Nextcloud unless otherwise specified
|
||||
|
||||
## Prerequisites
|
||||
- Python 3 with `caldav` and `icalendar` packages available (installed via PYTHONPATH or system packages)
|
||||
- Environment variables `NEXTCLOUD_USER` and `NEXTCLOUD_APP_PASSWORD` must be set
|
||||
|
||||
## Solution
|
||||
|
||||
### Script Location
|
||||
```
|
||||
.claude/calendar-query.py
|
||||
```
|
||||
|
||||
### Execution Pattern (CRITICAL)
|
||||
Run the script directly with python3 (env vars are set in the environment):
|
||||
|
||||
```bash
|
||||
python3 .claude/calendar-query.py [command] [options]
|
||||
```
|
||||
|
||||
### Available Commands
|
||||
|
||||
#### List Calendars
|
||||
```bash
python3 .claude/calendar-query.py list
```
|
||||
|
||||
#### Query Events
|
||||
```bash
# Today's events
python3 .claude/calendar-query.py today

# Tomorrow's events
python3 .claude/calendar-query.py tomorrow

# This week
python3 .claude/calendar-query.py week

# This month
python3 .claude/calendar-query.py month

# Custom date range
python3 .claude/calendar-query.py events --days 14
python3 .claude/calendar-query.py events --date 2026-04-10

# From specific calendar
python3 .claude/calendar-query.py today --calendar "Work"
```
|
||||
|
||||
#### Create Events
|
||||
```bash
# All-day event (single day)
python3 .claude/calendar-query.py create --title "Doctor appointment" --start "2026-03-15" --all-day

# All-day event (multi-day) - end date is EXCLUSIVE
# For April 10-13, use end date April 14
python3 .claude/calendar-query.py create --title "Vacation" --start "2026-04-10" --end "2026-04-14" --all-day

# Timed event
python3 .claude/calendar-query.py create --title "Meeting" --start "2026-03-15 14:00" --end "2026-03-15 15:00"

# With location and description
python3 .claude/calendar-query.py create --title "Lunch" --start "tomorrow 12:00" --location "Cafe" --description "Team lunch"

# Relative dates work
python3 .claude/calendar-query.py create --title "Call" --start "today 16:00"
python3 .claude/calendar-query.py create --title "Review" --start "tomorrow 10:00"
```
|
||||
|
||||
### Output Formats
|
||||
```bash
# JSON output (for parsing)
python3 .claude/calendar-query.py today --json

# Text output (default, human-readable)
python3 .claude/calendar-query.py week
```
|
||||
|
||||
## Complete Example
|
||||
|
||||
To create an event "Team offsite" from March 20-22, 2026:
|
||||
|
||||
```bash
|
||||
python3 .claude/calendar-query.py create --title "Team offsite" --start "2026-03-20" --end "2026-03-23" --all-day
|
||||
```
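
The exclusive end date used in the example above (last day March 22, `--end` March 23) can be derived rather than counted by hand. The following is a minimal sketch, not part of `calendar-query.py`; the helper name `exclusive_end` is hypothetical.

```python
from datetime import date, timedelta


def exclusive_end(last_day: str) -> str:
    """Return the value to pass as --end for an all-day event.

    `last_day` is the final day the event should cover (YYYY-MM-DD).
    All-day end dates are exclusive, so the result is one day later.
    """
    return (date.fromisoformat(last_day) + timedelta(days=1)).isoformat()


# Team offsite covering March 20-22, 2026
print(exclusive_end("2026-03-22"))
```

The result can be passed straight to `--end` together with `--all-day`.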
|
||||
|
||||
## Important Notes
|
||||
|
||||
1. **End dates are exclusive** for all-day events (iCalendar `DTEND` is non-inclusive, and CalDAV follows it). To create an event spanning April 10-13, set end to April 14.

2. **No delete/update commands** - The script currently supports only create and query. To modify or delete events, the user must do so manually in Nextcloud.

3. **Default calendar** is "Personal" - use the `--calendar` flag for others.
|
||||
|
||||
## Verification
|
||||
- For queries: Output shows formatted event list
|
||||
- For creates: Output shows "Event created: [title]" with calendar name and start date
|
||||
- Exit code 0 = success, 1 = error (check output for details)
|
||||
|
||||
## Common Errors
|
||||
|
||||
| Error | Cause | Fix |
|
||||
|-------|-------|-----|
|
||||
| `NEXTCLOUD_USER and NEXTCLOUD_APP_PASSWORD must be set` | Env vars not set | Ensure `NEXTCLOUD_USER` and `NEXTCLOUD_APP_PASSWORD` are in the environment |
|
||||
| `Required packages not installed` | caldav/icalendar missing | Ensure PYTHONPATH includes the installed packages |
|
||||
| `Calendar 'X' not found` | Wrong calendar name | Run `list` command to see available calendars |
|
||||
132
.claude/skills/archived/nfsv4-idmapd-uid-mapping/SKILL.md
Normal file
|
|
@ -0,0 +1,132 @@
|
|||
---
|
||||
name: nfsv4-idmapd-uid-mapping
|
||||
description: |
|
||||
Fix for all file UIDs showing as 65534 (nobody) inside Kubernetes containers when using
|
||||
NFS volumes from TrueNAS/FreeBSD. Use when: (1) ls -lan inside a container shows all files
|
||||
owned by 65534:65534 despite correct ownership on the NFS server, (2) PostgreSQL fails with
|
||||
"data directory has wrong ownership", (3) chown inside containers returns "Invalid argument"
|
||||
on NFS volumes, (4) services that check file ownership (PostgreSQL, MySQL) crash on startup,
|
||||
(5) the same NFS mount shows correct UIDs on the host but 65534 inside containers,
|
||||
(6) NFSv4.2 appears in container mount output even though host mounts use NFSv3.
|
||||
Root cause: Kubernetes inline NFS volumes auto-negotiate NFSv4.2 (not NFSv3), and NFSv4
|
||||
idmapd fails to map UIDs when domains don't match or users don't exist on the server.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-03-01
|
||||
---
|
||||
|
||||
# NFSv4 idmapd UID Mapping — All Files Show as nobody (65534)
|
||||
|
||||
## Problem
|
||||
All files on NFS volumes appear owned by UID 65534 (nobody:nogroup) inside Kubernetes
|
||||
containers, even though `ls -lan` on the NFS server shows the correct UIDs (e.g., 999, 472).
|
||||
This breaks any service that checks file ownership: PostgreSQL refuses to start ("data
|
||||
directory has wrong ownership"), MySQL's entrypoint `chown` fails with "Invalid argument",
|
||||
and any `chown` inside the container returns EINVAL.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
|
||||
- TrueNAS CORE (FreeBSD) or TrueNAS SCALE as NFS server
|
||||
- NFSv4 enabled on the NFS server (`v4: true` in TrueNAS NFS config)
|
||||
- Kubernetes using inline NFS volumes (not PV/PVC with mount options)
|
||||
- **Key symptom**: `mount` inside the container shows `type nfs4 (vers=4.2,...)` even
|
||||
though existing kubelet mounts on the host show `vers=3`
|
||||
- **Key symptom**: Same NFS path mounted directly on the host shows correct UIDs, but
|
||||
inside any container shows 65534
|
||||
|
||||
## Root Cause
|
||||
|
||||
Kubernetes inline NFS volumes don't support `mountOptions`. When kubelet mounts NFS for a
|
||||
new pod, the Linux NFS client auto-negotiates the highest available version — NFSv4.2 if
|
||||
the server supports it.
|
||||
|
||||
NFSv4 uses **idmapd** for UID translation: the server translates UID→username (e.g.,
|
||||
`999→postgres@domain`), sends the username string over the wire, and the client translates
|
||||
it back to a local UID. This fails when:
|
||||
|
||||
1. **Domain mismatch**: Server domain (from hostname) differs from client domain
|
||||
- TrueNAS: `viktorbarzin.me` (from `truenas.viktorbarzin.me`)
|
||||
- K8s nodes: `viktorbarzin.lan` (from `k8s-node4.viktorbarzin.lan`)
|
||||
- When domains don't match, ALL UIDs fall back to `nobody` (65534)
|
||||
|
||||
2. **Unknown UIDs**: Even with matching domains, if the NFS server has no local user for
|
||||
UID 999 (common for container UIDs), idmapd maps it to `nobody`
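
The domain-mismatch case can be checked before touching any configuration. The sketch below assumes the simplification described above, that the default NFSv4 domain is the host's FQDN minus its first label; an explicit domain setting on either side overrides that. The function names are hypothetical.

```python
def default_idmap_domain(fqdn: str) -> str:
    """Approximate the default NFSv4 domain: the FQDN without its first label."""
    _, _, domain = fqdn.strip().rstrip(".").partition(".")
    return domain.lower()


def domains_match(server_fqdn: str, client_fqdn: str) -> bool:
    """True when server and client would derive the same default domain."""
    return default_idmap_domain(server_fqdn) == default_idmap_domain(client_fqdn)


# Hostnames taken from the example above
print(default_idmap_domain("truenas.viktorbarzin.me"))
print(default_idmap_domain("k8s-node4.viktorbarzin.lan"))
print(domains_match("truenas.viktorbarzin.me", "k8s-node4.viktorbarzin.lan"))
```

A `False` result corresponds to case 1: every UID falls back to `nobody`.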
|
||||
|
||||
**Why existing mounts work**: Older kubelet mounts (established before NFSv4 was enabled,
|
||||
or when the NFS client defaulted to v3) continue using NFSv3 with direct numeric UID
|
||||
passthrough. Only NEW mounts negotiate NFSv4.2.
|
||||
|
||||
## Solution
|
||||
|
||||
**Fix on TrueNAS (no NFS restart required):**
|
||||
|
||||
```bash
|
||||
# 1. Enable NFSv3-style numeric UID passthrough for NFSv4
|
||||
midclt call nfs.update '{"v4_v3owner": true, "v4_domain": "viktorbarzin.lan"}'
|
||||
|
||||
# 2. Restart nfsuserd with the correct domain (NOT nfsd — that would crash the cluster)
|
||||
killall nfsuserd
|
||||
nfsuserd -domain viktorbarzin.lan -force
|
||||
```
|
||||
|
||||
**Clear caches on all K8s nodes:**
|
||||
|
||||
```bash
|
||||
for node in k8s-node1 k8s-node2 k8s-node3 k8s-node4; do
|
||||
ssh wizard@$node "sudo nfsidmap -c && sudo keyctl clear @u"
|
||||
done
|
||||
```
|
||||
|
||||
**Key settings explained:**
|
||||
- `v4_v3owner = true`: Makes NFSv4 use numeric UID passthrough like NFSv3, completely
|
||||
bypassing the username-based idmapd translation. **This is the critical fix.**
|
||||
- `v4_domain`: Should match the K8s nodes' DNS domain (check with `hostname -d` on a node)
|
||||
- `nfsuserd -domain <domain> -force`: FreeBSD daemon that handles NFSv4 user mapping.
|
||||
The `-force` flag is required if it thinks it's already running.
|
||||
|
||||
## Verification
|
||||
|
||||
```bash
|
||||
# Run a test pod and check UIDs
|
||||
kubectl run nfs-test --rm -it --restart=Never --image=alpine \
|
||||
--overrides='{"spec":{"containers":[{"name":"test","image":"alpine",
|
||||
"command":["sh","-c","ls -lan /data | head -5"],
|
||||
"volumeMounts":[{"name":"nfs","mountPath":"/data"}]}],
|
||||
"volumes":[{"name":"nfs","nfs":{"server":"10.0.10.15","path":"/mnt/main/some-path"}}]}}'
|
||||
|
||||
# Should show actual UIDs (e.g., 999, 472) instead of 65534
|
||||
```
|
||||
|
||||
## Debugging Steps
|
||||
|
||||
If you're not sure whether this is the issue:
|
||||
|
||||
```bash
|
||||
# 1. Check mount type INSIDE a container (not on the host!)
|
||||
kubectl exec <pod> -- mount | grep nfs
|
||||
# If it shows "type nfs4" with "vers=4.2" — this is the issue
|
||||
|
||||
# 2. Compare UIDs: host vs container
|
||||
# On host (via kubelet mount path):
|
||||
sudo ls -lan /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/<vol>/
|
||||
# Inside container:
|
||||
kubectl exec <pod> -- ls -lan /mount-path/
|
||||
|
||||
# 3. Check TrueNAS NFS config
|
||||
midclt call nfs.config # Look for v4: true, v4_v3owner, v4_domain
|
||||
|
||||
# 4. Check nfsuserd is running with the right domain
|
||||
ps aux | grep nfsuserd # On TrueNAS
|
||||
```
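
When several pods need checking, step 1 can be automated by parsing captured `mount` output. This is a rough sketch that assumes the usual `<source> on <target> type <fstype> (<options>)` line format; the sample text is made up for illustration.

```python
import re

# Matches lines such as:
#   10.0.10.15:/mnt/main/db on /data type nfs4 (rw,relatime,vers=4.2)
MOUNT_LINE = re.compile(r"^\S+ on (\S+) type (nfs4?) \((.*)\)$")


def nfs_mounts(mount_output):
    """Return (mount_point, fs_type, vers) for every NFS line in `mount` output."""
    found = []
    for line in mount_output.splitlines():
        match = MOUNT_LINE.match(line.strip())
        if not match:
            continue
        mount_point, fs_type, options = match.groups()
        vers = None
        for option in options.split(","):
            if option.startswith("vers="):
                vers = option.split("=", 1)[1]
        found.append((mount_point, fs_type, vers))
    return found


def nfsv4_mount_points(mount_output):
    """Mount points negotiated as NFSv4.x, i.e. the ones subject to idmapd."""
    return [
        mount_point
        for mount_point, fs_type, vers in nfs_mounts(mount_output)
        if fs_type == "nfs4" or (vers or "").startswith("4")
    ]


# Illustrative sample, not real cluster output
sample = """\
overlay on / type overlay (rw,relatime)
10.0.10.15:/mnt/main/db on /data type nfs4 (rw,relatime,vers=4.2,rsize=131072)
10.0.10.15:/mnt/main/old on /old type nfs (rw,relatime,vers=3,proto=tcp)
"""
print(nfsv4_mount_points(sample))
```

Any mount point reported here is a candidate for the 65534 symptom; NFSv3 mounts are left out because they pass numeric UIDs through.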
|
||||
|
||||
## Notes
|
||||
|
||||
- **NEVER restart NFS (nfsd)** on TrueNAS — it causes mount failures across ALL pods
|
||||
cluster-wide. Only restart `nfsuserd` (the ID mapping daemon).
|
||||
- Existing NFSv3 mounts continue working fine. The issue only affects NEW mounts.
|
||||
- The `v4_v3owner` setting is persistent across TrueNAS reboots (stored in middleware config).
|
||||
- The `nfsuserd` restart is NOT persistent — TrueNAS may restart it without the `-domain`
|
||||
flag after a reboot. The `v4_domain` setting in the middleware config should handle this,
|
||||
but verify after any TrueNAS restart.
|
||||
- On Linux NFS servers (not FreeBSD/TrueNAS), the equivalent fix is setting `Domain` in
|
||||
`/etc/idmapd.conf` on both server and all clients.
|
||||
216
.claude/skills/archived/openclaw-k8s-deployment/SKILL.md
Normal file
|
|
@ -0,0 +1,216 @@
|
|||
---
|
||||
name: openclaw-k8s-deployment
|
||||
description: |
|
||||
Deploy and troubleshoot OpenClaw gateway on Kubernetes. Use when:
|
||||
(1) OpenClaw gateway won't start or shows "Telegram configured, not enabled yet",
|
||||
(2) exec fails with "requires a paired node (none available)",
|
||||
(3) gateway shows "Config invalid" for exec.host or exec.security values,
|
||||
(4) OpenClaw can't write files (EACCES on workspace or home),
|
||||
(5) gateway takes 5+ minutes to start (CPU throttling by VPA/LimitRange),
|
||||
(6) 502 Bad Gateway from Traefik after pod restart,
|
||||
(7) setting up Telegram bot channel,
|
||||
(8) configuring modelrelay sidecar for free model routing.
|
||||
Covers all non-obvious deployment gotchas discovered through trial and error.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-03-01
|
||||
---
|
||||
|
||||
# OpenClaw Kubernetes Deployment
|
||||
|
||||
## Problem
|
||||
Deploying OpenClaw as a Kubernetes pod involves many non-obvious configuration
|
||||
requirements. The gateway process, Telegram integration, exec permissions, and
|
||||
file ownership all have specific constraints not documented together.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Deploying OpenClaw from `ghcr.io/openclaw/openclaw` container image
|
||||
- Running in Kubernetes with NFS volumes, Traefik ingress, Goldilocks/VPA
|
||||
- Want Telegram bot integration, tool execution, and persistent state
|
||||
|
||||
## Solution
|
||||
|
||||
### 1. Gateway Configuration (openclaw.json)
|
||||
|
||||
**Required fields that aren't obvious:**
|
||||
|
||||
```json
|
||||
{
|
||||
"gateway": {
|
||||
"mode": "local",
|
||||
"bind": "lan",
|
||||
"controlUi": {
|
||||
"dangerouslyDisableDeviceAuth": true,
|
||||
"dangerouslyAllowHostHeaderOriginFallback": true
|
||||
}
|
||||
},
|
||||
"wizard": {
|
||||
"lastRunAt": "2026-03-01T00:00:00.000Z",
|
||||
"lastRunVersion": "2026.2.26",
|
||||
"lastRunCommand": "configure",
|
||||
"lastRunMode": "local"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
- `gateway.mode = "local"` — **required** or gateway refuses to start
|
||||
- `dangerouslyAllowHostHeaderOriginFallback = true` — required in v2026.2.26+
|
||||
for non-loopback Control UI (error: "non-loopback Control UI requires
|
||||
gateway.controlUi.allowedOrigins")
|
||||
- `wizard` block — **required** for Telegram to start. Without it, gateway logs
|
||||
"Telegram configured, not enabled yet" on every startup. The wizard block
|
||||
signals that initial setup was completed.
|
||||
|
||||
### 2. Exec Configuration
|
||||
|
||||
Valid values for `tools.exec`:
|
||||
|
||||
| Field | Valid Values | Notes |
|
||||
|-------|-------------|-------|
|
||||
| `host` | `sandbox`, `gateway`, `node` | NOT "local" — that's invalid |
|
||||
| `security` | `deny`, `allowlist`, `full` | NOT "off" — that's invalid |
|
||||
| `ask` | `"off"` | Disables confirmation prompts |
|
||||
|
||||
- `host = "gateway"` — runs commands on the container host directly
|
||||
- `host = "node"` — requires a "paired node" companion app (doesn't work in containers)
|
||||
- `host = "sandbox"` — requires Docker-in-Docker
|
||||
- `security = "full"` — most permissive valid option
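
Because invalid values only surface as "Config invalid" at gateway startup, it can help to check them before applying the ConfigMap. The sketch below encodes only the `host` and `security` values from the table above; it is not OpenClaw's own validation, and `ask` is left unchecked because the table lists just one known value.

```python
# Values taken from the table above; anything else is reported as invalid.
VALID_EXEC_VALUES = {
    "host": {"sandbox", "gateway", "node"},
    "security": {"deny", "allowlist", "full"},
}


def validate_exec(exec_cfg: dict) -> list:
    """Return a list of human-readable problems found in a tools.exec mapping."""
    problems = []
    for field, allowed in VALID_EXEC_VALUES.items():
        if field in exec_cfg and exec_cfg[field] not in allowed:
            problems.append(
                f"tools.exec.{field}: {exec_cfg[field]!r} is not one of {sorted(allowed)}"
            )
    return problems


# The two mistakes called out in the table
for problem in validate_exec({"host": "local", "security": "off", "ask": "off"}):
    print(problem)
```

An empty list means the two fields hold values the gateway is known to accept.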
|
||||
|
||||
### 3. Sandbox Mode
|
||||
|
||||
```json
|
||||
{
|
||||
"agents": {
|
||||
"defaults": {
|
||||
"sandbox": { "mode": "off" },
|
||||
"workspace": "/workspace/infra"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
- `sandbox.mode = "off"` disables Docker sandboxing
|
||||
- `workspace` must be set explicitly — defaults to `~/.openclaw/workspace`
|
||||
|
||||
### 4. File Permissions
|
||||
|
||||
The init container runs as root but the main container runs as `node` (UID 1000).
|
||||
|
||||
**Must chown in init container:**
|
||||
```sh
|
||||
chown -R 1000:1000 /workspace/infra
|
||||
chown -R 1000:1000 /openclaw-home
|
||||
chmod 700 /openclaw-home
|
||||
```
|
||||
|
||||
**Must create directories:**
|
||||
```sh
|
||||
mkdir -p /openclaw-home/agents/main/sessions \
|
||||
/openclaw-home/credentials \
|
||||
/openclaw-home/canvas \
|
||||
/openclaw-home/devices \
|
||||
/openclaw-home/cron
|
||||
```
|
||||
|
||||
Without these: `EACCES: permission denied` errors for AGENTS.md, canvas,
|
||||
cron/jobs.json, devices, and other runtime files.
|
||||
|
||||
### 5. Startup Command
|
||||
|
||||
```sh
|
||||
node openclaw.mjs doctor --fix 2>/dev/null; exec node openclaw.mjs gateway --allow-unconfigured --bind lan
|
||||
```
|
||||
|
||||
Run `doctor --fix` before the gateway to auto-enable Telegram and fix
|
||||
config issues. Without this, Telegram stays "not enabled yet".
|
||||
|
||||
### 6. Resource Requirements
|
||||
|
||||
- **CPU limit: 2 cores minimum** — the Node.js gateway startup is CPU-intensive.
|
||||
With 150-300m CPU, startup takes 5+ minutes.
|
||||
- **Memory limit: 2Gi minimum** — the gateway OOM-kills at 1Gi during startup
|
||||
(V8 heap exhaustion).
|
||||
- **Goldilocks VPA will override these**: see the Goldilocks VPA item under Notes below.
|
||||
|
||||
### 7. Readiness Probe
|
||||
|
||||
```hcl
|
||||
readiness_probe {
|
||||
tcp_socket { port = 18789 }
|
||||
initial_delay_seconds = 30
|
||||
period_seconds = 10
|
||||
}
|
||||
```
|
||||
|
||||
Do NOT use a startup probe: the gateway can take 2-3 minutes to start listening,
and a startup probe that exhausts its failure threshold before then restarts the
container. Use a readiness probe only; it keeps Traefik from routing to the pod
(avoiding 502s) during startup without killing the container.
|
||||
|
||||
### 8. Telegram Integration
|
||||
|
||||
```json
|
||||
{
|
||||
"channels": {
|
||||
"telegram": {
|
||||
"enabled": true,
|
||||
"botToken": "...",
|
||||
"dmPolicy": "allowlist",
|
||||
"allowFrom": ["tg:USER_ID"],
|
||||
"groupPolicy": "allowlist",
|
||||
"streamMode": "partial"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Telegram won't start without:
|
||||
1. The `wizard` block in config (signals setup was run)
|
||||
2. `doctor --fix` at startup (auto-enables the channel)
|
||||
3. Both `groupPolicy` and `streamMode` fields
|
||||
|
||||
### 9. NFS Volume Strategy
|
||||
|
||||
| Volume | Purpose | Type |
|
||||
|--------|---------|------|
|
||||
| `/home/node/.openclaw` | Persistent state (SOUL.md, sessions, memory, telegram) | NFS |
|
||||
| `/tools` | Cached binaries (kubectl, terraform, terragrunt, python libs) | NFS |
|
||||
| `/workspace` | Infra repo clone | NFS |
|
||||
| `/data` | General data | NFS |
|
||||
|
||||
Using NFS for tools cache reduces restart time from ~2.5min to ~38s by skipping
|
||||
binary downloads and pip installs on subsequent starts.
|
||||
|
||||
### 10. ModelRelay Sidecar
|
||||
|
||||
Deploy as a sidecar container for automatic free model routing:
|
||||
|
||||
```hcl
container {
  name    = "modelrelay"
  image   = "node:22-alpine"
  command = ["sh", "-c", "npm install -g modelrelay; exec modelrelay --port 7352"]

  # HCL does not allow ";" separators, so each env block spans multiple lines
  env {
    name  = "NVIDIA_API_KEY"
    value = "..."
  }
  env {
    name  = "OPENROUTER_API_KEY"
    value = "..."
  }
}
```
|
||||
|
||||
Configure as provider: `baseUrl = "http://127.0.0.1:7352/v1"`, model `auto-fastest`.
|
||||
|
||||
## Verification
|
||||
1. `kubectl logs -c openclaw` should show `[gateway] listening on ws://0.0.0.0:18789`
|
||||
2. No "Telegram configured, not enabled yet" message
|
||||
3. No `EACCES` permission errors
|
||||
4. `kubectl exec ... -- cat /proc/net/tcp` shows listening sockets
|
||||
5. Telegram bot responds to `/start`
|
||||
|
||||
## Notes
|
||||
- ConfigMap changes require pod restart (init container copies config at start)
|
||||
- ConfigMap taint+reinit sometimes needed when Terraform state gets out of sync
|
||||
- Goldilocks VPA recreates itself from namespace labels — must delete VPA on
|
||||
every pod recreation if namespace has `goldilocks.fairwinds.com/vpa-update-mode`
|
||||
- The `--allow-unconfigured` flag is needed for the gateway command
|
||||
- v2026.2.26 introduced breaking change requiring `dangerouslyAllowHostHeaderOriginFallback`
|
||||
|
||||
## See also
|
||||
- `openclaw-custom-model-provider` — basic model provider configuration
|
||||
- `k8s-limitrange-oom-silent-kill` — LimitRange causing OOM (related but different)
|
||||
|
|
@ -0,0 +1,169 @@
|
|||
---
|
||||
name: pfsense-dnsmasq-interface-binding
|
||||
description: |
|
||||
Restrict pfSense dnsmasq (DNS Forwarder) to specific interfaces to free port 53 on
|
||||
other interfaces for port forwarding. Use when: (1) pfSense blocks port 53 NAT port
|
||||
forward because dnsmasq is listening on *:53, (2) need to forward DNS from WAN to an
|
||||
internal DNS server while preserving client source IPs, (3) dnsmasq shows *:53 in
|
||||
sockstat despite --listen-address flags, (4) pfSense loses DNS resolution after
|
||||
restricting dnsmasq interfaces, (5) NAT rdr rules for port 53 silently fail to
|
||||
generate in /tmp/rules.debug.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-17
|
||||
---
|
||||
|
||||
# pfSense dnsmasq Interface Binding for DNS Port Forwarding
|
||||
|
||||
## Problem
|
||||
pfSense's dnsmasq (DNS Forwarder) binds to `*:53` by default. This prevents creating
|
||||
NAT port forward rules for port 53 — pfSense silently skips generating the pf `rdr`
|
||||
directive. You need to restrict dnsmasq to specific interfaces to free port 53 on other
|
||||
interfaces (e.g., WAN) for forwarding to an internal DNS server.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Attempting to create a NAT port forward for port 53 on the WAN interface
|
||||
- Port forward rule saves to config.xml but `pfctl -sn` shows no corresponding `rdr` rule
|
||||
- `sockstat -4 | grep ":53"` shows `dnsmasq` on `*:53`
|
||||
- Goal: Forward DNS queries from one network to an internal DNS server (e.g., Technitium)
|
||||
while preserving client source IPs (no masquerading)
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Bind dnsmasq to specific interfaces
|
||||
|
||||
Set the interface field in pfSense's dnsmasq config:
|
||||
|
||||
```php
|
||||
ssh admin@10.0.20.1 'php -r '"'"'
|
||||
require_once("config.inc");
|
||||
require_once("service-utils.inc");
|
||||
global $config;
|
||||
$config = parse_config(true);
|
||||
$config["dnsmasq"]["interface"] = "lan,opt1"; // Only LAN and OPT1, NOT wan
|
||||
write_config("Bind dnsmasq to LAN and OPT1 only");
|
||||
'"'"''
|
||||
```
|
||||
|
||||
This adds `--listen-address=<IP>` flags to dnsmasq but does NOT change socket binding.
|
||||
|
||||
### Step 2: Add bind-dynamic (CRITICAL)
|
||||
|
||||
Without `bind-dynamic`, dnsmasq still binds the socket to `*:53` even with
|
||||
`--listen-address` flags. The `--listen-address` only controls which queries get
|
||||
responses, not the actual socket binding.
|
||||
|
||||
```php
|
||||
ssh admin@10.0.20.1 'php -r '"'"'
|
||||
require_once("config.inc");
|
||||
require_once("service-utils.inc");
|
||||
global $config;
|
||||
$config = parse_config(true);
|
||||
$existing = base64_decode($config["dnsmasq"]["custom_options"]);
|
||||
if (strpos($existing, "bind-dynamic") === false) {
|
||||
$existing = "bind-dynamic\n" . $existing;
|
||||
$config["dnsmasq"]["custom_options"] = base64_encode($existing);
|
||||
write_config("Add bind-dynamic to restrict dnsmasq socket binding");
|
||||
}
|
||||
'"'"''
|
||||
```
|
||||
|
||||
### Step 3: Add localhost listen address (CRITICAL)
|
||||
|
||||
pfSense's own `resolv.conf` points to `127.0.0.1`. Without this, pfSense itself
|
||||
loses DNS resolution after the interface restriction.
|
||||
|
||||
```
# Add to custom_options (base64-encoded in config):
listen-address=127.0.0.1
```
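
Since `custom_options` is stored base64-encoded, adding a line means decoding, editing, and re-encoding the whole value. The sketch below shows only that round trip and does not read or write pfSense's config; the function name is hypothetical. On the firewall itself, the same steps would be done in PHP as in Step 2.

```python
import base64


def add_dnsmasq_options(encoded: str, options) -> str:
    """Return a new base64 value with each option present exactly once.

    `encoded` is the current custom_options value (may be empty).
    Existing lines are kept in order; missing options are appended.
    """
    lines = base64.b64decode(encoded).decode("utf-8").splitlines()
    for option in options:
        if option not in lines:
            lines.append(option)
    return base64.b64encode("\n".join(lines).encode("utf-8")).decode("ascii")


current = base64.b64encode(b"bind-dynamic").decode("ascii")
updated = add_dnsmasq_options(current, ["bind-dynamic", "listen-address=127.0.0.1"])
print(base64.b64decode(updated).decode("utf-8"))
```

Running it twice yields the same value, so the helper is safe to re-apply.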
|
||||
|
||||
### Step 4: Restart dnsmasq
|
||||
|
||||
```php
|
||||
services_dnsmasq_configure();
|
||||
```
|
||||
|
||||
### Step 5: Verify binding
|
||||
|
||||
```bash
|
||||
sockstat -4 | grep ":53 "
|
||||
# Should show specific IPs, not *:53:
|
||||
# 127.0.0.1:53
|
||||
# 10.0.10.1:53 (lan)
|
||||
# 10.0.20.1:53 (opt1)
|
||||
# NOT 192.168.1.2:53 (wan)
|
||||
```
|
||||
|
||||
### Step 6: Add the port forward rule
|
||||
|
||||
**Critical format note**: The `source` field must use `array("any" => "")`, NOT
|
||||
`array("network" => "192.168.1.0/24")`. The CIDR source format silently fails to
|
||||
generate the pf `rdr` directive.
|
||||
|
||||
```php
|
||||
ssh admin@10.0.20.1 'php -r '"'"'
|
||||
require_once("config.inc");
|
||||
require_once("filter.inc");
|
||||
require_once("shaper.inc");
|
||||
global $config;
|
||||
$config = parse_config(true);
|
||||
|
||||
$rule = array(
|
||||
"source" => array("any" => ""), // MUST be "any", not CIDR
|
||||
"destination" => array(
|
||||
"network" => "wanip",
|
||||
"port" => "53"
|
||||
),
|
||||
"ipprotocol" => "inet",
|
||||
"protocol" => "udp",
|
||||
"target" => "10.0.20.204", // Internal DNS server
|
||||
"local-port" => "53",
|
||||
"interface" => "wan",
|
||||
"associated-rule-id" => "pass",
|
||||
"descr" => "DNS to internal DNS (preserve client IP)",
|
||||
"created" => array("time" => (string)time(), "username" => "admin"),
|
||||
"updated" => array("time" => (string)time(), "username" => "admin")
|
||||
);
|
||||
array_unshift($config["nat"]["rule"], $rule);
|
||||
write_config("Add DNS port forward");
|
||||
filter_configure();
|
||||
'"'"''
|
||||
```
|
||||
|
||||
### Step 7: Verify the redirect rule
|
||||
|
||||
```bash
|
||||
pfctl -sn | grep "domain\|:53"
|
||||
# Should show: rdr pass on vtnet0 inet proto udp from any to 192.168.1.2 port = domain -> 10.0.20.204
|
||||
```
|
||||
|
||||
## Verification
|
||||
|
||||
1. pfSense own DNS: `nslookup google.com 127.0.0.1` (from pfSense shell)
|
||||
2. Internal DNS: `nslookup google.com 10.0.20.1` (from LAN/OPT1 clients)
|
||||
3. Port forward: `dig @192.168.1.2 example.com` (from WAN-side client)
|
||||
4. Client IP: Check DNS server logs — should show real client IP, not pfSense IP
|
||||
|
||||
## Pitfalls
|
||||
|
||||
| Pitfall | Symptom | Fix |
|
||||
|---------|---------|-----|
|
||||
| Missing `bind-dynamic` | sockstat shows `*:53`, port forward still blocked | Add `bind-dynamic` to custom_options |
|
||||
| Missing `listen-address=127.0.0.1` | pfSense loses all DNS resolution | Add to custom_options |
|
||||
| Source `"network" => "CIDR"` in NAT rule | Rule saves to config but no `rdr` in `pfctl -sn` | Use `"any" => ""` instead |
|
||||
| Using local `$config` variable | Config not persisted after PHP exit | Always use `global $config` |
|
||||
| Not calling `filter_configure()` | Rule in config.xml but not in pf | Call after `write_config()` |
|
||||
| Custom options not base64 | dnsmasq fails to start | pfSense stores custom_options as base64 |
|
||||
|
||||
## Notes
|
||||
- `bind-dynamic` is preferred over `bind-interfaces` because it handles interfaces that
|
||||
come up after dnsmasq starts (e.g., VPN tunnels)
|
||||
- The pf `rdr` rule is a redirect, not masquerade — source IP is preserved
|
||||
- dnsmasq custom_options in pfSense config.xml are base64-encoded
|
||||
- Check `/tmp/rules.debug` for the generated pf ruleset (before loading into pf)
|
||||
- Use `pfctl -sn` to see rules actually loaded in the running firewall
|
||||
|
||||
## See also
|
||||
- `pfsense` — General pfSense management skill
|
||||
- `k8s-ndots-search-domain-nxdomain-flood` — Related DNS optimization
|
||||
105
.claude/skills/archived/pfsense-nat-rule-creation/SKILL.md
Normal file
|
|
@ -0,0 +1,105 @@
|
|||
---
|
||||
name: pfsense-nat-rule-creation
|
||||
description: |
|
||||
Create NAT port forward rules on pfSense programmatically via PHP/SSH.
|
||||
Use when: (1) adding port forwards for new K8s services, (2) NAT rules
|
||||
added via PHP don't appear in pfctl output, (3) config_read_array() throws
|
||||
"undefined function" error, (4) destination "wanip" not working in NAT rules,
|
||||
(5) rules saved to config.xml but not loaded into pfctl. Covers the correct
|
||||
PHP array structure, config API differences between pfSense versions, and
|
||||
the required pfctl reload step.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-21
|
||||
---
|
||||
|
||||
# pfSense NAT Rule Creation via PHP
|
||||
|
||||
## Problem
|
||||
Creating NAT port forward rules on pfSense programmatically via SSH/PHP has
|
||||
multiple gotchas around the config API, rule structure, and rule loading.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Adding a port forward for a new Kubernetes service (e.g., TURN, game server)
|
||||
- Using `ssh admin@10.0.20.1` + PHP to automate pfSense config
|
||||
- NAT rules don't appear in `pfctl -sn` after `write_config()` + `filter_configure()`
|
||||
- `config_read_array()` throws "Call to undefined function"
|
||||
- Rules saved to config.xml but pfctl doesn't have them
|
||||
|
||||
## Solution
|
||||
|
||||
### Correct PHP for adding NAT rules
|
||||
|
||||
```php
|
||||
<?php
|
||||
require_once("config.inc");
|
||||
require_once("filter.inc");
|
||||
global $config; // NOT config_read_array() — that doesn't exist in pfSense 2.7.x
|
||||
|
||||
$config["nat"]["rule"][] = array(
|
||||
"interface" => "wan",
|
||||
"ipprotocol" => "inet", // Required! Must be "inet" for IPv4
|
||||
"protocol" => "tcp/udp", // Or "udp" or "tcp"
|
||||
"source" => array("any" => ""),
|
||||
"destination" => array(
|
||||
"network" => "wanip", // Use "network" => "wanip", NOT "address" => "wanip"
|
||||
"port" => "3478" // Single port or "start:end" for range
|
||||
),
|
||||
"target" => "10.0.20.200", // Internal destination IP
|
||||
"local-port" => "3478", // Internal port (for ranges, just the start port)
|
||||
"descr" => "My port forward",
|
||||
"associated-rule-id" => "pass" // Auto-create firewall pass rule
|
||||
);
|
||||
|
||||
write_config("Description for config history");
|
||||
filter_configure();
|
||||
```
|
||||
|
||||
### Key gotchas
|
||||
|
||||
1. **`config_read_array()` doesn't exist** in pfSense 2.7.x. Use `global $config` instead.
|
||||
|
||||
2. **Destination format**: Use `"network" => "wanip"`, NOT `"address" => "wanip"` or `"address" => "192.168.1.2"`. The `"network"` key with `"wanip"` tells pfSense to resolve the WAN IP dynamically.
|
||||
|
||||
3. **`ipprotocol` is required**: Must include `"ipprotocol" => "inet"` or rules won't generate in `/tmp/rules.debug`.
|
||||
|
||||
4. **Port ranges**: Use `"port" => "49152:49252"` for ranges. The `"local-port"` should be just the start port — pfSense maps the range automatically.
|
||||
|
||||
5. **Rules may not load immediately**: After `write_config()` + `filter_configure()`, rules appear in `/tmp/rules.debug` but may not be in pfctl until the next filter reload. Force with:
|
||||
```bash
|
||||
pfctl -f /tmp/rules.debug
|
||||
```
|
||||
|
||||
6. **SSH quoting**: The pfsense.py `php` command breaks on `\n` in strings. For multi-line PHP, write a `.php` file, `scp` it, and execute:
|
||||
```bash
|
||||
scp script.php admin@10.0.20.1:/tmp/
|
||||
ssh admin@10.0.20.1 "php /tmp/script.php"
|
||||
```
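
Gotcha 4 (port ranges) is easy to get wrong when rules are generated from a list of services. The sketch below builds only the `port` and `local-port` values described there; the helper name is hypothetical, and the result still has to be placed into the PHP rule array shown earlier.

```python
def nat_port_fields(start: int, end=None) -> dict:
    """Return the destination "port" and "local-port" strings for a NAT rule.

    A single port maps to itself. For a range, "port" is "start:end" and
    "local-port" is just the start port, as described in gotcha 4.
    """
    if not 0 < start <= 65535:
        raise ValueError(f"invalid start port: {start}")
    if end is None or end == start:
        return {"port": str(start), "local-port": str(start)}
    if not start < end <= 65535:
        raise ValueError(f"invalid port range: {start}:{end}")
    return {"port": f"{start}:{end}", "local-port": str(start)}


print(nat_port_fields(3478))
print(nat_port_fields(49152, 49252))
```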
|
||||
|
||||
### Execution via pfsense.py
|
||||
|
||||
For simple single-line PHP (no newlines or backslashes):
|
||||
```bash
|
||||
python3 .claude/pfsense.py php 'require_once("config.inc"); ...; echo "Done";'
|
||||
```
|
||||
|
||||
For complex scripts, use scp + ssh as above.
|
||||
|
||||
## Verification
|
||||
|
||||
```bash
|
||||
# Check rules in config
|
||||
ssh admin@10.0.20.1 "grep 'YOUR_PORT' /cf/conf/config.xml"
|
||||
|
||||
# Check generated pf rules
|
||||
ssh admin@10.0.20.1 "grep 'YOUR_PORT' /tmp/rules.debug"
|
||||
|
||||
# Check active pfctl rules
|
||||
python3 .claude/pfsense.py pfctl "-sn" | grep YOUR_PORT
|
||||
```
|
||||
|
||||
## Notes
|
||||
- Existing working NAT rules on this pfSense use the same structure (check WireGuard port 51820 as reference)
|
||||
- The `associated-rule-id: pass` auto-creates a WAN firewall rule to allow the forwarded traffic
|
||||
- pfSense applies NAT rules across ALL interfaces when using the web UI, but PHP-created rules only apply to the specified interface
|
||||
- See also: `pfsense` skill for general pfSense management
|
||||
|
|
@ -0,0 +1,136 @@
|
|||
---
|
||||
name: proxmox-vm-disk-expansion-pitfalls
|
||||
description: |
|
||||
Troubleshoot common failures when expanding Proxmox VM disks on Ubuntu 24.04
|
||||
cloud-init images and draining Kubernetes nodes. Use when: (1) growpart fails
|
||||
with "command not found" on Ubuntu cloud-init VMs, (2) grep -P fails on macOS
|
||||
with "invalid option -- P", (3) kubectl drain times out with pods stuck
|
||||
terminating, (4) filesystem shows old size after qm resize. Covers
|
||||
cloud-guest-utils installation, macOS-portable regex parsing, drain timeout
|
||||
tuning, and recovery from partial failures.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-13
|
||||
---
|
||||
|
||||
# Proxmox VM Disk Expansion Pitfalls
|
||||
|
||||
## Problem
|
||||
|
||||
Expanding disk storage on Proxmox-hosted Ubuntu 24.04 cloud-init VMs (used as
|
||||
Kubernetes nodes) fails at multiple points due to missing tools, cross-platform
|
||||
incompatibilities, and Kubernetes drain timeouts.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
|
||||
- Running disk expansion scripts from macOS against Proxmox + Ubuntu VMs
|
||||
- Ubuntu 24.04 cloud-init images (the default k8s node template)
|
||||
- Kubernetes nodes with many pods or stateful workloads
|
||||
- Using `scripts/extend_vm_storage.sh` or similar automation
|
||||
|
||||
## Issues and Solutions
|
||||
|
||||
### 1. `growpart: command not found` on Ubuntu 24.04
|
||||
|
||||
**Symptom**: After `qm resize`, SSH into VM, run `growpart /dev/sda 1` — fails
|
||||
with "command not found". `resize2fs` then reports "Nothing to do!" because the
|
||||
partition table hasn't been updated.
|
||||
|
||||
**Root cause**: Ubuntu 24.04 cloud-init images don't include `cloud-guest-utils`
|
||||
by default. The `growpart` tool (which updates the partition table to use new
|
||||
disk space) is in this package.
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
sudo apt-get update -qq && sudo apt-get install -y -qq cloud-guest-utils
|
||||
sudo growpart /dev/sda 1
|
||||
sudo resize2fs /dev/sda1
|
||||
```
|
||||
|
||||
**Prevention**: Check for `growpart` before attempting partition expansion:
|
||||
```bash
|
||||
if ! command -v growpart &>/dev/null; then
|
||||
sudo apt-get update -qq && sudo apt-get install -y -qq cloud-guest-utils
|
||||
fi
|
||||
```
|
||||
|
||||
### 2. `grep -P` (PCRE) not available on macOS
|
||||
|
||||
**Symptom**: Script running on macOS fails with `grep: invalid option -- P`.
|
||||
|
||||
**Root cause**: macOS ships BSD grep, which doesn't support `-P` (Perl-compatible
|
||||
regex). GNU grep (from Homebrew) does, but scripts shouldn't assume it's installed.
|
||||
|
||||
**Fix**: Replace `grep -oP 'pattern\Kcapture'` with portable `sed`:
|
||||
```bash
|
||||
# BAD (GNU grep only):
|
||||
CURRENT_SIZE=$(echo "$LINE" | grep -oP 'size=\K[0-9]+G')
|
||||
|
||||
# GOOD (portable):
|
||||
CURRENT_SIZE=$(echo "$LINE" | sed -n 's/.*size=\([0-9]*G\).*/\1/p')
|
||||
```
|
||||
|
||||
**General rule**: In scripts that may run on macOS, avoid GNU-only features such as
`grep -P`, and account for BSD/GNU differences in in-place `sed` (`sed -i ''` on BSD
vs `sed -i` on GNU) and in `date` flags. Use `sed` with basic regex or the bash
built-in `[[ =~ ]]` for pattern matching.
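
As a minimal sketch of the `[[ =~ ]]` alternative, the size extraction from the fix above can be done with bash built-ins alone, with no external `grep` or `sed`. The function name `extract_disk_size` and the sample config line are illustrative assumptions, not taken from `scripts/extend_vm_storage.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the "NNG" size from a Proxmox disk config line.
# Uses only bash built-ins, so it behaves the same with BSD and GNU userlands.
extract_disk_size() {
  local line="$1"
  if [[ "$line" =~ size=([0-9]+G) ]]; then
    # BASH_REMATCH[1] holds the first capture group of the last match
    printf '%s\n' "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

# Usage (sample line is made up for illustration):
#   CURRENT_SIZE=$(extract_disk_size "scsi0: local-lvm:vm-201-disk-0,size=64G")
```

Returning non-zero when nothing matches lets the caller distinguish "no size field" from an empty string.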
|
||||
|
||||
### 3. `kubectl drain` timeout with stuck pods
|
||||
|
||||
**Symptom**: `kubectl drain --timeout=120s` fails with "context deadline exceeded"
|
||||
for multiple pods. Pods are evicted but don't terminate in time.
|
||||
|
||||
**Root cause**: Some pods (stateful services like ClickHouse, Paperless-ngx,
|
||||
OnlyOffice) need more time to shut down gracefully. 120s isn't enough when many
|
||||
pods are draining simultaneously.
|
||||
|
||||
**Fix**: Retry with a longer timeout. `--force` additionally lets the drain proceed past pods that are not managed by a controller:
|
||||
```bash
|
||||
# First attempt with standard timeout
|
||||
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=120s
|
||||
|
||||
# If it fails, force with longer timeout (pods already evicting)
|
||||
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=300s --force
|
||||
```
|
||||
|
||||
**Note**: After a failed drain, the node is already cordoned. A second drain
|
||||
attempt only needs to wait for already-evicting pods to finish.
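
The retry described above could be wrapped in a small loop that escalates the timeout on each attempt. This is a sketch under assumptions: `drain_with_retry` and the `KUBECTL` override variable are hypothetical names introduced here, and the loop deliberately omits `--force`, which is only needed for pods without a controller.

```bash
# Hypothetical helper: drain a node, retrying with each timeout (in seconds)
# given after the node name. KUBECTL may point at a wrapper or a stub.
drain_with_retry() {
  local node="$1"
  shift
  local timeout
  for timeout in "$@"; do
    if "${KUBECTL:-kubectl}" drain "$node" --ignore-daemonsets \
         --delete-emptydir-data --timeout="${timeout}s"; then
      return 0
    fi
    # The node stays cordoned, so the next attempt only waits on evicting pods
    echo "drain of $node did not finish within ${timeout}s" >&2
  done
  return 1
}

# Usage: drain_with_retry <node> 120 300
```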
|
||||
|
||||
### 4. Recovery from partial failure
|
||||
|
||||
If the script fails mid-way (after drain but before uncordon):
|
||||
|
||||
```bash
|
||||
# Check VM status
|
||||
ssh root@192.168.1.127 "qm status <vmid>"
|
||||
|
||||
# Start VM if stopped
|
||||
ssh root@192.168.1.127 "qm start <vmid>"
|
||||
|
||||
# Uncordon node
|
||||
kubectl --kubeconfig $(pwd)/config uncordon <node-name>
|
||||
```
|
||||
|
||||
## Verification
|
||||
|
||||
After successful expansion:
|
||||
```bash
|
||||
# On the VM
|
||||
df -h /
|
||||
# Should show new size (128G disk → ~126G usable for ext4)
|
||||
|
||||
# On the cluster
|
||||
kubectl get node <name>
|
||||
# Should show Ready status
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- The k8s node VMs use direct partition layout (`/dev/sda1`), not LVM, despite
|
||||
the script handling both paths
|
||||
- `growpart` returns exit code 1 for "NOCHANGE" (partition already at max) —
|
||||
this is not an error
|
||||
- Proxmox `qm resize` uses `scsi0` as the disk identifier for these VMs
|
||||
- SSH host keys may change if VMs are recreated or network changes — use
|
||||
`-o StrictHostKeyChecking=no` in automated scripts
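
Because `growpart` signals NOCHANGE with exit code 1, a script running under `set -e` would abort on a partition that is already at full size. A minimal sketch of tolerating that case follows; `grow_partition` and the `GROWPART` override are hypothetical names, and treating every other non-zero code as a failure is an assumption.

```bash
# Hypothetical helper: run growpart and treat exit code 1 (NOCHANGE) as success.
# GROWPART may be set to e.g. "sudo growpart"; it is word-split on purpose.
grow_partition() {
  local disk="$1" part="$2" rc=0
  ${GROWPART:-growpart} "$disk" "$part" || rc=$?
  case "$rc" in
    0) echo "partition grown" ;;
    1) echo "partition already at maximum size (NOCHANGE)" ;;
    *) echo "growpart failed with exit code $rc" >&2
       return "$rc" ;;
  esac
}

# Usage: GROWPART="sudo growpart" grow_partition /dev/sda 1 && sudo resize2fs /dev/sda1
```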
|
||||
|
||||
See also: `extend-vm-storage.md` (the operational skill for running the script)
|
||||
182
.claude/skills/archived/python-filename-sanitization/SKILL.md
Normal file
|
|
@ -0,0 +1,182 @@
|
|||
---
|
||||
name: python-filename-sanitization
|
||||
description: |
|
||||
Secure filename sanitization pattern for Python web applications. Use when:
|
||||
(1) Accepting user-provided filenames for file operations, (2) Building file
|
||||
rename/upload functionality, (3) Preventing path traversal attacks (../../../etc/passwd),
|
||||
(4) Preventing shell injection through filenames, (5) FastAPI/Flask file handling.
|
||||
Provides regex-based whitelist approach with pathlib for safe file operations.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2025-01-31
|
||||
---
|
||||
|
||||
# Python Filename Sanitization
|
||||
|
||||
## Problem
|
||||
User-provided filenames can contain malicious characters that enable path traversal
|
||||
attacks, shell injection, or filesystem corruption. Direct use of user input in
|
||||
file paths is a security vulnerability.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Building file upload, rename, or download functionality
|
||||
- User can specify filenames via API or form input
|
||||
- Files are stored on server filesystem
|
||||
- Need to prevent: `../`, shell metacharacters, null bytes, etc.
|
||||
|
||||
## Solution
|
||||
|
||||
### Complete Sanitization Function
|
||||
```python
import re
from pathlib import Path


def sanitize_filename(filename: str, max_length: int = 200) -> str:
    """
    Sanitize a filename to prevent path traversal and shell injection.
    Only allows alphanumeric characters, spaces, hyphens, underscores,
    parentheses, and dots.
    """
    if not filename:
        raise ValueError("Filename cannot be empty")

    # Remove any path components (prevent path traversal)
    filename = Path(filename).name

    # Only allow safe characters: alphanumeric, space, hyphen, underscore, parentheses, dot
    # This regex removes anything that isn't in the allowed set
    safe_filename = re.sub(r'[^a-zA-Z0-9\s\-_().]', '', filename)

    # Collapse multiple spaces/dots
    safe_filename = re.sub(r'\s+', ' ', safe_filename)
    safe_filename = re.sub(r'\.+', '.', safe_filename)

    # Strip leading/trailing whitespace and dots
    safe_filename = safe_filename.strip(' .')

    # Limit length
    if len(safe_filename) > max_length:
        safe_filename = safe_filename[:max_length]

    if not safe_filename:
        raise ValueError("Filename contains no valid characters")

    return safe_filename
```
|
||||
|
||||
### FastAPI Integration Example
|
||||
```python
from pathlib import Path

from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

# Assumes sanitize_filename() from the section above is importable here
router = APIRouter()


class RenameRequest(BaseModel):
    new_name: str


@router.patch("/files/{file_id}/rename")
async def rename_file(file_id: str, request: RenameRequest):
    """Rename a file with sanitized input."""
    # NOTE: file_id is user input too; validate it (e.g. as a UUID) before use
    file_dir = Path("/data/files") / file_id

    if not file_dir.exists():
        raise HTTPException(status_code=404, detail="File not found")

    # Find existing file
    files = list(file_dir.glob("*"))
    if not files:
        raise HTTPException(status_code=404, detail="No file found")

    current_file = files[0]
    current_extension = current_file.suffix

    # Sanitize the new name
    try:
        safe_name = sanitize_filename(request.new_name)
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e)) from e

    # Preserve original extension
    if not safe_name.lower().endswith(current_extension.lower()):
        safe_name = safe_name + current_extension

    # Create new path (same directory, new filename)
    new_file = file_dir / safe_name

    # Check for conflicts
    if new_file.exists() and new_file != current_file:
        raise HTTPException(status_code=400, detail="A file with that name already exists")

    # Rename using pathlib (no shell commands!)
    current_file.rename(new_file)

    return {"status": "renamed", "new_filename": safe_name}
```
|
||||
|
||||
## Key Security Principles
|
||||
|
||||
### 1. Whitelist, Don't Blacklist
|
||||
```python
|
||||
# BAD: Trying to block dangerous characters
|
||||
filename = filename.replace('../', '').replace('\x00', '')
|
||||
|
||||
# GOOD: Only allow known-safe characters
|
||||
safe_filename = re.sub(r'[^a-zA-Z0-9\s\-_().]', '', filename)
|
||||
```
|
||||
|
||||
### 2. Use pathlib, Not Shell Commands
|
||||
```python
|
||||
# BAD: Shell command (vulnerable to injection)
|
||||
os.system(f'mv "{old_path}" "{new_path}"')
|
||||
|
||||
# GOOD: Pure Python (no shell)
|
||||
old_path.rename(new_path)
|
||||
```
|
||||
|
||||
### 3. Extract Basename First
|
||||
```python
|
||||
# BAD: User could submit "../../../etc/passwd"
|
||||
filename = user_input
|
||||
|
||||
# GOOD: Extract just the filename part
|
||||
filename = Path(user_input).name
|
||||
```
|
||||
|
||||
### 4. Validate After Sanitization
|
||||
```python
# Ensure something remains after sanitization
if not safe_filename:
    raise ValueError("Filename contains no valid characters")
```
|
||||
|
||||
## Verification
|
||||
```python
# Test cases that should be handled safely (expected values assume POSIX paths)
assert sanitize_filename("normal.txt") == "normal.txt"
# Path(...).name keeps only the final component, so the directories are dropped
assert sanitize_filename("../../../etc/passwd") == "passwd"
assert sanitize_filename("file; rm -rf /") == "file rm -rf"
# Inner whitespace is collapsed to a single space, not removed
assert sanitize_filename(" spaces .txt") == "spaces .txt"
# Parentheses are in the allowed set; only "$" is stripped
assert sanitize_filename("$(whoami).txt") == "(whoami).txt"

# Test cases that should raise errors
for bad in ("", "$#@!"):  # empty input; no valid characters
    try:
        sanitize_filename(bad)
    except ValueError:
        pass
    else:
        raise AssertionError(f"expected ValueError for {bad!r}")
```
|
||||
|
||||
## Notes
|
||||
- This is intentionally restrictive; expand the regex if you need Unicode support
|
||||
- For Unicode filenames, consider `unicodedata.normalize('NFKD', ...)` first
|
||||
- Max length of 200 is conservative; filesystem limits vary (255 bytes typical)
|
||||
- Always preserve file extensions when renaming to avoid breaking file associations
|
||||
- Consider adding a UUID prefix for guaranteed uniqueness in upload scenarios
|
||||
|
||||
## References
|
||||
- [OWASP Path Traversal](https://owasp.org/www-community/attacks/Path_Traversal)
|
||||
- [CWE-22: Path Traversal](https://cwe.mitre.org/data/definitions/22.html)
|
||||
- [Python pathlib documentation](https://docs.python.org/3/library/pathlib.html)
|
||||
116
.claude/skills/archived/sops-age-secrets-migration/SKILL.md
Normal file
|
|
@ -0,0 +1,116 @@
|
|||
---
|
||||
name: sops-age-secrets-migration
|
||||
description: |
|
||||
Migrate from git-crypt to SOPS + age for multi-user secret management in a
|
||||
Terraform/Terragrunt infrastructure repo. Use when: (1) need per-user secret
|
||||
access control (git-crypt is all-or-nothing), (2) want operators to push PRs
|
||||
without seeing secrets (CI decrypts), (3) migrating from a single encrypted
|
||||
terraform.tfvars to structured secret management. Covers: JSON format (not YAML
|
||||
— Terraform can't parse YAML tfvars), race condition avoidance with parallel
|
||||
terragrunt applies, CI pipeline integration with Woodpecker, age key management,
|
||||
and the complete migration sequence.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-03-07
|
||||
---
|
||||
|
||||
# SOPS + age Secrets Migration from git-crypt
|
||||
|
||||
## Problem
|
||||
git-crypt encrypts entire files — anyone with the key decrypts everything. For multi-user
|
||||
setups where operators should push code without seeing secrets, you need per-value encryption
|
||||
with CI-only decryption.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Single `terraform.tfvars` encrypted with git-crypt containing 100+ secrets
|
||||
- Need to onboard operators who shouldn't see API keys, passwords, SSH keys
|
||||
- Want GitOps (secrets in git) but with access control
|
||||
- Terraform/Terragrunt stack-per-service architecture
|
||||
|
||||
## Solution
|
||||
|
||||
### 1. Use JSON, not YAML
|
||||
SOPS outputs the same format as input. `sops -d file.yaml` → YAML. `sops -d file.json` → JSON.
|
||||
Terraform natively supports `*.auto.tfvars.json` files. YAML is NOT valid HCL.
|
||||
|
||||
```
|
||||
secrets.sops.json → sops -d → secrets.auto.tfvars.json → Terraform reads it
|
||||
```
|
||||
|
||||
### 2. Split tfvars into config + secrets
|
||||
```
|
||||
config.tfvars ← plaintext (hostnames, IPs, DNS records)
|
||||
secrets.sops.json ← SOPS-encrypted (passwords, tokens, keys)
|
||||
```
|
||||
|
||||
### 3. Global decrypt, not per-stack hooks
|
||||
**CRITICAL**: Do NOT use `before_hook`/`after_hook` for decryption. With `terragrunt run --all`,
|
||||
70+ stacks run hooks in parallel, all writing to the same output file — race condition.
|
||||
|
||||
Instead, use a wrapper script that decrypts once:
|
||||
```bash
#!/usr/bin/env bash
# scripts/tg: decrypt once, then run terragrunt
set -euo pipefail
REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
ENCRYPTED="$REPO_ROOT/secrets.sops.json"
DECRYPTED="$REPO_ROOT/secrets.auto.tfvars.json"
# Re-decrypt only when the plaintext is missing or older than the encrypted file
if [ ! -f "$DECRYPTED" ] || [ "$ENCRYPTED" -nt "$DECRYPTED" ]; then
  # Write to a temp file first so a failed decrypt never leaves a truncated file
  sops -d "$ENCRYPTED" > "$DECRYPTED.tmp"
  mv "$DECRYPTED.tmp" "$DECRYPTED"
fi
exec terragrunt "$@"
```
|
||||
|
||||
### 4. Terragrunt loads both (backward compatible)
|
||||
```hcl
terraform {
  extra_arguments "common_vars" {
    commands           = get_terraform_commands_that_need_vars()
    required_var_files = ["${get_repo_root()}/config.tfvars"]
    optional_var_files = [
      "${get_repo_root()}/terraform.tfvars",         # legacy (git-crypt)
      "${get_repo_root()}/secrets.auto.tfvars.json"  # new (SOPS)
    ]
  }
  before_hook "check_secrets" {
    commands = ["apply", "plan", "destroy"]
    execute  = ["test", "-f", "${get_repo_root()}/secrets.auto.tfvars.json"]
  }
}
```
|
||||
|
||||
### 5. Complex types work in JSON
|
||||
Maps, lists, nested objects, multiline strings (SSH keys as `\n`-escaped) all work:
|
||||
```json
{
  "simple_password": "abc123",
  "mailserver_accounts": {"user@domain": "pass"},
  "ssh_key": "-----BEGIN OPENSSH PRIVATE KEY-----\nb3Blbn...\n-----END OPENSSH PRIVATE KEY-----\n"
}
```
|
||||
|
||||
### 6. CI integration (Woodpecker)
|
||||
- Store age private key as CI secret (`SOPS_AGE_KEY`)
|
||||
- Write to temp file for `SOPS_AGE_KEY_FILE` (Woodpecker `from_secret` only does env vars)
|
||||
- `git add stacks/ state/ .woodpecker/` — NEVER `git add .`
|
||||
- Cleanup step with `status: [success, failure]`
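
The key-file step in the list above might look roughly like the following in a pipeline step. This is a sketch, not the repo's actual Woodpecker config: `setup_age_key_file` is a hypothetical helper name, and it assumes `SOPS_AGE_KEY` has already been injected as an environment variable.

```bash
# Hypothetical helper: write the age private key from the environment to a
# private temp file and point SOPS at it via SOPS_AGE_KEY_FILE.
setup_age_key_file() {
  if [ -z "${SOPS_AGE_KEY:-}" ]; then
    echo "SOPS_AGE_KEY is not set" >&2
    return 1
  fi
  local key_file
  key_file="$(mktemp)" || return 1
  chmod 600 "$key_file"   # keep the key readable by the current user only
  printf '%s\n' "$SOPS_AGE_KEY" > "$key_file"
  export SOPS_AGE_KEY_FILE="$key_file"
}

# Usage in a step:
#   setup_age_key_file && sops -d secrets.sops.json > secrets.auto.tfvars.json
# Cleanup step:
#   rm -f "$SOPS_AGE_KEY_FILE" secrets.auto.tfvars.json
```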
|
||||
|
||||
## Verification
|
||||
```bash
|
||||
# Encrypt
|
||||
sops -e -i secrets.sops.json
|
||||
|
||||
# Decrypt and verify
|
||||
sops -d secrets.sops.json | jq .
|
||||
|
||||
# Verify SSH keys
|
||||
sops -d secrets.sops.json | jq -r '.ssh_key' | ssh-keygen -l -f -
|
||||
|
||||
# Test with terragrunt
|
||||
scripts/tg validate
|
||||
```
|
||||
|
||||
## Notes
|
||||
- Keep git-crypt for binary files (TLS certs, deploy keys): SOPS handles binaries only as opaque whole-file blobs, so it adds no per-value benefit there
|
||||
- `sensitive = true` on all secret variable declarations — prevents plan output leaks
|
||||
- Don't add `sensitive = true` to non-secret variables (e.g., `tls_secret_name`, `ingress_path`), even when the name contains "secret": Terraform rejects sensitive values in `for_each`
|
||||
- Age keys are one line — much simpler than GPG
|
||||
- `.sops.yaml` path_regex should be anchored: `^secrets\.sops\.json$`
|
||||
|
|
@ -0,0 +1,97 @@
|
|||
---
|
||||
name: terraform-state-identity-mismatch
|
||||
description: |
|
||||
Fix Terraform "Unexpected Identity Change" errors during plan/apply. Use when:
|
||||
(1) Terraform fails with "the Terraform Provider unexpectedly returned a different
|
||||
identity", (2) State refresh shows identity mismatch between stored and current values,
|
||||
(3) Resource was created but terraform apply timed out, leaving state inconsistent.
|
||||
Solution involves removing and reimporting the affected resource.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-01-28
|
||||
---
|
||||
|
||||
# Terraform State Identity Mismatch Fix
|
||||
|
||||
## Problem
|
||||
Terraform fails during plan or apply with an "Unexpected Identity Change" error,
|
||||
indicating the stored state identity doesn't match what the provider returns when
|
||||
reading the resource.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- Error message contains: "Unexpected Identity Change: During the read operation,
|
||||
the Terraform Provider unexpectedly returned a different identity"
|
||||
- Often occurs after a terraform apply times out mid-creation
|
||||
- Resource exists in the cluster/cloud but state is corrupted
|
||||
- Common with Kubernetes provider after deployment rollout timeouts
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Identify the affected resource
|
||||
The error message includes the resource address:
|
||||
```
|
||||
with module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume
|
||||
```
|
||||
|
||||
### Step 2: Remove from state
|
||||
```bash
|
||||
terraform state rm 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume'
|
||||
```
|
||||
Note: Use single quotes around the address to handle brackets properly.
|
||||
|
||||
### Step 3: Import the resource back
|
||||
```bash
|
||||
terraform import 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume' <namespace>/<name>
|
||||
```
|
||||
For Kubernetes deployments, the import ID is `namespace/deployment-name`.
|
||||
|
||||
### Step 4: Verify with plan
|
||||
```bash
|
||||
terraform plan -target=<module-path>
|
||||
```
|
||||
Should show minimal or no changes if import was successful.
|
||||
|
||||
### Step 5: Apply to sync any drift
|
||||
```bash
|
||||
terraform apply -target=<module-path>
|
||||
```
|
||||
|
||||
## Verification
|
||||
- `terraform plan` runs without identity errors
|
||||
- `terraform apply` completes successfully
|
||||
- Resource still exists and functions correctly
|
||||
|
||||
## Example
|
||||
**Error:**
|
||||
```
|
||||
Error: Unexpected Identity Change
|
||||
|
||||
Current Identity: cty.ObjectVal(map[string]cty.Value{"api_version":cty.NullVal...})
|
||||
New Identity: cty.ObjectVal(map[string]cty.Value{"api_version":cty.StringVal("apps/v1")...})
|
||||
|
||||
with module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume
|
||||
```
|
||||
|
||||
**Fix:**
|
||||
```bash
|
||||
terraform state rm 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume'
|
||||
# Output: Removed ... Successfully removed 1 resource instance(s).
|
||||
|
||||
terraform import 'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume' resume/resume
|
||||
# Output: Import successful!
|
||||
|
||||
terraform apply -target=module.kubernetes_cluster.module.resume -auto-approve
|
||||
# Output: Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
|
||||
```
|
||||
|
||||
## Notes
|
||||
- This is a provider bug, not user error - consider reporting to provider maintainers
|
||||
- The resource continues to work fine; only the terraform state is affected
|
||||
- Always verify the resource exists before importing (don't import non-existent resources)
|
||||
- For Kubernetes resources, import IDs are typically `namespace/name`
|
||||
- For AWS resources, import IDs vary by resource type (check provider docs)
|
||||
- Consider adding `-lock=false` if state locking causes issues during recovery
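
Steps 2 and 3, together with the "verify the resource exists" note, can be combined into one guarded helper. The following is a sketch under assumptions: `reimport_deployment` and the `KUBECTL`/`TERRAFORM` override variables are hypothetical names, and it only covers Kubernetes Deployments with the `namespace/name` import ID.

```bash
# Hypothetical helper: remove a Deployment from Terraform state and import it
# again, but only if the Deployment actually exists in the cluster.
reimport_deployment() {
  local address="$1" namespace="$2" name="$3"
  if ! "${KUBECTL:-kubectl}" get deployment "$name" -n "$namespace" >/dev/null 2>&1; then
    echo "deployment $namespace/$name not found; leaving state untouched" >&2
    return 1
  fi
  # Import runs only if the state removal succeeded
  "${TERRAFORM:-terraform}" state rm "$address" &&
    "${TERRAFORM:-terraform}" import "$address" "$namespace/$name"
}

# Usage:
#   reimport_deployment \
#     'module.kubernetes_cluster.module.resume["resume"].kubernetes_deployment.resume' \
#     resume resume
```

Checking the cluster first avoids the worst outcome of this recovery: a resource dropped from state with nothing left to import.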
|
||||
|
||||
## See Also
|
||||
- Terraform state management documentation
|
||||
- Kubernetes provider import documentation
|
||||
405
.claude/skills/archived/traefik-helm-configuration/SKILL.md
Normal file
|
|
@ -0,0 +1,405 @@
|
|||
---
|
||||
name: traefik-helm-configuration
|
||||
description: |
|
||||
Consolidated Traefik Helm chart configuration skill covering HTTP/3 (QUIC), UDP
|
||||
cross-namespace routing, and plugin download failures. Use when:
|
||||
(1) enabling HTTP/3 on Traefik or Alt-Svc header shows wrong port (e.g., 8443 instead of 443),
|
||||
(2) HTTP/3 is configured in Helm values but not working end-to-end,
|
||||
(3) Cloudflare-proxied domains need HTTP/3 enabled,
|
||||
(4) custom UDP entrypoints don't appear in the LoadBalancer Service,
|
||||
(5) IngressRouteUDP logs show "udp service is not in the parent resource namespace",
|
||||
(6) DNS or other UDP traffic through Traefik times out despite correct IngressRouteUDP config,
|
||||
(7) all Traefik routes suddenly return 404 after a restart or pod recreation,
|
||||
(8) Traefik logs show "Plugins are disabled because an error has occurred",
|
||||
(9) plugin download fails with "context deadline exceeded" for crowdsec-bouncer or rewrite-body.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-22
|
||||
---
|
||||
|
||||
# Traefik Helm Chart Configuration
|
||||
|
||||
Consolidated guide for three common Traefik Helm chart issues: HTTP/3 (QUIC) enablement,
|
||||
UDP cross-namespace routing, and plugin download failures causing global 404s.
|
||||
|
||||
---
|
||||
|
||||
## HTTP/3 (QUIC)
|
||||
|
||||
### Problem
|
||||
|
||||
You want to enable HTTP/3 (QUIC) on a Traefik ingress controller in Kubernetes so that
|
||||
clients can negotiate HTTP/3 connections via the `Alt-Svc` response header.
|
||||
|
||||
### Context / When to Use
|
||||
|
||||
- Enabling HTTP/3 for the first time on Traefik
|
||||
- Troubleshooting HTTP/3 not working despite configuration
|
||||
- Alt-Svc header shows internal container port (8443) instead of external port (443)
|
||||
- Need to enable HTTP/3 on both origin (Traefik) and CDN (Cloudflare)
|
||||
|
||||
### Solution
|
||||
|
||||
#### Step 1: Configure Traefik Helm Chart Values
|
||||
|
||||
In the Traefik Helm release values, add `http3` configuration to the `websecure` entrypoint:
|
||||
|
||||
```hcl
# In modules/kubernetes/traefik/main.tf
ports = {
  websecure = {
    port        = 8443
    exposedPort = 443
    protocol    = "TCP"
    http = {
      tls = {
        enabled = true
      }
    }
    # Enable HTTP/3 (QUIC)
    http3 = {
      enabled        = true
      advertisedPort = 443 # CRITICAL: Must match the external port
    }
  }
}
```
|
||||
|
||||
**Key gotcha: `advertisedPort = 443`**
|
||||
|
||||
Without `advertisedPort`, Traefik advertises the *internal container port* (8443) in the
|
||||
`Alt-Svc` header:
|
||||
```
|
||||
Alt-Svc: h3=":8443"; ma=2592000
|
||||
```
|
||||
|
||||
This is wrong because clients connect on external port 443, not 8443. The correct header is:
|
||||
```
|
||||
Alt-Svc: h3=":443"; ma=2592000
|
||||
```
|
||||
|
||||
Setting `advertisedPort = 443` fixes this.
|
||||
|
||||
#### Step 2: Ensure Helm Chart Fully Re-renders
|
||||
|
||||
Changing `http3.enabled=true` in values alone may not cause the Helm chart to add the
|
||||
required UDP port to the Service and Deployment specs. The Traefik Helm chart templates
|
||||
need to re-render to include `websecure-http3: 443/UDP` in the Service.
|
||||
|
||||
If the Service doesn't show a UDP port after applying:
|
||||
- See the companion skill `helm-release-force-rerender` for fixing this
|
||||
- A likely root cause is that values are reused from the previous release (`helm upgrade --reuse-values`, which the Terraform `helm_release` does only when `reuse_values = true`), which
  may not trigger template re-rendering for structural changes like adding new ports
|
||||
|
||||
After a successful apply, verify the Service has the UDP port:
|
||||
```bash
|
||||
kubectl get svc traefik -n traefik -o yaml | grep -A5 "443"
|
||||
```
|
||||
|
||||
Expected output should include both:
|
||||
```yaml
- name: websecure
  port: 443
  protocol: TCP
  targetPort: websecure
- name: websecure-http3
  port: 443
  protocol: UDP
  targetPort: websecure-http3
```
|
||||
|
||||
#### Step 3: Enable HTTP/3 on Cloudflare (if using Cloudflare proxy)
|
||||
|
||||
For Cloudflare-proxied domains, HTTP/3 must also be enabled at the Cloudflare zone level.
|
||||
|
||||
**Cloudflare Provider v4** (current in this repo):
|
||||
```hcl
resource "cloudflare_zone_settings_override" "http3" {
  zone_id = var.cloudflare_zone_id

  settings {
    http3 = "on" # String values: "on" or "off"
  }
}
```
|
||||
|
||||
**Note**: In Cloudflare provider v5, this uses `cloudflare_zone_setting` (singular) with
|
||||
different syntax. The v4 resource is `cloudflare_zone_settings_override` (plural + override).
|
||||
|
||||
#### Step 4: Verify End-to-End
|
||||
|
||||
##### Testing from macOS
|
||||
|
||||
macOS system curl does NOT support HTTP/3. Install curl with HTTP/3:
|
||||
```bash
|
||||
brew install curl
|
||||
```
|
||||
|
||||
Then use the Homebrew version explicitly:
|
||||
```bash
|
||||
# Test HTTP/3 negotiation (Alt-Svc header)
|
||||
/opt/homebrew/opt/curl/bin/curl -sI https://example.viktorbarzin.me 2>&1 | grep -i alt-svc
|
||||
# Expected: alt-svc: h3=":443"; ma=2592000
|
||||
|
||||
# Test actual HTTP/3 connection
|
||||
/opt/homebrew/opt/curl/bin/curl --http3-only -sI https://example.viktorbarzin.me
|
||||
# Expected: HTTP/3 200
|
||||
```
|
||||
|
||||
##### Testing from within the Cluster
|
||||
|
||||
```bash
|
||||
# Use a curl image with HTTP/3 support (amd64 only)
|
||||
kubectl run curl-h3 --rm -it --image=ymuski/curl-http3 --restart=Never -- \
|
||||
curl --http3-only -sI https://example.viktorbarzin.me
|
||||
|
||||
# Note: ymuski/curl-http3 is amd64-only; it will fail on arm64 nodes
|
||||
```
|
||||
|
||||
##### Checking Traefik Logs
|
||||
|
||||
```bash
|
||||
kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=100 | grep -i quic
|
||||
```
|
||||
|
||||
### Verification Checklist
|
||||
|
||||
1. Traefik Service shows UDP port 443 (`websecure-http3`)
|
||||
2. `Alt-Svc` response header shows `h3=":443"` (not `h3=":8443"`)
|
||||
3. `/opt/homebrew/opt/curl/bin/curl --http3-only` successfully connects
|
||||
4. Cloudflare zone has HTTP/3 enabled (for proxied domains)
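
Item 2 of the checklist can be scripted by pulling the advertised port out of the response headers. A minimal sketch, assuming headers arrive on stdin (for example from `curl -sI`); the function name `alt_svc_h3_port` is hypothetical.

```bash
# Hypothetical helper: read HTTP response headers on stdin and print the port
# advertised for h3 in the Alt-Svc header (empty output if there is none).
alt_svc_h3_port() {
  sed -n 's/^[Aa]lt-[Ss]vc:.*h3=":\([0-9][0-9]*\)".*/\1/p' | head -n 1
}

# Usage:
#   port=$(curl -sI https://example.viktorbarzin.me | alt_svc_h3_port)
#   [ "$port" = "443" ] || echo "unexpected h3 port: ${port:-none}" >&2
```

The pattern uses basic regular expressions only, so it works with both BSD and GNU `sed`.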
|
||||
|
||||
### Current Configuration (This Repo)
|
||||
|
||||
- **Traefik config**: `modules/kubernetes/traefik/main.tf` (lines 89-92)
|
||||
- **Cloudflare HTTP/3**: `modules/kubernetes/cloudflared/cloudflare.tf` (line 153)
|
||||
- **MetalLB IP**: 10.0.20.202 (Traefik LoadBalancer service)
|
||||
|
||||
### Notes
|
||||
|
||||
- HTTP/3 uses QUIC over UDP. Firewalls must allow UDP 443 inbound.
|
||||
- Traefik automatically handles TLS for HTTP/3 using the same certs as HTTPS.
|
||||
- The `Alt-Svc` header is sent on HTTP/1.1 and HTTP/2 responses to tell clients HTTP/3 is available.
  Clients then switch to HTTP/3 on subsequent requests.
|
||||
- For non-Cloudflare (direct DNS) domains, only the Traefik-side config is needed.
|
||||
- Cloudflare handles its own HTTP/3 negotiation with end users; the origin connection
|
||||
between Cloudflare and Traefik uses HTTP/1.1 or HTTP/2 (not HTTP/3).
|
||||
|
||||
---
|
||||
|
||||
## UDP Cross-Namespace Routing
|
||||
|
||||
### Problem
|
||||
|
||||
Adding a custom UDP entrypoint (e.g., DNS on port 53) to Traefik v3 via Helm chart values
|
||||
doesn't work out of the box. Traffic times out even though the Traefik pod listens on the
|
||||
port internally. Two separate issues compound:
|
||||
|
||||
1. The Helm chart defaults `expose` to `false` for custom entrypoints -- the port is never
|
||||
added to the LoadBalancer Service
|
||||
2. `allowCrossNamespace` defaults to `false` -- IngressRouteUDP in namespace A can't
|
||||
reference a Service in namespace B
|
||||
|
||||
### Context / Trigger Conditions
|
||||
|
||||
- Traefik Helm chart v39.0.0+ (Traefik v3.x)
|
||||
- Custom UDP entrypoint defined in `ports` values
|
||||
- `IngressRouteUDP` referencing a service in a different namespace
|
||||
- Symptoms:
|
||||
- `kubectl get svc traefik` doesn't show your custom UDP port
|
||||
- UDP traffic to the LoadBalancer IP times out
|
||||
- Traefik logs show: `"udp service <namespace>/<service> is not in the parent resource namespace <traefik-namespace>"`
|
||||
- `netstat -ulnp` inside Traefik pod confirms it IS listening on the port
|
||||
|
||||
### Solution
|
||||
|
||||
#### Fix 1: Expose the UDP port on the Service
|
||||
|
||||
In the Helm values, add `expose = { default = true }` to the entrypoint:
|
||||
|
||||
```hcl
# Terraform HCL
ports = {
  dns-udp = {
    port        = 5353
    exposedPort = 53
    protocol    = "UDP"
    expose      = { default = true } # <-- Required for custom entrypoints
  }
}
```
|
||||
|
||||
```yaml
# Helm values YAML equivalent
ports:
  dns-udp:
    port: 5353
    exposedPort: 53
    protocol: UDP
    expose:
      default: true
```
|
||||
|
||||
Note: The built-in `web` and `websecure` entrypoints have `expose.default = true` by
|
||||
default, but custom entrypoints do NOT.
|
||||
|
||||
#### Fix 2: Enable cross-namespace CRD references
|
||||
|
||||
In the Helm values, add `allowCrossNamespace = true` to the kubernetesCRD provider:
|
||||
|
||||
```hcl
# Terraform HCL
providers = {
  kubernetesCRD = {
    enabled             = true
    allowCrossNamespace = true # <-- Required for cross-namespace IngressRouteUDP
  }
}
```
|
||||
|
||||
```yaml
# Helm values YAML
providers:
  kubernetesCRD:
    enabled: true
    allowCrossNamespace: true
```
|
||||
|
||||
This is required whenever an `IngressRouteUDP` (or `IngressRouteTCP`, `IngressRoute`)
|
||||
references a Kubernetes Service in a different namespace.
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# 1. Verify the port appears in the Service
|
||||
kubectl get svc -n traefik traefik -o jsonpath='{.spec.ports[*].name}'
|
||||
# Should include your custom entrypoint name (e.g., "dns-udp")
|
||||
|
||||
# 2. Check Traefik logs for cross-namespace errors
|
||||
kubectl logs -n traefik -l app.kubernetes.io/name=traefik | grep "not in the parent resource namespace"
|
||||
# Should return nothing after the fix
|
||||
|
||||
# 3. Test the UDP service
|
||||
dig @<traefik-lb-ip> example.com
|
||||
```
|
||||
|
||||
### Example
|
||||
|
||||
DNS forwarding through Traefik to Technitium DNS:
|
||||
- IngressRouteUDP in `traefik` namespace routes `dns-udp` entrypoint to
|
||||
`technitium-dns:53` in `technitium` namespace
|
||||
- Without Fix 1: port 53 never exposed on LoadBalancer -- traffic can't reach Traefik
|
||||
- Without Fix 2: Traefik rejects the route -- logs error every ~60 seconds
|
||||
- With both fixes: DNS queries to LoadBalancer IP:53 -> Traefik -> Technitium
|
||||
|
||||
### Notes
|
||||
|
||||
1. **Debugging order matters**: Fix 1 (expose) must come first. Without the port on the
|
||||
Service, you can't even test if the routing works. Fix 2 (cross-namespace) errors only
|
||||
appear in Traefik logs, not as user-visible failures.
|
||||
2. **`allowCrossNamespace` is a security consideration**: It allows any IngressRoute CRD
   to reference services in any namespace. If this is too broad, consider using a
   `TraefikService` resource or moving the IngressRouteUDP to the target namespace.
|
||||
3. **Rolling update**: Changing `allowCrossNamespace` triggers a Traefik pod restart
|
||||
(new CLI args). Changing `expose` only updates the Service (no pod restart needed).
|
||||
4. **This applies to TCP too**: `IngressRouteTCP` with cross-namespace services needs the
|
||||
same `allowCrossNamespace` setting.
|
||||
|
||||
---
|
||||
|
||||
## Plugin Download Failure (Global 404)
|
||||
|
||||
### Problem
|
||||
|
||||
After a node maintenance operation (containerd restart, node drain/uncordon, etc.),
|
||||
all Traefik-managed routes return 404. Services, Ingresses, and Middlewares all exist
|
||||
and look correct, making this extremely confusing to debug.
|
||||
|
||||
### Context / Trigger Conditions
|
||||
|
||||
- ALL Traefik routes return 404 simultaneously (not just one service)
|
||||
- Traefik pods are Running and Ready
|
||||
- Ingress resources exist with correct annotations
|
||||
- Middlewares exist in the correct namespaces
|
||||
- TLS secrets exist
|
||||
- Traefik startup logs contain: `Plugins are disabled because an error has occurred`
|
||||
- Plugin download error: `unable to download plugin ... context deadline exceeded`
|
||||
- Happened after a node restart, containerd restart, or network disruption
|
||||
|
||||
### Root Cause
|
||||
|
||||
Traefik downloads plugins (crowdsec-bouncer, rewrite-body, etc.) from
|
||||
`plugins.traefik.io` on **every pod startup**. If the download fails (network
|
||||
unreachable, DNS not ready, timeout), Traefik **disables ALL plugins entirely**.
|
||||
|
||||
Since the `crowdsec` middleware is a plugin-based middleware referenced in virtually
|
||||
every Ingress annotation (`traefik-crowdsec@kubernetescrd`), Traefik treats the
|
||||
missing plugin middleware as a fatal routing error and returns 404 for every route
|
||||
that references it -- which is typically all of them.
|
||||
|
||||
### Solution
|
||||
|
||||
```bash
|
||||
# 1. Confirm the diagnosis - check Traefik startup logs
|
||||
kubectl logs -n traefik -l app.kubernetes.io/name=traefik | head -20
|
||||
# Look for: "Plugins are disabled because an error has occurred"
|
||||
|
||||
# 2. Verify outbound connectivity is restored
|
||||
kubectl exec -n traefik $(kubectl get pods -n traefik -l app.kubernetes.io/name=traefik \
|
||||
-o jsonpath='{.items[0].metadata.name}') -- wget -q -O- --timeout=5 https://plugins.traefik.io
|
||||
|
||||
# 3. Rollout restart to retry plugin download
|
||||
kubectl rollout restart deployment -n traefik traefik
|
||||
|
||||
# 4. Verify plugins loaded
|
||||
kubectl logs -n traefik -l app.kubernetes.io/name=traefik | grep "Plugins"
|
||||
# Should show: "Plugins loaded."
|
||||
|
||||
# 5. Verify routes work
|
||||
curl -s -o /dev/null -w "%{http_code}" -H "Host: viktorbarzin.me" https://10.0.20.202 -k
|
||||
# Should return 200 instead of 404
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
- Traefik logs show `Plugins loaded.` (not `Plugins are disabled`)
|
||||
- Routes return expected HTTP status codes (200, 302, etc.) instead of 404
|
||||
- `kubectl logs -n traefik <pod> | grep "does not exist"` shows no middleware errors
|
||||
|
||||
### Why This Is Hard to Debug
|
||||
|
||||
1. **Traefik pods show Running/Ready** -- health checks pass even without plugins
|
||||
2. **All Kubernetes resources look correct** -- Ingresses, Services, Middlewares all exist
|
||||
3. **The error is in startup logs only** -- not in per-request logs (requests just get 404)
|
||||
4. **The 404 is Traefik's default** -- same as "no route matched", not a backend error
|
||||
5. **The middleware error is logged once at startup** -- easy to miss in a stream of logs
|
||||
|
||||
### Prevention
|
||||
|
||||
- During planned maintenance (node drain, containerd restart), restart Traefik pods
|
||||
AFTER network connectivity is confirmed restored
|
||||
- Consider pre-caching Traefik plugins in the container image or using an init container
|
||||
- Monitor for the `Plugins are disabled` log message in your alerting system
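
The log check in the last bullet can be sketched as a small shell function. This is a minimal sketch that assumes only the two log strings quoted in this section; `plugin_status` is a hypothetical helper name and is not part of Traefik or kubectl.

```bash
# Classify Traefik startup logs read from stdin.
# Prints "disabled", "loaded", or "unknown".
# plugin_status is a hypothetical helper; the matched strings are the
# two log messages quoted in this section.
plugin_status() {
  logs=$(cat)
  case "$logs" in
    *"Plugins are disabled"*) echo disabled ;;
    *"Plugins loaded"*)       echo loaded ;;
    *)                        echo unknown ;;
  esac
}
```

It could be fed from the same command used in the Solution steps, for example `kubectl logs -n traefik -l app.kubernetes.io/name=traefik | plugin_status`. The "disabled" pattern is tested first so that a log containing both messages (old and new pod) is still flagged.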
|
||||
|
||||
### Notes
|
||||
|
||||
- This affects ALL plugin-based middlewares, not just crowdsec
|
||||
- The `rewrite-body` plugin (used for rybbit analytics injection) is also affected
|
||||
- Traefik v3.x downloads plugins on every startup; there is no persistent cache
|
||||
- If only some routes return 404, the problem is likely different (missing middleware
|
||||
or TLS secret, not a plugin issue)
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- [Traefik HTTP/3 Documentation](https://doc.traefik.io/traefik/routing/entrypoints/#http3)
|
||||
- [Traefik Helm Chart Values](https://github.com/traefik/traefik-helm-chart/blob/master/traefik/values.yaml)
|
||||
- [Cloudflare HTTP/3 Settings](https://developers.cloudflare.com/speed/optimization/protocol/http3/)
|
||||
- [Traefik Helm Chart Ports Configuration](https://github.com/traefik/traefik-helm-chart)
|
||||
- [Traefik v3 Providers Documentation](https://doc.traefik.io/traefik/providers/kubernetes-crd/)
|
||||
|
||||
## See Also
|
||||
|
||||
- `traefik-rewrite-body-troubleshooting` -- Traefik rewrite-body plugin troubleshooting (compression, Accept header issues)
|
||||
- `helm-release-force-rerender` -- Force Helm chart re-render when structural changes don't take effect
|
||||
|
|
@ -0,0 +1,200 @@
|
|||
---
|
||||
name: traefik-rewrite-body-troubleshooting
|
||||
description: |
|
||||
Troubleshooting guide for the Traefik rewrite-body plugin (packruler/rewrite-body).
|
||||
Covers two failure modes: (1) Compression failure — plugin logs "flate: corrupt input
|
||||
before offset 5" when backends send gzip-compressed responses, corrupting response
|
||||
bodies and breaking WebSocket connections, authentication flows, and mobile app
|
||||
connectivity. (2) Silent skip — plugin silently skips content injection (rybbit
|
||||
analytics, trap links, or any HTML rewriting) when the request Accept header doesn't
|
||||
contain "text/html" (e.g., curl's default Accept: */*), making it appear broken
|
||||
despite correct configuration.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-22
|
||||
---
|
||||
|
||||
# Traefik Rewrite-Body Plugin Troubleshooting
|
||||
|
||||
Two distinct failure modes for the `packruler/rewrite-body` Traefik plugin used for
|
||||
injecting analytics scripts (rybbit) and anti-AI trap links into HTML responses.
|
||||
|
||||
---
|
||||
|
||||
## Problem 1: Compression Failure
|
||||
|
||||
### Symptoms
|
||||
- Traefik logs show: `Rewrite-Body | ERROR ... Error loading content: flate: corrupt input before offset 5`
|
||||
- Mobile apps (e.g., Home Assistant Companion) fail while browser works
|
||||
- HA Companion app shows repeated `GET /?external_auth=1` requests (auth loop)
|
||||
- WebSocket connections (`/api/websocket`) are very short-lived (seconds instead of minutes)
|
||||
- HTTP 499 errors on API calls (client disconnects due to corrupted responses)
|
||||
- Using `packruler/rewrite-body` plugin v1.2.0 with `monitoring.types = ["text/html"]`
|
||||
|
||||
### Root Cause
|
||||
Despite the `monitoring.types = ["text/html"]` filter, the plugin attempts to decompress
|
||||
ALL responses before checking content type. When decompression fails on certain gzip
|
||||
encodings, it corrupts the response body, breaking:
|
||||
- WebSocket upgrade handshakes
|
||||
- Authentication flows (HA Companion app's `external_auth` callback)
|
||||
- Mobile app connectivity (while browser appears to work due to auto-reconnect)
|
||||
|
||||
### Misleading Symptoms
|
||||
- HTTP/3 (QUIC) may appear to be the cause because HTTP/3 requests show 499 errors.
|
||||
This is a red herring -- the rewrite-body plugin corruption affects all protocols.
|
||||
- WebSocket issues may look like a timeout or proxy configuration problem.
|
||||
- The `monitoring.types = ["text/html"]` config suggests the plugin should only touch
|
||||
HTML, but it still processes all responses for decompression before filtering.
|
||||
|
||||
### Solution
|
||||
|
||||
#### Step 1: Create a strip-accept-encoding middleware
|
||||
Add a Traefik middleware that removes `Accept-Encoding` from requests, forcing
|
||||
backends to send uncompressed responses that the plugin can safely process:
|
||||
|
||||
```hcl
|
||||
# In traefik/middleware.tf
|
||||
resource "kubernetes_manifest" "middleware_strip_accept_encoding" {
|
||||
manifest = {
|
||||
apiVersion = "traefik.io/v1alpha1"
|
||||
kind = "Middleware"
|
||||
metadata = {
|
||||
name = "strip-accept-encoding"
|
||||
namespace = kubernetes_namespace.traefik.metadata[0].name
|
||||
}
|
||||
spec = {
|
||||
headers = {
|
||||
customRequestHeaders = {
|
||||
"Accept-Encoding" = ""
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
depends_on = [helm_release.traefik]
|
||||
}
|
||||
```
|
||||
|
||||
#### Step 2: Add middleware to routes with rewrite-body
|
||||
In the ingress factory middleware chain, add `strip-accept-encoding` BEFORE the
|
||||
rewrite-body middleware:
|
||||
|
||||
```hcl
|
||||
var.rybbit_site_id != null ? "traefik-strip-accept-encoding@kubernetescrd" : null,
|
||||
var.rybbit_site_id != null ? "${var.namespace}-rybbit-analytics-${var.name}@kubernetescrd" : null,
|
||||
```
|
||||
|
||||
The order matters: strip-accept-encoding must come first so the request reaches
|
||||
the backend without Accept-Encoding, and the uncompressed response then passes
|
||||
through the rewrite-body plugin.
|
||||
|
||||
### Verification (Compression Fix)
|
||||
1. Check Traefik logs for absence of `flate: corrupt input` errors:
|
||||
```bash
|
||||
kubectl logs -n traefik -l app.kubernetes.io/name=traefik --tail=200 | grep -i "flate\|rewrite-body"
|
||||
```
|
||||
2. Verify the middleware chain includes strip-accept-encoding before rybbit:
|
||||
```bash
|
||||
kubectl get ingress -n <namespace> <name> -o jsonpath='{.metadata.annotations.traefik\.ingress\.kubernetes\.io/router\.middlewares}'
|
||||
```
|
||||
3. Test mobile app connectivity (HA Companion, etc.)
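
The ordering check in step 2 can be automated on the annotation value. The following is a minimal sketch, assuming the annotation is a comma-separated list and that the middleware names contain `strip-accept-encoding` and `rybbit-analytics` as in the HCL shown earlier; `strip_before_rewrite` is a hypothetical helper name.

```bash
# Check that strip-accept-encoding precedes the rybbit rewrite-body
# middleware in a comma-separated router.middlewares value.
# Returns 0 if the order is correct, 1 if rybbit comes first,
# 2 if no rybbit middleware is present in the chain.
# Assumes middleware names contain no shell glob characters.
strip_before_rewrite() {
  seen_strip=0
  old_ifs=$IFS
  IFS=,
  for mw in $1; do
    case "$mw" in
      *strip-accept-encoding*) seen_strip=1 ;;
      *rybbit-analytics*)
        IFS=$old_ifs
        if [ "$seen_strip" -eq 1 ]; then return 0; else return 1; fi ;;
    esac
  done
  IFS=$old_ifs
  return 2
}
```

Pass it the output of the `kubectl get ingress ... -o jsonpath=...` command from step 2 as a single quoted argument.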
|
||||
|
||||
### Notes (Compression)
|
||||
- This affects ALL services using the rewrite-body plugin, not just HA
|
||||
- The fix is applied conditionally: `strip-accept-encoding` is only added to the
|
||||
middleware chain when `rybbit_site_id` is set, so services without analytics
|
||||
are unaffected
|
||||
- Both `ingress_factory` and `reverse_proxy/factory` modules need the fix
|
||||
- Traefik may still compress responses to clients via its own compression middleware;
|
||||
the strip only affects the backend request
|
||||
- The plugin's `monitoring.types` filter works for deciding what to rewrite, but
|
||||
decompression is attempted on all responses regardless
|
||||
|
||||
---
|
||||
|
||||
## Problem 2: Silent Skip (Accept Header Mismatch)
|
||||
|
||||
### Symptoms
|
||||
- rewrite-body middleware is in the ingress middleware chain and shows status "enabled" in Traefik API
|
||||
- `curl https://example.com/` returns original HTML with no injected content
|
||||
- Browser shows injected content (rybbit script, trap links, etc.)
|
||||
- No errors in Traefik logs -- the plugin silently skips processing
|
||||
- `monitoring.types = ["text/html"]` is configured in the middleware spec
|
||||
- Middleware chain order is correct (strip-accept-encoding before rewrite-body)
|
||||
|
||||
### Root Cause
|
||||
In the plugin source code, `SupportsProcessing()` checks the **request** `Accept`
|
||||
header (not the response `Content-Type`) against `monitoring.types`:
|
||||
|
||||
```go
|
||||
func (r *Rewriter) SupportsProcessing(req *http.Request) bool {
|
||||
accept := req.Header.Get("Accept")
|
||||
for _, monitoringType := range r.monitoring.Types {
|
||||
if strings.Contains(accept, monitoringType) {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
||||
```
|
||||
|
||||
It uses `strings.Contains(accept, "text/html")`. The curl default `Accept: */*` does
|
||||
NOT contain the substring `text/html`, so the plugin returns false and skips all
|
||||
processing. Browser requests include `Accept: text/html,application/xhtml+xml,...`
|
||||
which does match.
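
The same substring test can be reproduced in shell to predict whether a given client will trigger the plugin. This is an illustrative restatement of the check described above, not the plugin's code; `accept_matches` is a hypothetical helper name.

```bash
# Return 0 if the Accept header value ($1) contains the configured
# monitoring type ($2) as a plain substring, mirroring the
# strings.Contains check described above. Hypothetical helper.
accept_matches() {
  case "$1" in
    *"$2"*) return 0 ;;
    *)      return 1 ;;
  esac
}
```

For example, `accept_matches '*/*' 'text/html'` fails, while `accept_matches 'text/html,application/xhtml+xml' 'text/html'` succeeds. The quoted `"$2"` inside the `case` pattern is matched literally, so the `*` and `/` characters in the header are not treated as wildcards.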
|
||||
|
||||
### Misleading Symptoms
|
||||
- Appears as if the middleware isn't working at all
|
||||
- May look like a middleware ordering issue or configuration error
|
||||
- `kubectl get middleware` shows the resource exists with correct spec
|
||||
- Traefik API (`/api/http/middlewares/`) shows the middleware as "enabled"
|
||||
- Checking the rewrite-body regex patterns seems pointless since nothing is being processed
|
||||
|
||||
### Solution
|
||||
This is **working as designed** -- not a bug. The fix depends on context:
|
||||
|
||||
#### For testing with curl
|
||||
Add the `Accept` header to simulate a browser:
|
||||
```bash
|
||||
curl -s -H "Accept: text/html,application/xhtml+xml" https://example.com/
|
||||
```
|
||||
|
||||
#### For verifying injection is working
|
||||
```bash
|
||||
# Check for injected content (trap links, analytics, etc.)
|
||||
curl -s -H "Accept: text/html,application/xhtml+xml" https://example.com/ \
|
||||
| grep -oE 'href="https://poison[^"]*"'
|
||||
|
||||
# Check for rybbit analytics
|
||||
curl -s -H "Accept: text/html,application/xhtml+xml" https://example.com/ \
|
||||
| grep -oE 'src="https://rybbit[^"]*"'
|
||||
```
|
||||
|
||||
#### For programmatic clients that need injection
|
||||
If a non-browser client needs to receive injected content, ensure it sends
|
||||
`Accept: text/html` in its request headers.
|
||||
|
||||
### Verification (Accept Header)
|
||||
```bash
|
||||
# Without Accept header -- no injection (expected)
|
||||
curl -s https://example.com/ | grep -c "rybbit"
|
||||
# Output: 0
|
||||
|
||||
# With Accept header -- injection works
|
||||
curl -s -H "Accept: text/html" https://example.com/ | grep -c "rybbit"
|
||||
# Output: 1
|
||||
```
|
||||
|
||||
### Notes (Accept Header)
|
||||
- This behavior is independent of the compression issue (Problem 1 above)
|
||||
- The check is on the **request** `Accept` header, not the **response** `Content-Type`
|
||||
- `Accept: */*` does NOT match -- `strings.Contains("*/*", "text/html")` is false
|
||||
- Real AI scrapers typically send browser-like Accept headers, so trap links will be
|
||||
injected for them correctly
|
||||
- API calls (which typically send `Accept: application/json`) are correctly skipped
|
||||
|
||||
---
|
||||
|
||||
## See Also
|
||||
- `traefik-helm-configuration` -- Traefik Helm chart configuration and entrypoints
|
||||
- `ingress-factory-migration` -- Covers the ingress factory module that creates
|
||||
rybbit analytics middlewares
|
||||
454
.claude/skills/cluster-health/SKILL.md
Normal file
|
|
@ -0,0 +1,454 @@
|
|||
---
|
||||
name: cluster-health
|
||||
description: |
|
||||
Check Kubernetes cluster health and fix common issues. Use when:
|
||||
(1) User asks to check the cluster, check health, or "what's wrong",
|
||||
(2) User asks about pod status, node health, or deployment issues,
|
||||
(3) User asks to fix stuck pods, evicted pods, or CrashLoopBackOff,
|
||||
(4) User mentions "health check", "cluster status", "cluster health",
|
||||
(5) User asks "is everything running" or "any problems".
|
||||
Runs 47 cluster-wide checks (nodes, workloads, monitoring, certs,
|
||||
backups, external reachability, PVE host thermals + load, HA Sofia
|
||||
status dashboard, Immich smart-search, Proxmox CSI ghost-disk drift)
|
||||
with safe auto-fix for evicted pods.
|
||||
author: Claude Code
|
||||
version: 2.0.0
|
||||
date: 2026-04-19
|
||||
---
|
||||
|
||||
# Cluster Health Check
|
||||
|
||||
## MANDATORY: Run the script first
|
||||
|
||||
When this skill is invoked, your **first action** must be to run the
|
||||
cluster health check script and reason over its output before doing
|
||||
anything else. Do not improvise individual `kubectl` calls — the
|
||||
script is the authoritative surface.
|
||||
|
||||
```bash
|
||||
cd /home/wizard/code
|
||||
bash infra/scripts/cluster_healthcheck.sh --json | tee /tmp/cluster-health.json
|
||||
```
|
||||
|
||||
If the session is rooted elsewhere, fall back to the absolute path:
|
||||
|
||||
```bash
|
||||
bash /home/wizard/code/infra/scripts/cluster_healthcheck.sh --json
|
||||
```
|
||||
|
||||
Then:
|
||||
|
||||
1. Parse the JSON. Report the PASS/WARN/FAIL counts + overall verdict.
|
||||
2. Iterate every FAIL and WARN check, describe what tripped, and propose
|
||||
the remediation path (use the recipes below).
|
||||
3. Only reach for ad-hoc `kubectl` commands when investigating a
|
||||
specific failure beyond what the script reported.
|
||||
|
||||
Exit codes: `0` = healthy, `1` = warnings only, `2` = failures.
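
The exit-code contract can be turned into a verdict string for reporting. This is a minimal sketch; `verdict_for_exit` is a hypothetical helper name, and only the three codes listed above are assumed to be meaningful.

```bash
# Map the health check script's exit code to a verdict string.
# Codes 0/1/2 are the ones documented above; anything else is
# reported as unexpected. verdict_for_exit is a hypothetical helper.
verdict_for_exit() {
  case "$1" in
    0) echo healthy ;;
    1) echo warnings ;;
    2) echo failures ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}
```

Call it as `verdict_for_exit $?` immediately after running the script. When the script is piped through `tee`, as in the command above, `$?` reports the exit status of `tee` unless `set -o pipefail` is enabled in bash, so capture the code with pipefail on or redirect to a file instead.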
|
||||
|
||||
## Quick flags
|
||||
|
||||
```bash
|
||||
# Human-readable report (default), no auto-fix
|
||||
bash infra/scripts/cluster_healthcheck.sh
|
||||
|
||||
# Machine-readable JSON summary
|
||||
bash infra/scripts/cluster_healthcheck.sh --json
|
||||
|
||||
# Only show WARN + FAIL (suppress PASS noise)
|
||||
bash infra/scripts/cluster_healthcheck.sh --quiet
|
||||
|
||||
# Enable auto-fix (delete evicted pods, kick stuck CrashLoop pods)
|
||||
bash infra/scripts/cluster_healthcheck.sh --fix
|
||||
|
||||
# Combined: quiet JSON without auto-fix
|
||||
bash infra/scripts/cluster_healthcheck.sh --no-fix --quiet --json
|
||||
|
||||
# Custom kubeconfig
|
||||
bash infra/scripts/cluster_healthcheck.sh --kubeconfig /path/to/config
|
||||
```
|
||||
|
||||
## What It Checks (47 checks)
|
||||
|
||||
| # | Check | Notes |
|
||||
|---|-------|-------|
|
||||
| 1 | Node Status | NotReady nodes, version drift |
|
||||
| 2 | Node Resources | CPU/mem >80% (warn) / >90% (fail) |
|
||||
| 3 | Node Conditions | MemoryPressure / DiskPressure / PIDPressure |
|
||||
| 4 | Problematic Pods | CrashLoopBackOff / Error / ImagePullBackOff |
|
||||
| 5 | Evicted/Failed Pods | `status.phase=Failed` |
|
||||
| 6 | DaemonSets | desired == ready |
|
||||
| 7 | Deployments | ready == desired replicas |
|
||||
| 8 | PVC Status | all Bound |
|
||||
| 9 | HPA Health | targets not `<unknown>`, utilization <100% |
|
||||
| 10 | CronJob Failures | job conditions `Failed=True` in last 24h |
|
||||
| 11 | CrowdSec Agents | all pods Running |
|
||||
| 12 | Ingress Routes | every ingress has an LB IP + Traefik LB |
|
||||
| 13 | Prometheus Alerts | count of firing alerts |
|
||||
| 14 | Uptime Kuma Monitors | internal + external monitors up |
|
||||
| 15 | ResourceQuota Pressure | any quota >80% used |
|
||||
| 16 | StatefulSets | ready == desired |
|
||||
| 17 | Node Disk Usage | ephemeral-storage <80% |
|
||||
| 18 | Helm Release Health | all `deployed` (no `pending-*`) |
|
||||
| 19 | Kyverno Policy Engine | all pods Running |
|
||||
| 20 | NFS Connectivity | 192.168.1.127 showmount / port 2049 |
|
||||
| 21 | DNS Resolution | Technitium resolves internal + external |
|
||||
| 22 | TLS Certificate Expiry | TLS `Secret` certs >30d valid |
|
||||
| 23 | GPU Health | nvidia namespace + device-plugin Running |
|
||||
| 24 | Cloudflare Tunnel | pods Running |
|
||||
| 25 | Resource Usage | node CPU/mem headroom |
|
||||
| 26 | HA Sofia — Entity Availability | Home Assistant unavailable/unknown count |
|
||||
| 27 | HA Sofia — Integration Health | config entries setup_error / not_loaded |
|
||||
| 28 | HA Sofia — Automation Status | disabled / stale (>30d) automations |
|
||||
| 29 | HA Sofia — System Resources | HA CPU / mem / disk |
|
||||
| 30 | Hardware Exporters | snmp / idrac-redfish / proxmox / tuya pods + scrapes |
|
||||
| 31 | cert-manager — Certificate Readiness | Certificate CRs with `Ready!=True` |
|
||||
| 32 | cert-manager — Certificate Expiry (<14d) | notAfter within 14d |
|
||||
| 33 | cert-manager — Failed CertificateRequests | `Ready=False, reason=Failed` |
|
||||
| 34 | Backup Freshness — Per-DB Dumps | MySQL + PG dumps within 25h |
|
||||
| 35 | Backup Freshness — Offsite Sync | Pushgateway `backup_last_success_timestamp` <27h |
|
||||
| 36 | Backup Freshness — LVM PVC Snapshots | newest thin snapshot <25h (SSH PVE) |
|
||||
| 37 | Monitoring — Prometheus + Alertmanager | `/-/ready` + AM pods Running |
|
||||
| 38 | Monitoring — Vault Sealed Status | `vault status` reports `Sealed: false` |
|
||||
| 39 | Monitoring — ClusterSecretStore Ready | `vault-kv` + `vault-database` Ready |
|
||||
| 40 | External — Cloudflared + Authentik Replicas | deployments fully ready |
|
||||
| 41 | External — ExternalAccessDivergence Alert | alert not firing |
|
||||
| 42 | External — Traefik 5xx Rate (15m) | top-10 services emitting 5xx |
|
||||
| 43 | PVE Host Thermals | package + per-core temps via `/sys/class/hwmon` (SSH). Baseline 55-65 °C. PASS <65 °C, WARN 65-82 °C (a VM is burning too much CPU), FAIL ≥83 °C (TjMax) |
|
||||
| 44 | PVE Host Load | `/proc/loadavg` via SSH. PASS 5m <30, WARN 30-37, FAIL ≥38 of 44 threads |
|
||||
| 45 | HA Sofia — Status Dashboard | emo's curated Барзини → Статус view (`dashboard-barzini` / path `status`). Pulls the lovelace config via WS, batch-renders every `custom:mushroom-template-card` secondary template against `/api/template`, classifies each rendered line: FAIL on `Offline` / `Disconnected` / `Разкачен` / `— No data`; WARN on `⚠️` / `Abnormal` / `Trouble (` / `(ниска)` / `Пълен резервоар` / `Грешка` / `attention` / `Внимание`. Verdict rolls up across the 8 sections (Сигурност, Мрежа & IT, Енергия, Климат, Уреди, Мултимедия, Осветление, Поливна) |
|
||||
| 46 | Immich Smart Search | `clip_index` residency in PG `shared_buffers` + representative ANN probe latency (in immich-postgresql). FAIL >1.5s or <50% resident; WARN >0.5s or <90% resident. Cold cache → check `clip-index-prewarm` CronJob |
|
||||
| 47 | Proxmox CSI — Ghost-Disk Drift | Per node, compares real virtio-scsi CSI disks in `qm config <vmid>` (SSH PVE) vs attached proxmox-CSI VolumeAttachments k8s tracks. Catches orphaned "ghost" disks left by failed detaches (`query-pci` QMP timeouts) that the scheduler's 28-LUN guard can't see. PASS reconciled; WARN drift>0 or real 20-24; FAIL real ≥25 (near LUN cap → imminent wedge). Cleanup: detach ghosts via `qm set <vmid> --delete scsiN` (frees slot, retains LV) |
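
The thresholds in rows 43 and 44 can be sketched as two classifier functions. This is a sketch under stated assumptions, not the logic of `cluster_healthcheck.sh`: the function names are hypothetical, temperatures are whole degrees Celsius, and the load average is truncated to its integer part before comparison (the table does not say how fractional values between 37 and 38 are handled).

```bash
# Classify a package temperature in whole degrees Celsius (row 43):
# PASS below 65, WARN 65-82, FAIL at 83 and above.
classify_temp_c() {
  if [ "$1" -ge 83 ]; then echo FAIL
  elif [ "$1" -ge 65 ]; then echo WARN
  else echo PASS
  fi
}

# Classify a 5-minute load average (row 44):
# PASS below 30, WARN 30-37, FAIL at 38 and above.
# Assumption: the fractional part is dropped before comparing.
classify_load_5m() {
  load=${1%%.*}
  if [ "$load" -ge 38 ]; then echo FAIL
  elif [ "$load" -ge 30 ]; then echo WARN
  else echo PASS
  fi
}
```

The 5-minute value is the second field of `/proc/loadavg`, so on the host something like `classify_load_5m "$(cut -d' ' -f2 /proc/loadavg)"` would apply the second function.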
|
||||
|
||||
## Safe Auto-Fix Rules
|
||||
|
||||
`--fix` only performs operations that are genuinely reversible and
|
||||
observable. Nothing here rewrites Terraform state or mutates the cluster
|
||||
beyond "delete pod".
|
||||
|
||||
### Done automatically by `--fix`
|
||||
|
||||
- **Evicted / Failed pods** — delete them; the controller recreates.
|
||||
```bash
|
||||
kubectl delete pods -A --field-selector=status.phase=Failed
|
||||
```
|
||||
- **CrashLoopBackOff pods with >10 restarts** — delete once to reset
|
||||
backoff timer.
|
||||
|
||||
### NEVER auto-fix (requires human investigation)
|
||||
|
||||
- NotReady nodes
|
||||
- MemoryPressure / DiskPressure / PIDPressure
|
||||
- ImagePullBackOff (usually a bad tag / registry credential)
|
||||
- Deployment ready-replica mismatch
|
||||
- Pending PVCs
|
||||
- Node CPU/memory >90%
|
||||
- CronJob failures
|
||||
- DaemonSet desired != ready
|
||||
- Vault sealed
|
||||
- ClusterSecretStore not Ready
|
||||
- cert-manager Certificate failures
|
||||
- Backup freshness regressions
|
||||
- Any external-reachability failure
|
||||
|
||||
## Deep-investigation recipes per failure mode
|
||||
|
||||
### Node Issues (checks 1, 3, 17, 25)
|
||||
|
||||
```bash
|
||||
kubectl describe node <node>
|
||||
kubectl top nodes
|
||||
kubectl get events --field-selector involvedObject.name=<node> --sort-by='.lastTimestamp'
|
||||
# SSH to the node
|
||||
ssh root@10.0.20.10X
|
||||
systemctl status kubelet
|
||||
journalctl -u kubelet --since "30 minutes ago" | tail -100
|
||||
df -h ; free -h
|
||||
```
|
||||
|
||||
Node IPs: `10.0.20.100` master, `.101` node1 (GPU), `.102` node2,
|
||||
`.103` node3, `.104` node4.
|
||||
|
||||
### Pod Issues (checks 4, 5, 11, 19)
|
||||
|
||||
```bash
|
||||
kubectl describe pod -n <ns> <pod>
|
||||
kubectl logs -n <ns> <pod> --tail=200
|
||||
kubectl logs -n <ns> <pod> --previous --tail=200
|
||||
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -20
|
||||
```
|
||||
|
||||
Common failure causes: OOMKilled (raise mem limit in Terraform), bad
|
||||
config / missing env var, DB connection failure (check `dbaas` pods),
|
||||
NFS mount failure (`showmount -e 192.168.1.127`), stale
|
||||
imagePullSecret.
|
||||
|
||||
### Deployment / StatefulSet / DaemonSet (checks 6, 7, 16)
|
||||
|
||||
```bash
|
||||
kubectl describe deployment -n <ns> <name>
|
||||
kubectl rollout status deployment -n <ns> <name>
|
||||
kubectl rollout history deployment -n <ns> <name>
|
||||
kubectl get rs -n <ns> -l app=<app>
|
||||
```
|
||||
|
||||
### PVC (check 8)
|
||||
|
||||
```bash
|
||||
kubectl describe pvc -n <ns> <pvc>
|
||||
kubectl get events -n <ns> --field-selector reason=FailedMount --sort-by='.lastTimestamp'
|
||||
kubectl get pv | grep <pvc>
|
||||
showmount -e 192.168.1.127
|
||||
```
|
||||
|
||||
### cert-manager (checks 31, 32, 33)
|
||||
|
||||
```bash
|
||||
kubectl get certificate -A
|
||||
kubectl describe certificate -n <ns> <name>
|
||||
kubectl get certificaterequest -A
|
||||
kubectl describe certificaterequest -n <ns> <name>
|
||||
kubectl logs -n cert-manager deploy/cert-manager | tail -50
|
||||
```
|
||||
|
||||
Common causes: ACME HTTP-01 challenge blocked, ClusterIssuer missing
|
||||
DNS provider secret, rate-limit from Let's Encrypt.
|
||||
|
||||
### Backups (checks 34, 35, 36)
|
||||
|
||||
```bash
|
||||
# Per-DB dumps (inside the DB pod)
|
||||
kubectl exec -n dbaas mysql-standalone-0 -- ls -lah /backup/per-db/
|
||||
kubectl exec -n dbaas pg-cluster-0 -- ls -lah /backup/per-db/
|
||||
|
||||
# Pushgateway metrics
|
||||
kubectl exec -n monitoring deploy/prometheus-server -- \
|
||||
wget -qO- http://prometheus-prometheus-pushgateway:9091/metrics | \
|
||||
grep backup_last_success_timestamp
|
||||
|
||||
# LVM snapshots on PVE host
|
||||
ssh -o BatchMode=yes root@192.168.1.127 \
|
||||
'lvs -o lv_name,lv_time,lv_size --noheadings | grep snap'
|
||||
```
|
||||
|
||||
If offsite sync is stale, the common cause is the
|
||||
`offsite-sync-backup.service` systemd unit on the PVE host failing.
|
||||
Check it with `ssh root@192.168.1.127 'systemctl status offsite-sync-backup'`.
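
The freshness windows used by checks 34-36 (25h and 27h) come down to comparing a last-success timestamp against the current time. The following is a minimal sketch, assuming the timestamp is an integer number of Unix seconds (a Pushgateway value printed in scientific notation would need converting first); `backup_is_fresh` is a hypothetical helper name.

```bash
# Return 0 if a backup is younger than the allowed window.
# $1 = last-success Unix timestamp (integer seconds)
# $2 = maximum age in hours (e.g. 25 or 27)
# $3 = current Unix time, optional; defaults to `date +%s`
# backup_is_fresh is a hypothetical helper.
backup_is_fresh() {
  now=${3:-$(date +%s)}
  age=$(( now - $1 ))
  [ "$age" -lt $(( $2 * 3600 )) ]
}
```

The optional third argument exists only to make the function deterministic when testing; in normal use it can be omitted.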
|
||||
|
||||
### Monitoring stack (checks 37, 38, 39)
|
||||
|
||||
```bash
|
||||
# Prometheus
|
||||
kubectl exec -n monitoring deploy/prometheus-server -- wget -qO- http://localhost:9090/-/ready
|
||||
kubectl logs -n monitoring deploy/prometheus-server --tail=100
|
||||
|
||||
# Alertmanager
|
||||
kubectl get pods -n monitoring | grep alertmanager
|
||||
kubectl logs -n monitoring -l app=prometheus-alertmanager --tail=100
|
||||
|
||||
# Vault
|
||||
kubectl exec -n vault vault-0 -- sh -c 'VAULT_ADDR=http://127.0.0.1:8200 vault status'
|
||||
# If sealed: check raft peers with `vault operator raft list-peers` and unseal.
|
||||
|
||||
# ClusterSecretStore
|
||||
kubectl get clustersecretstore
|
||||
kubectl describe clustersecretstore vault-kv vault-database
|
||||
kubectl logs -n external-secrets deploy/external-secrets --tail=100
|
||||
```
|
||||
|
||||
### External reachability (checks 40, 41, 42)
|
||||
|
||||
```bash
|
||||
# Cloudflared
|
||||
kubectl get pods -n cloudflared
|
||||
kubectl logs -n cloudflared -l app=cloudflared --tail=100
|
||||
|
||||
# Authentik (Helm chart names the deployment goauthentik-server)
|
||||
kubectl get deployment -n authentik goauthentik-server
|
||||
kubectl logs -n authentik deploy/goauthentik-server --tail=100
|
||||
|
||||
# ExternalAccessDivergence alert
|
||||
kubectl exec -n monitoring deploy/prometheus-server -- \
|
||||
wget -qO- 'http://localhost:9090/api/v1/alerts' | \
|
||||
python3 -m json.tool | grep -A 5 ExternalAccessDivergence
|
||||
|
||||
# Traefik 5xx — find the hot service
|
||||
kubectl exec -n monitoring deploy/prometheus-server -- \
|
||||
wget -qO- 'http://localhost:9090/api/v1/query?query=topk(10,rate(traefik_service_requests_total{code=~%225..%22}%5B15m%5D))' \
|
||||
| python3 -m json.tool
|
||||
```
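
The PromQL queries above are percent-encoded by hand so they survive inside a `wget` URL. A small helper can do the same substitution. This is a sketch, not a general URL encoder: it covers only the percent sign, space, double quote, square brackets, and curly braces, and `promql_encode` is a hypothetical name.

```bash
# Percent-encode the characters that appear in this document's PromQL
# queries. The percent sign is encoded first so that the escapes
# produced by later substitutions are not encoded again.
# Limited sketch: other reserved characters (&, +, #, ...) are
# left untouched.
promql_encode() {
  printf '%s' "$1" | sed \
    -e 's/%/%25/g' \
    -e 's/ /%20/g' \
    -e 's/"/%22/g' \
    -e 's/\[/%5B/g' \
    -e 's/]/%5D/g' \
    -e 's/{/%7B/g' \
    -e 's/}/%7D/g'
}
```

For instance, `promql_encode 'rate(x{code=~"5.."}[15m])'` prints `rate(x%7Bcode=~%225..%22%7D%5B15m%5D)`, which can then be appended to `http://localhost:9090/api/v1/query?query=`.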
|
||||
|
||||
### OOMKilled remediation
|
||||
|
||||
1. `kubectl describe pod -n <ns> <pod> | grep -A 5 Limits`
|
||||
2. Edit `infra/modules/kubernetes/<service>/main.tf` and raise
|
||||
`resources.limits.memory`.
|
||||
3. `cd /home/wizard/code/infra && scripts/tg apply` (Tier 1) or
|
||||
`terraform apply -target=module.<service>` as appropriate.
|
||||
|
||||
### ImagePullBackOff remediation
|
||||
|
||||
1. `kubectl describe pod -n <ns> <pod> | grep -A 5 Events`
|
||||
2. Verify tag exists on the source registry.
|
||||
3. Check pull-through cache at `10.0.20.10:{5000,5010,5020,5030}`.
|
||||
4. Update the image tag in Terraform + re-apply.
|
||||
|
||||
### Persistent CrashLoopBackOff after auto-fix
|
||||
|
||||
1. `kubectl logs -n <ns> <pod> --previous --tail=200`
|
||||
2. `kubectl describe pod -n <ns> <pod>` and check Last State:
|
||||
- `OOMKilled` → raise memory limit
|
||||
- Exit code 137 → OOM or probe killed
|
||||
- Exit code 143 → SIGTERM / graceful shutdown failed
|
||||
3. Cross-check dbaas + NFS + secrets are healthy.
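
Exit codes above 128 follow the usual convention of 128 plus the signal number, which is how 137 and 143 map to SIGKILL (9) and SIGTERM (15). A minimal sketch of that arithmetic, with `signal_for_exit` as a hypothetical helper name:

```bash
# Print the signal number implied by a container exit code, or 0 if
# the code does not indicate death by signal (128 + N convention).
# signal_for_exit is a hypothetical helper.
signal_for_exit() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  else
    echo 0
  fi
}
```

`signal_for_exit 137` prints `9` and `signal_for_exit 143` prints `15`; an ordinary application error such as exit code 1 prints `0`.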
|
||||
|
||||
## Performance forensics — top consumers + optimization hints
|
||||
|
||||
When the cluster is healthy (script returns 0) but the host is hot or load
|
||||
is elevated, switch from "what broke?" to "what's expensive?". Run these
|
||||
in order; stop as soon as the root cause is obvious.
|
||||
|
||||
### Step 1 — Snapshot top consumers cluster-wide
|
||||
|
||||
```bash
|
||||
# Top 15 pods by current CPU
|
||||
kubectl top pods --all-namespaces --sort-by=cpu --no-headers | head -15
|
||||
|
||||
# CPU + memory usage per node (the cluster has 5 nodes)
|
||||
kubectl top nodes
|
||||
|
||||
# Top 15 by 5-min rolling rate (smoothed — kills noise from one-off spikes)
|
||||
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
|
||||
"http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(namespace,pod)%20(rate(container_cpu_usage_seconds_total%7Bcontainer!%3D''%7D%5B5m%5D)))" \
|
||||
| python3 -m json.tool | head -80
|
||||
```
|
||||
|
||||
### Step 2 — For each suspect pod, get the WHY
|
||||
|
||||
For every pod in the top-N, gather these BEFORE proposing a fix:
|
||||
|
||||
```bash
|
||||
NS=<namespace>; POD=<pod>; CONT=$(kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].name}')
|
||||
|
||||
# What it does (image + command)
|
||||
kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].image}{"\n"}{.spec.containers[0].args}{"\n"}'
|
||||
|
||||
# Resource limits + current usage
|
||||
kubectl -n $NS top pod $POD --containers
|
||||
kubectl -n $NS get pod $POD -o jsonpath='{.spec.containers[0].resources}'
|
||||
|
||||
# Recent logs filtered for reconcile loops, watch storms, slow queries
|
||||
kubectl -n $NS logs $POD -c $CONT --tail=200 --since=5m 2>&1 \
|
||||
| grep -iE 'reconcil|watch|scrape|index|loop|retry|slow|timeout' | tail -20
|
||||
|
||||
# Restart count + recent OOM
|
||||
kubectl -n $NS describe pod $POD | grep -E 'Restart Count|Last State|Reason'
|
||||
|
||||
# Self-exported metrics (for apps that publish on /metrics)
|
||||
kubectl -n $NS exec $POD -c $CONT -- wget -qO- localhost:<port>/metrics 2>/dev/null | head -50
|
||||
```
|
||||
|
||||
### Step 3 — apiserver / etcd specific deep-dive (when control-plane is hot)
|
||||
|
||||
```bash
|
||||
# Top request producers by verb+resource (last 30 min)
|
||||
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
|
||||
"http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(resource,verb)%20(rate(apiserver_request_total%5B30m%5D)))" \
|
||||
| python3 -m json.tool
|
||||
|
||||
# Top user agents (which clients are hammering)
|
||||
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
|
||||
"http://localhost:9090/api/v1/query?query=topk(15,sum%20by%20(user_agent)%20(rate(apiserver_request_total%5B30m%5D)))" \
|
||||
| python3 -m json.tool
|
||||
|
||||
# Long-running requests (WATCH / CONNECT — log streams, pod-watchers)
|
||||
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
|
||||
"http://localhost:9090/api/v1/query?query=apiserver_longrunning_requests" \
|
||||
| python3 -m json.tool
|
||||
|
||||
# etcd WAL fsync rate (a proxy for write rate; this query does not report DB size)
|
||||
kubectl -n monitoring exec deploy/prometheus-server -- wget -qO- \
|
||||
"http://localhost:9090/api/v1/query?query=rate(etcd_disk_wal_fsync_duration_seconds_count%5B5m%5D)" \
|
||||
| python3 -m json.tool
|
||||
```
|
||||
|
||||
### Step 4 — PVE host specific deep-dive (when temp / load is high)
|
||||
|
||||
Checks 43 + 44 capture package temp + 5-min load avg with PASS/WARN/FAIL
|
||||
thresholds — that's the first stop. When those WARN or FAIL, the
|
||||
follow-up commands below trace which VM / process is the source:
|
||||
|
||||
```bash
|
||||
# Per-core temps (broader than the package summary in check 43)
|
||||
ssh root@192.168.1.127 'for f in /sys/class/hwmon/hwmon0/temp*_input; do
|
||||
base=${f%_input}; label=$(cat ${base}_label 2>/dev/null || echo "${base##*/}")
|
||||
val=$(cat "$f"); echo " $label: $((val/1000))°C"
|
||||
done'
|
||||
|
||||
# Per-VM CPU (each VM = one kvm process)
|
||||
ssh root@192.168.1.127 'top -bn1 -o %CPU | grep kvm | head -10'
|
||||
|
||||
# pvestatd anomaly check — bursts > 50% usually mean LV count > 1000
|
||||
ssh root@192.168.1.127 'lvs --noheadings 2>/dev/null | wc -l'
|
||||
|
||||
# Stale snapshots (any '_pre-*' that survived past their rollback window)
|
||||
ssh root@192.168.1.127 'lvs --noheadings -o lv_name 2>/dev/null | awk "/_pre-/" | head -20'
|
||||
```
|
||||
|
||||
### Step 5 — Optimization decision
|
||||
|
||||
For each consumer in the top-N, fill in a row:
|
||||
|
||||
| Pod / Process | CPU (m) | Why busy | Tunable | Est saving | Trade-off | Effort |
|---|---|---|---|---|---|---|
|
||||
|
||||
Then rank by ROI (saving / effort) and surface the top 3-5. **Hold back the ones where saving < 50m unless effort is also < 5 min.**
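
The ranking rule above can be sketched as follows. This is an illustration only: the `Candidate` fields, the function names, and the assumption that effort is recorded in minutes are not part of the skill.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    saving_m: float    # estimated CPU saving, millicores
    effort_min: float  # estimated effort, minutes (assumed unit)


def roi(c: Candidate) -> float:
    """Saving per minute of effort; tiny efforts are floored to avoid division by zero."""
    return c.saving_m / max(c.effort_min, 0.1)


def shortlist(cands, top_n=5, min_saving_m=50, quick_fix_min=5):
    """Drop low-saving items unless they are quick fixes, then rank by ROI."""
    kept = [c for c in cands
            if c.saving_m >= min_saving_m or c.effort_min < quick_fix_min]
    return sorted(kept, key=roi, reverse=True)[:top_n]


if __name__ == "__main__":
    demo = [Candidate("a", 400, 60), Candidate("b", 30, 2),
            Candidate("c", 40, 30), Candidate("d", 120, 10)]
    for c in shortlist(demo):
        print(f"{c.name}: {roi(c):.1f} m/min")
```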
|
||||
|
||||
### Common causes + tunables (catalogue)
|
||||
|
||||
| Symptom | Likely cause | Tunable |
|---|---|---|
| **`kube-apiserver` > 1 core sustained** | `CONNECT pods/log` streams from `alloy`/`promtail` using apiserver-tail; OR Kyverno PolicyReport churn (background+enforce mode); OR VPA fanout (309 VPAs cause ~7 req/s) | Switch alloy/promtail to `loki.source.file`; raise Kyverno `backgroundScanInterval`; reduce VPA count |
| **`pvestatd` 70-100% bursts** | LV metadata scan over > 1000 LVs (typically stale `_pre-*` snapshots from ad-hoc node ops) | Delete stale snapshots; `/usr/local/bin/lvm-pvc-snapshot prune` |
| **Frigate > 2 cores** | Birdseye `mode: continuous` (16% on frigate.output); LPR debug; debug logging; too many active cameras × detect.fps | `birdseye.mode: motion`; `lpr.debug_save_plates: false`; remove debug loggers |
| **`vault-0` looping ERRORs every ~10s** | DB static-role not in connection's `allowed_roles` list (drift between role and connection) | Add role to `vault_database_secret_backend_connection.*.allowed_roles` in TF |
| **Alloy DS > 100m/pod** | `loki.source.kubernetes` (apiserver-tail) instead of `loki.source.file` | Switch to file-tail (~5× drop per pod) |
| **Prometheus default 1m scrape** | Chart default; new sample every minute | Raise `server.global.scrape_interval` to 2m; pin critical jobs (snmp-ups) to 30s; bump `for: 1m` alerts to `for: 3m` |
| **`kube-controller-manager` periodic ERROR loop** | Aggregated APIService discovery fails (calico/metrics-server unreachable, OR stuck Terminating pod still in endpoints) | Force-delete stuck pod; verify APIService Available; check pod runc bug on k8s-master |
| **etcd write > 1 MB/s** | PolicyReport thrash, too-frequent secret rotation, or audit log mode = RequestResponse | Trim Kyverno reports config; raise rotation_period; downgrade audit policy to Metadata for noisy resources |
|
||||
|
||||
### What NOT to touch
|
||||
|
||||
- **calico-node, etcd write rate, kube-controller-manager core work, pg-cluster replication** — structural cost, touching them risks correctness.
|
||||
- **Pods doing legitimate request-serving work** (web servers, databases under load) — optimize the workload, not the runtime.
|
||||
- **Anything where Goldilocks VPA upperBound is already close to current request** — no headroom to cut.
|
||||
|
||||
### Source-of-truth notes
|
||||
|
||||
- **All infra mutations go via Terraform** (`scripts/tg plan/apply`). The recipes above are diagnostic; the FIX lives in `infra/stacks/<name>/main.tf` or chart values.
|
||||
- **Pod-internal config files** (e.g., Frigate's `/config/config.yml` on a PVC) are not TF-managed — edit in-pod and document in `infra/docs/runbooks/`.
|
||||
- **PVE host-level state** (LVM snapshots, pvestatd) — SSH + manual ops; record in memory if the pattern recurs.
|
||||
|
||||
## Notes on the canonical / hardlink setup
|
||||
|
||||
The authoritative copy of this SKILL.md lives at `/home/wizard/code/.claude/skills/cluster-health/SKILL.md`. A hardlink at `/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md` points to the same inode so infra-rooted sessions also discover the skill.
|
||||
|
||||
To verify the hardlink is intact:
|
||||
|
||||
```bash
|
||||
stat -c '%i %n' \
|
||||
/home/wizard/code/.claude/skills/cluster-health/SKILL.md \
|
||||
/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md
|
||||
```
|
||||
|
||||
Both should print the same inode number. If they diverge (e.g. `git checkout` replaced the file rather than updating it), re-link:
|
||||
|
||||
```bash
|
||||
ln -f /home/wizard/code/.claude/skills/cluster-health/SKILL.md \
|
||||
/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md
|
||||
```
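
For a scripted check, the same inode comparison can be done from Python. This is a sketch assuming both paths sit on the same host; `same_inode` is a hypothetical helper, and the standard library's `os.path.samefile` performs an equivalent comparison.

```python
import os


def same_inode(path_a: str, path_b: str) -> bool:
    """True when both paths resolve to the same device and inode (i.e. are hardlinked)."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


if __name__ == "__main__":
    canonical = "/home/wizard/code/.claude/skills/cluster-health/SKILL.md"
    linked = "/home/wizard/code/infra/.claude/skills/cluster-health/SKILL.md"
    try:
        print("hardlink intact" if same_inode(canonical, linked) else "diverged: re-link")
    except FileNotFoundError as exc:
        print(f"missing file: {exc.filename}")
```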
|
||||
215
.claude/skills/disk-wear/SKILL.md
Normal file
|
|
@ -0,0 +1,215 @@
|
|||
---
|
||||
name: disk-wear
|
||||
description: |
|
||||
Analyze disk write patterns on the PVE host to assess wear and identify
|
||||
top writers by VM, k8s app, and PVC. Use when:
|
||||
(1) User asks about disk wear, disk writes, or storage health,
|
||||
(2) User says "what's wearing the disk", "disk analysis", "I/O analysis",
|
||||
(3) User wants to check write rates by VM, k8s namespace, or PVC,
|
||||
(4) Periodic quarterly disk health review.
|
||||
Combines PVE host I/O stats (SSH), Prometheus metrics (PromQL), and
|
||||
k8s PVC-to-pod mapping for a full breakdown.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-04-17
|
||||
---
|
||||
|
||||
# Disk Wear Analysis
|
||||
|
||||
## Infrastructure
|
||||
|
||||
| Resource | Address | Notes |
|----------|---------|-------|
| PVE host | `root@192.168.1.127` (SSH) | Dell R730, PERC H730 RAID |
| Prometheus | `prometheus-server.monitoring.svc:80` | Query via alertmanager pod (wget) |
| SSD | Slot 4, Samsung 850 EVO 1TB | Rated 150 TBW |
| HDD sdc | RAID1 (2x 11.7TB SAS 7200RPM) | Main data disk, enterprise rated ~550 TB/yr |
| HDD sda | 1.2TB SAS 10K RPM | Backup only |
|
||||
|
||||
## Step 1: Physical Disk Overview + SSD Health
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127 'echo "=== UPTIME ===" && uptime && echo "" && \
|
||||
echo "=== PHYSICAL DISK CUMULATIVE (since boot) ===" && iostat -d -k sda sdb sdc 2>/dev/null && echo "" && \
|
||||
echo "=== SSD SMART (Samsung 850 EVO, slot 4) ===" && \
|
||||
smartctl -d sat+megaraid,4 -A /dev/sda 2>/dev/null | grep -iE "power_on|reallocat|written|wear|pending|uncorrect"'
|
||||
```
|
||||
|
||||
**Interpret SSD health:**
|
||||
- `Wear_Leveling_Count`: 100 = new, 0 = dead. Calculate `(100 - value)%` wear used.
|
||||
- `Total_LBAs_Written`: multiply by 512 to get bytes written, then divide by 10^12 for TB written. Compare against the 150 TBW rating.
|
||||
- Estimate remaining life: `(150 TBW - current TBW) / annual write rate`.
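
The three rules above amount to a few lines of arithmetic. The sketch below assumes 512-byte LBAs and decimal terabytes (the usual convention for TBW ratings); the function names and the sample numbers in the usage section are illustrative, not measurements from this host.

```python
LBA_BYTES = 512        # per the interpretation rule above
RATED_TBW = 150.0      # Samsung 850 EVO 1TB rating from the Infrastructure table


def tb_written(total_lbas_written: int) -> float:
    """Convert the Total_LBAs_Written raw value to decimal terabytes."""
    return total_lbas_written * LBA_BYTES / 1e12


def wear_used_pct(wear_leveling_count: int) -> int:
    """Wear_Leveling_Count normalized value: 100 = new, 0 = worn out."""
    return 100 - wear_leveling_count


def years_remaining(current_tbw: float, annual_tb: float, rated_tbw: float = RATED_TBW) -> float:
    """Remaining rated endurance divided by the annual write rate."""
    return max(rated_tbw - current_tbw, 0.0) / annual_tb


if __name__ == "__main__":
    # Hypothetical SMART values for illustration only.
    tbw = tb_written(97_656_250_000)
    print(f"{tbw:.1f} TB written, {wear_used_pct(85)}% wear used, "
          f"{years_remaining(tbw, annual_tb=10):.1f} years left at 10 TB/yr")
```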
|
||||
|
||||
## Step 2: Real-Time Snapshot (30 seconds)
|
||||
|
||||
SSH to PVE host and take two reads of block device stats 30 seconds apart. This gives instantaneous write rates independent of Prometheus scrape intervals.
|
||||
|
||||
```bash
|
||||
ssh root@192.168.1.127 'bash -s' << 'SCRIPT'
|
||||
echo "=== 30-SECOND SNAPSHOT ($(date)) ==="
|
||||
declare -A snap1
|
||||
for dm in /sys/block/dm-*; do
|
||||
name=$(basename $dm)
|
||||
snap1[$name]=$(cat $dm/stat 2>/dev/null | awk '{print $7}')
|
||||
done
|
||||
for d in sda sdb sdc; do
|
||||
snap1[$d]=$(cat /sys/block/$d/stat 2>/dev/null | awk '{print $7}')
|
||||
done
|
||||
|
||||
sleep 30
|
||||
|
||||
printf "%-12s %10s %10s %s\n" "DEVICE" "kB/s" "GB/day" "NAME"
|
||||
echo "-------------------------------------------------------------------"
|
||||
results=""
|
||||
for dm in /sys/block/dm-*; do
|
||||
name=$(basename $dm)
|
||||
s2=$(cat $dm/stat 2>/dev/null | awk '{print $7}')
|
||||
s1=${snap1[$name]:-0}
|
||||
diff=$((s2 - s1))
|
||||
if [ "$diff" -gt 100 ]; then
|
||||
kbps=$((diff / 2 / 30))
|
||||
gbday=$(echo "scale=1; $kbps * 86400 / 1048576" | bc)
|
||||
lvm=$(dmsetup info --columns --noheadings -o name /dev/$name 2>/dev/null)
|
||||
results="$results\n$name $kbps $gbday $lvm"
|
||||
fi
|
||||
done
|
||||
for d in sda sdb sdc; do
|
||||
s2=$(cat /sys/block/$d/stat 2>/dev/null | awk '{print $7}')
|
||||
s1=${snap1[$d]:-0}
|
||||
diff=$((s2 - s1))
|
||||
kbps=$((diff / 2 / 30))
|
||||
gbday=$(echo "scale=1; $kbps * 86400 / 1048576" | bc)
|
||||
results="$results\n$d $kbps $gbday (physical)"
|
||||
done
|
||||
echo -e "$results" | sort -k2 -rn | head -30 | while read dev kbps gbday name; do
|
||||
printf "%-12s %8s kB/s %8s GB/day %s\n" "$dev" "$kbps" "$gbday" "$name"
|
||||
done
|
||||
SCRIPT
|
||||
```
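
The unit conversion inside the script above is restated here as a small Python function, mainly to make the arithmetic checkable. It assumes, as the script does, that field 7 of `/sys/block/<dev>/stat` counts 512-byte sectors written; `write_rate` is a hypothetical name. Unlike the shell version, it does not truncate to whole kB/s.

```python
SECTOR_BYTES = 512  # the kernel reports stat sectors in 512-byte units


def write_rate(sectors_before: int, sectors_after: int, interval_s: float = 30):
    """Return (KiB/s, GiB/day) for a sectors-written delta over interval_s seconds."""
    delta_sectors = sectors_after - sectors_before
    kib_per_s = delta_sectors * SECTOR_BYTES / 1024 / interval_s
    gib_per_day = kib_per_s * 86400 / (1024 * 1024)
    return kib_per_s, gib_per_day


if __name__ == "__main__":
    kib_s, gib_day = write_rate(0, 61_440)
    print(f"{kib_s:.0f} kB/s, {gib_day:.1f} GB/day")
```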
|
||||
|
||||
## Step 3: Prometheus — Per-App Write Attribution
|
||||
|
||||
Query Prometheus from inside the cluster (alertmanager pod has wget).
|
||||
|
||||
### 3a. Top PVC Writers (1h rate)
|
||||
|
||||
```bash
|
||||
kubectl exec -n monitoring prometheus-alertmanager-0 -- wget -qO- 'http://prometheus-server/api/v1/query' \
|
||||
--post-data='query=topk(20,rate(node_disk_written_bytes_total{instance=~"pve.*"}[1h])*on(device)group_left(lv_name,vg_name)node_disk_device_mapper_info{instance=~"pve.*",lv_name=~"vm-9999-pvc-.*"})' \
|
||||
2>/dev/null | python3 -c "
|
||||
import json,sys
|
||||
d=json.load(sys.stdin)
|
||||
for r in d['data']['result']:
|
||||
m = r['metric']
|
||||
val = float(r['value'][1])
|
||||
gb_day = val * 86400 / 1073741824
|
||||
if gb_day > 0.05:
|
||||
lv = m.get('lv_name','?').replace('vm-9999-','')
|
||||
print(f'{gb_day:8.1f} GB/day {lv}')
|
||||
"
|
||||
```
|
||||
|
||||
Then enrich PVC UUIDs with names:
|
||||
```bash
|
||||
kubectl get pv -o custom-columns=NAME:.metadata.name,PVC:.spec.claimRef.name,NS:.spec.claimRef.namespace | grep "pvc-<UUID>"
|
||||
```
|
||||
|
||||
### 3b. Top VM Writers (1h rate)
|
||||
|
||||
```bash
|
||||
kubectl exec -n monitoring prometheus-alertmanager-0 -- wget -qO- 'http://prometheus-server/api/v1/query' \
|
||||
--post-data='query=topk(10,rate(node_disk_written_bytes_total{instance=~"pve.*"}[1h])*on(device)group_left(lv_name,vg_name)node_disk_device_mapper_info{instance=~"pve.*",lv_name!~"vm-9999-.*|root|swap|data.*|nfs.*|backup.*|ssd.*"})' \
|
||||
2>/dev/null | python3 -c "
|
||||
import json,sys
|
||||
d=json.load(sys.stdin)
|
||||
for r in d['data']['result']:
|
||||
m = r['metric']
|
||||
val = float(r['value'][1])
|
||||
gb_day = val * 86400 / 1073741824
|
||||
print(f'{gb_day:8.1f} GB/day {m.get(\"lv_name\",\"?\")}')
|
||||
"
|
||||
```
|
||||
|
||||
Enrich VM IDs with names:
|
||||
```bash
|
||||
ssh root@192.168.1.127 'qm list' 2>/dev/null
|
||||
```
|
||||
|
||||
### 3c. Aggregate PVC Writes by K8s Namespace
|
||||
|
||||
After collecting the top PVC writers from 3a, map each PVC UUID to its namespace using `kubectl get pv`, then sum by namespace. Present as a table:
|
||||
|
||||
| Namespace | GB/day | Top PVC |
|-----------|--------|---------|
| dbaas | ... | mysql-standalone, pg-cluster |
| monitoring | ... | prometheus-data |
|
||||
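The roll-up described in 3c can be sketched as below, assuming the 3a output has been collected into a dict of PV name to GB/day and the `kubectl get pv` output into a dict of PV name to `(namespace, claim)`. Both input shapes, the function name, and the sample values are hypothetical.

```python
from collections import defaultdict


def sum_by_namespace(pv_gb_day, pv_to_claim):
    """Return [(namespace, total GB/day, top claim)] sorted by total, descending."""
    totals = defaultdict(float)
    top = {}  # namespace -> (claim, gb_day) of the heaviest writer seen so far
    for pv, gb in pv_gb_day.items():
        ns, claim = pv_to_claim.get(pv, ("unknown", pv))
        totals[ns] += gb
        if ns not in top or gb > top[ns][1]:
            top[ns] = (claim, gb)
    rows = [(ns, round(total, 1), top[ns][0]) for ns, total in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Placeholder numbers, not real measurements.
    writes = {"pvc-aaa": 8.0, "pvc-bbb": 4.5, "pvc-ccc": 3.0}
    claims = {"pvc-aaa": ("dbaas", "mysql-standalone"),
              "pvc-bbb": ("dbaas", "pg-cluster"),
              "pvc-ccc": ("monitoring", "prometheus-data")}
    for ns, gb, claim in sum_by_namespace(writes, claims):
        print(f"| {ns} | {gb} | {claim} |")
```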
|
||||
### 3d. Historical Trend (7-day total)
|
||||
|
||||
```bash
|
||||
kubectl exec -n monitoring prometheus-alertmanager-0 -- wget -qO- 'http://prometheus-server/api/v1/query' \
|
||||
--post-data='query=topk(10,increase(node_disk_written_bytes_total{instance=~"pve.*",device=~"sda|sdb|sdc"}[7d]))' \
|
||||
2>/dev/null | python3 -c "
|
||||
import json,sys
|
||||
d=json.load(sys.stdin)
|
||||
for r in d['data']['result']:
|
||||
m = r['metric']
|
||||
val = float(r['value'][1])
|
||||
tb = val / 1099511627776
|
||||
print(f'{tb:8.2f} TB/7d device={m.get(\"device\",\"?\")}')
|
||||
"
|
||||
```
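
To turn the 7-day total into the "Rate GB/day" and "Annualized" figures used later in the report, a sketch like the following works. It keeps the query's binary divisor (1099511627776 bytes per "TB"), so the results line up with the output above; `annualize` is a hypothetical name.

```python
TIB = 1024 ** 4  # 1099511627776, the divisor used in the 3d query output


def annualize(bytes_7d: float):
    """Return (GB/day, TB/yr) extrapolated from a 7-day byte total."""
    tb_7d = bytes_7d / TIB
    gb_per_day = tb_7d / 7 * 1024
    tb_per_year = tb_7d * 365 / 7
    return gb_per_day, tb_per_year


if __name__ == "__main__":
    gb_day, tb_year = annualize(2.5 * TIB)  # illustrative 7-day total
    print(f"{gb_day:.1f} GB/day, {tb_year:.1f} TB/yr")
```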
|
||||
|
||||
## Step 4: Interpretation
|
||||
|
||||
### Baselines
|
||||
|
||||
| Metric | Healthy | Warning | Critical |
|--------|---------|---------|----------|
| sdc (HDD RAID1) annualized | <200 TB/yr | 200-400 TB/yr | >400 TB/yr |
| sdb (SSD) wear used | <50% | 50-80% | >80% |
| Single PVC write rate | <20 GB/day | 20-50 GB/day | >50 GB/day |
| Single VM write rate | <50 GB/day | 50-100 GB/day | >100 GB/day |
| NFS volume total | <20 GB/day | 20-50 GB/day | >50 GB/day |
|
||||
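The baseline table maps directly onto a lookup. In this sketch the metric keys and the `classify` function are made-up names; the numbers are the thresholds from the table, with values on a boundary treated as Warning.

```python
# (warning_from, critical_above) per metric, copied from the Baselines table.
THRESHOLDS = {
    "hdd_tb_per_year": (200, 400),
    "ssd_wear_pct": (50, 80),
    "pvc_gb_per_day": (20, 50),
    "vm_gb_per_day": (50, 100),
    "nfs_gb_per_day": (20, 50),
}


def classify(metric: str, value: float) -> str:
    """Return HEALTHY, WARNING or CRITICAL for a measured value."""
    warn_from, crit_above = THRESHOLDS[metric]
    if value > crit_above:
        return "CRITICAL"
    if value >= warn_from:
        return "WARNING"
    return "HEALTHY"


if __name__ == "__main__":
    print(classify("pvc_gb_per_day", 35))
```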
|
||||
### Known Write Sources (expected baseline, April 2026)
|
||||
|
||||
| Source | Expected GB/day | Notes |
|--------|----------------|-------|
| MySQL standalone | 5-10 | uptimekuma heartbeats + phpipam. `skip-log-bin`, no GR |
| PostgreSQL cluster | 5-15 | Technitium DNS query logs (90-day retention) + app DBs |
| k8s-master etcd | 30-50 | etcd WAL + snapshot compaction |
| k8s-node VMs | 10-30 each | containerd layers, kubelet journals, ephemeral storage |
| Prometheus | 3-5 | TSDB compaction |
| home-assistant | 10-15 | Recorder database (SQLite/MariaDB) |
| NFS volume | 5-10 | Minimal after TrueNAS deprecation |
|
||||
|
||||
### Red Flags (investigate immediately)
|
||||
|
||||
- Any single PVC >50 GB/day
|
||||
- MySQL `log_bin` = ON (should be OFF — `skip-log-bin` in standalone config)
|
||||
- Technitium MySQL or SQLite query log plugins re-installed (should be uninstalled)
|
||||
- NFS writes >30 GB/day (media ingestion or backup churn)
|
||||
- SSD wear >80% or projected life <2 years
|
||||
- k8s node VM writes >100 GB/day (something writing heavily to ephemeral storage)
|
||||
|
||||
## Step 5: Report Format
|
||||
|
||||
Present findings as three tables:
|
||||
|
||||
**1. Physical Disks**
|
||||
| Disk | Type | 7d Total | Rate GB/day | Annualized | Status |
|------|------|----------|-------------|------------|--------|
|
||||
|
||||
**2. Top Writers (VMs + PVCs combined, sorted by rate)**
|
||||
| Rank | Name | Type | GB/day | Status | Notes |
|------|------|------|--------|--------|-------|
|
||||
|
||||
**3. By K8s Namespace**
|
||||
| Namespace | PVC Writes GB/day | Top Contributor |
|-----------|-------------------|-----------------|
|
||||
|
||||
End with:
|
||||
- Annualized wear projections
|
||||
- Comparison with previous run (if user provides one)
|
||||
- Action items for any WARNING/CRITICAL findings
|
||||
90
.claude/skills/extend-vm-storage/SKILL.md
Normal file
|
|
@ -0,0 +1,90 @@
|
|||
---
|
||||
name: extend-vm-storage
|
||||
description: |
|
||||
Extend disk storage on a Kubernetes node VM (Proxmox-hosted).
|
||||
Use when: (1) User wants to increase disk space on a k8s node VM,
|
||||
(2) A node is running low on disk, (3) User says "extend storage"
|
||||
or "add disk space". Automates: drain → shutdown → resize → boot →
|
||||
expand filesystem → uncordon.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2025-01-01
|
||||
---
|
||||
|
||||
# Extend VM Storage Skill
|
||||
|
||||
**Purpose**: Extend disk storage on a Kubernetes node VM (Proxmox-hosted).
|
||||
|
||||
**When to use**: User wants to increase disk space on a k8s node VM, or a node is running low on disk.
|
||||
|
||||
## Workflow
|
||||
|
||||
### 1. Identify the Node
|
||||
|
||||
Ask the user which node needs more storage and how much to add.
|
||||
|
||||
Valid nodes: `k8s-master`, `k8s-node1`, `k8s-node2`, `k8s-node3`, `k8s-node4`
|
||||
|
||||
### 2. Run the Script
|
||||
|
||||
```bash
|
||||
./scripts/extend_vm_storage.sh <node-name> <size-increment>
|
||||
```
|
||||
|
||||
**Example**:
|
||||
```bash
|
||||
./scripts/extend_vm_storage.sh k8s-node2 +64G
|
||||
```
|
||||
|
||||
### 3. What the Script Does
|
||||
|
||||
1. Validates inputs (node name and size format)
2. Resolves node IP via kubectl
3. Prompts for confirmation
4. Drains the node (evicts pods)
5. Shuts down the VM in Proxmox
6. Resizes the disk (`scsi0`) by the given increment
7. Starts the VM and waits for SSH
8. Expands the filesystem inside the guest (auto-detects LVM vs direct partition)
9. Uncordons the node
10. Shows verification output (`df -h` and node status)
|
||||
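Before handing values to the script, the inputs can be pre-checked along the lines of step 1. The script's actual validation is not shown in this skill, so the sketch below is an assumption: it accepts only the node names listed above and increments shaped like `+64G` or `+1T`, and `validate` is a hypothetical helper.

```python
import re

VALID_NODES = ("k8s-master", "k8s-node1", "k8s-node2", "k8s-node3", "k8s-node4")
SIZE_RE = re.compile(r"^\+(\d+)([GT])$")  # assumed format, e.g. +64G or +1T


def validate(node: str, increment: str):
    """Return (node, increment in GiB) or raise ValueError on bad input."""
    if node not in VALID_NODES:
        raise ValueError(f"unknown node: {node}")
    match = SIZE_RE.match(increment)
    if not match:
        raise ValueError(f"size must look like +64G or +1T, got: {increment}")
    amount, unit = int(match.group(1)), match.group(2)
    if amount == 0:
        raise ValueError("increment must be greater than zero")
    return node, amount * (1024 if unit == "T" else 1)


if __name__ == "__main__":
    print(validate("k8s-node2", "+64G"))
```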
|
||||
### 4. Update Terraform (if needed)
|
||||
|
||||
If you want Terraform to reflect the new disk size, update the VM definition in `main.tf` or `modules/create-vm/` so that a future `terraform apply` doesn't revert the change. Check if the VM disk size is managed by Terraform:
|
||||
|
||||
```bash
|
||||
grep -A5 "disk" main.tf | grep -i size
|
||||
```
|
||||
|
||||
If managed, update the size value to match the new total.
|
||||
|
||||
### 5. Verification
|
||||
|
||||
After the script completes, verify:
|
||||
```bash
|
||||
kubectl --kubeconfig $(pwd)/config get nodes
|
||||
ssh wizard@<node-ip> "df -h /"
|
||||
```
|
||||
|
||||
## Recovery
|
||||
|
||||
If the script fails mid-way:
|
||||
1. Check VM status: `ssh root@192.168.1.127 "qm status <vmid>"`
|
||||
2. Start VM if stopped: `ssh root@192.168.1.127 "qm start <vmid>"`
|
||||
3. Uncordon node: `kubectl --kubeconfig $(pwd)/config uncordon <node-name>`
|
||||
|
||||
## Constants
|
||||
|
||||
| Setting | Value |
|---------|-------|
| Proxmox host | `root@192.168.1.127` |
| VM SSH user | `wizard` |
| Disk name | `scsi0` |
| Shutdown timeout | 300s |
| SSH wait timeout | 300s |
|
||||
|
||||
## Questions to Ask User
|
||||
|
||||
1. Which node needs more storage?
|
||||
2. How much storage to add? (e.g., +64G)
|
||||
487
.claude/skills/home-assistant/SKILL.md
Normal file
|
|
@ -0,0 +1,487 @@
|
|||
---
|
||||
name: home-assistant
|
||||
description: |
|
||||
Control Home Assistant smart home devices and automations. Use when:
|
||||
(1) User asks to turn on/off lights, switches, or devices,
|
||||
(2) User asks about the state of sensors, devices, or entities,
|
||||
(3) User says "turn on the lights", "set temperature", "lock the door",
|
||||
(4) User asks to run a scene or script,
|
||||
(5) User asks "what devices are on?" or "is the door locked?",
|
||||
(6) User mentions smart home, IoT, or home automation.
|
||||
There are TWO Home Assistant deployments: ha-london (default) and ha-sofia.
|
||||
Always use Home Assistant for smart home control.
|
||||
author: Claude Code
|
||||
version: 2.0.0
|
||||
date: 2026-02-07
|
||||
---
|
||||
|
||||
# Home Assistant Control
|
||||
|
||||
## Problem
|
||||
Need to control smart home devices, check sensor states, or run automations via Home Assistant.
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- User asks to control lights, switches, covers, climate, etc.
|
||||
- User asks about device states ("is the light on?", "what's the temperature?")
|
||||
- User wants to run a scene or script
|
||||
- User mentions turning things on/off
|
||||
- User asks about smart home devices
|
||||
|
||||
## Deployments
|
||||
|
||||
There are **two** Home Assistant instances:
|
||||
|
||||
| Instance | URL | SSH | Default? |
|----------|-----|-----|----------|
| **ha-london** | `https://ha-london.viktorbarzin.me` | `ssh hassio@192.168.8.103` | Yes |
| **ha-sofia** | `https://ha-sofia.viktorbarzin.me` | `ssh vbarzin@192.168.1.8` | No |
|
||||
|
||||
- **Default**: ha-london (use unless user specifies "sofia" or "ha-sofia")
|
||||
- **Aliases**: "ha" or "HA" = ha-london. "ha sofia" or "ha-sofia" = ha-sofia.
|
||||
|
||||
## Prerequisites
|
||||
- Python 3 with `requests` package available (installed via PYTHONPATH or system packages)
|
||||
- Environment variables for each instance:
|
||||
- **ha-london**: `HOME_ASSISTANT_URL` and `HOME_ASSISTANT_TOKEN`
|
||||
- **ha-sofia**: `HOME_ASSISTANT_SOFIA_URL` and `HOME_ASSISTANT_SOFIA_TOKEN`
|
||||
|
||||
## API Control
|
||||
|
||||
### Scripts
|
||||
|
||||
| Instance | Script |
|----------|--------|
| ha-london | `.claude/home-assistant.py` |
| ha-sofia | `.claude/home-assistant-sofia.py` |
|
||||
|
||||
### Execution Pattern (CRITICAL)
|
||||
Run the scripts directly with python3 (env vars are set in the environment):
|
||||
|
||||
```bash
|
||||
# ha-london (default)
|
||||
python3 .claude/home-assistant.py [command] [options]
|
||||
|
||||
# ha-sofia
|
||||
python3 .claude/home-assistant-sofia.py [command] [options]
|
||||
```
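
The routing rule (ha-london unless the user mentions Sofia) can be sketched as a lookup. The script paths and environment variable names come from this skill; the `INSTANCES` structure and `resolve_instance` are hypothetical, and the substring match is a deliberately simple heuristic.

```python
INSTANCES = {
    "ha-london": {
        "script": ".claude/home-assistant.py",
        "url_env": "HOME_ASSISTANT_URL",
        "token_env": "HOME_ASSISTANT_TOKEN",
    },
    "ha-sofia": {
        "script": ".claude/home-assistant-sofia.py",
        "url_env": "HOME_ASSISTANT_SOFIA_URL",
        "token_env": "HOME_ASSISTANT_SOFIA_TOKEN",
    },
}


def resolve_instance(user_text: str) -> str:
    """Pick ha-sofia only when the request mentions Sofia; default to ha-london."""
    return "ha-sofia" if "sofia" in user_text.lower() else "ha-london"


if __name__ == "__main__":
    name = resolve_instance("what is the boiler temperature on ha sofia?")
    print(name, INSTANCES[name]["script"])
```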
|
||||
|
||||
### Available Commands
|
||||
|
||||
#### List Entities
|
||||
```bash
# List all entities
python3 .claude/home-assistant.py list

# List by domain
python3 .claude/home-assistant.py list --domain light
python3 .claude/home-assistant.py list --domain switch
python3 .claude/home-assistant.py list --domain sensor
python3 .claude/home-assistant.py list --domain climate
python3 .claude/home-assistant.py list --domain cover

# JSON output
python3 .claude/home-assistant.py list --json
```
|
||||
|
||||
#### Search Entities
|
||||
```bash
# Search by name or ID
python3 .claude/home-assistant.py search "living room"
python3 .claude/home-assistant.py search "temperature"
python3 .claude/home-assistant.py search "door"
```
|
||||
|
||||
#### Get Entity State
|
||||
```bash
python3 .claude/home-assistant.py state light.living_room
python3 .claude/home-assistant.py state sensor.temperature
python3 .claude/home-assistant.py state --json light.living_room
```
|
||||
|
||||
#### Control Entities
|
||||
```bash
# Turn on/off
python3 .claude/home-assistant.py on light.living_room
python3 .claude/home-assistant.py off switch.tv
python3 .claude/home-assistant.py toggle light.bedroom

# Set values
python3 .claude/home-assistant.py set light.living_room 75          # brightness %
python3 .claude/home-assistant.py set climate.thermostat 22         # temperature
python3 .claude/home-assistant.py set cover.blinds 50               # position %
python3 .claude/home-assistant.py set input_number.volume 80        # numeric value
python3 .claude/home-assistant.py set input_boolean.away_mode on    # boolean
python3 .claude/home-assistant.py set input_select.mode "Night"     # select option
```
|
||||
|
||||
#### Run Scenes and Scripts
|
||||
```bash
# Activate a scene
python3 .claude/home-assistant.py scene movie_night
python3 .claude/home-assistant.py scene scene.good_morning

# Run a script
python3 .claude/home-assistant.py script bedtime_routine
python3 .claude/home-assistant.py script script.welcome_home
```
|
||||
|
||||
#### Call Any Service
|
||||
```bash
# Generic service call
python3 .claude/home-assistant.py service light turn_on --entity light.kitchen --data '{"brightness": 255}'
python3 .claude/home-assistant.py service climate set_hvac_mode --entity climate.living_room --data '{"hvac_mode": "heat"}'
python3 .claude/home-assistant.py service media_player play_media --entity media_player.tv --data '{"media_content_id": "...", "media_content_type": "video"}'
```
|
||||
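For orientation, a generic service call corresponds to a `POST /api/services/<domain>/<service>` request in Home Assistant's REST API, authenticated with a bearer token. The sketch below only builds that request with the standard library; whether `.claude/home-assistant.py` works this way internally is an assumption, and `build_service_request` with its parameters is a hypothetical helper.

```python
import json
import os
from urllib.request import Request


def build_service_request(domain, service, entity_id, data=None,
                          base_url=None, token=None) -> Request:
    """Build (but do not send) a Home Assistant service-call request."""
    base_url = (base_url or os.environ["HOME_ASSISTANT_URL"]).rstrip("/")
    token = token or os.environ["HOME_ASSISTANT_TOKEN"]
    payload = {"entity_id": entity_id, **(data or {})}
    return Request(
        f"{base_url}/api/services/{domain}/{service}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder URL and token for illustration; nothing is sent.
    req = build_service_request("light", "turn_on", "light.kitchen",
                                {"brightness": 255},
                                base_url="https://ha.example.invalid",
                                token="EXAMPLE_TOKEN")
    print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would execute the call; that step is left out so the sketch has no side effects.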
|
||||
#### List Services
|
||||
```bash
# List all available services
python3 .claude/home-assistant.py services

# Filter by domain
python3 .claude/home-assistant.py services --domain light
python3 .claude/home-assistant.py services --domain climate
```
|
||||
|
||||
#### Send Notifications
|
||||
```bash
python3 .claude/home-assistant.py notify "Door left open!"
python3 .claude/home-assistant.py notify "Motion detected" --title "Security Alert"
python3 .claude/home-assistant.py notify "Hello" --target notify.mobile_app
```
|
||||
|
||||
## SSH Access (ha-sofia only)
|
||||
|
||||
ha-sofia supports SSH for direct configuration management.
|
||||
|
||||
### Connection
|
||||
```bash
|
||||
ssh vbarzin@192.168.1.8
|
||||
```
|
||||
|
||||
### Configuration Path
|
||||
```
|
||||
/config/
|
||||
```
|
||||
|
||||
### Common SSH Tasks
|
||||
```bash
|
||||
# Read configuration
|
||||
ssh vbarzin@192.168.1.8 "cat /config/configuration.yaml"
|
||||
|
||||
# Check HA logs (note: live log is inside HA Core container, not always accessible)
|
||||
ssh vbarzin@192.168.1.8 "tail -50 /config/home-assistant.log.1"
|
||||
|
||||
# List config files
|
||||
ssh vbarzin@192.168.1.8 "ls /config/*.yaml"
|
||||
|
||||
# Read automations/scenes/scripts
|
||||
ssh vbarzin@192.168.1.8 "cat /config/automations.yaml"
|
||||
ssh vbarzin@192.168.1.8 "cat /config/scenes.yaml"
|
||||
ssh vbarzin@192.168.1.8 "cat /config/scripts.yaml"
|
||||
|
||||
# Check secrets (keys only, not values)
|
||||
ssh vbarzin@192.168.1.8 "cut -d: -f1 /config/secrets.yaml"
|
||||
```
|
||||
|
||||
### SSH Limitations
|
||||
- The SSH add-on runs in a separate container — `ha core logs` returns 401
|
||||
- Docker socket is not accessible — can't use `docker logs`
|
||||
- Live `home-assistant.log` may not be visible (written inside HA Core container)
|
||||
- Rotated logs (`.log.1`, `.log.old`) are accessible
|
||||
|
||||
## Complete Example
|
||||
|
||||
To turn on the living room light on ha-london:
|
||||
```bash
|
||||
python3 .claude/home-assistant.py on light.living_room
|
||||
```
|
||||
|
||||
To check ha-sofia configuration:
|
||||
```bash
|
||||
ssh vbarzin@ha-sofia.viktorbarzin.lan "cat /config/configuration.yaml"
|
||||
```
|
||||
|
||||
## Common Entity Domains
|
||||
|
||||
| Domain | Description | Common Actions |
|--------|-------------|----------------|
| `light` | Lights | on, off, toggle, set brightness |
| `switch` | Switches | on, off, toggle |
| `sensor` | Sensors | state (read-only) |
| `binary_sensor` | Binary sensors | state (read-only) |
| `climate` | Thermostats | set temperature, set mode |
| `cover` | Blinds/covers | open, close, set position |
| `lock` | Locks | lock, unlock |
| `media_player` | Media devices | play, pause, volume |
| `input_boolean` | Helper toggles | on, off |
| `input_number` | Helper numbers | set value |
| `input_select` | Helper dropdowns | select option |
| `script` | Scripts | run |
| `scene` | Scenes | activate |
| `automation` | Automations | trigger, on, off |
|
||||
|
||||
## Verification
|
||||
- Commands print confirmation message on success
|
||||
- Use `state` command to verify entity changed
|
||||
- Exit code 0 = success, 1 = error
|
||||
|
||||
## Common Errors
|
||||
|
||||
| Error | Cause | Fix |
|-------|-------|-----|
| `HOME_ASSISTANT_URL and HOME_ASSISTANT_TOKEN must be set` | Env vars not set | Ensure `HOME_ASSISTANT_URL` and `HOME_ASSISTANT_TOKEN` are in the environment |
| `404 Not Found` | Entity doesn't exist | Use `search` command to find correct entity ID |
| `401 Unauthorized` | Token invalid/expired | Generate new long-lived token in HA |
| `Connection refused` | HA not reachable | Check URL and network connectivity |
|
||||
|
||||
## Notes
|
||||
|
||||
1. **Entity IDs are case-sensitive** - use `search` to find exact IDs
|
||||
2. **Token must have sufficient permissions** - ensure token has access to all entities
|
||||
3. **Some entities require specific data** - use `services` command to see required fields
|
||||
4. **Two instances**: ha-london (default, K8s), ha-sofia (SSH + API)
|
||||
5. **ha-sofia SSH**: Uses default SSH key, user `vbarzin`, resolve DNS via `192.168.1.2`. Only reachable from local Sofia network (not remotely).
|
||||
|
||||
---
|
||||
|
||||
## ha-sofia Knowledge Map
|
||||
|
||||
### Overview
|
||||
- **1,087 entities** across 29 domains, **128 devices**, **13 areas**, **43 automations**
|
||||
- **Location**: Sofia, Bulgaria (Вермонт / Vermont neighborhood)
|
||||
- **4 tracked people**: Viktor Barzin, Emil Barzin, Valia Barzina, MQTT
|
||||
|
||||
### Key Systems
|
||||
|
||||
#### 1. Heating & Gas Boiler (EMS-ESP)
|
||||
- Buderus/Bosch gas boiler via EMS-ESP integration
|
||||
- Entities: `sensor.boiler_*`, `number.boiler_*`, `switch.boiler_*`
|
||||
- DHW (hot water), heating curves, burner stats, gas metering
|
||||
- Outside temp: `sensor.boiler_outside_temperature`
|
||||
|
||||
#### 2. Climate / Thermostats (4 rooms + bathroom)
|
||||
| Room | Entity | Bulgarian |
|------|--------|-----------|
| Children's room | `climate.thermostat_children_room` | Детска |
| Office | `climate.thermostat_office_room` | Кабинет |
| Living room | `climate.thermostat_living_room` | Хол |
| Master bedroom | `climate.thermostat_master_bedroom` | род. Спалня |
| Bathroom (Valchedram) | `climate.bania_vlchedrm` | Баня Вълчедръм |
|
||||
|
||||
#### 3. Solar / Photovoltaic (Solarman)
|
||||
- Inverter: `sensor.fv_b_*` (FV = фотоволтаици)
|
||||
- Battery, grid/self-use EMS mode, solar forecast
|
||||
- Energy totals tracked per grid/inverter
|
||||
|
||||
#### 4. ATS (Automatic Transfer Switch)
|
||||
- Grid ↔ inverter switching: `sensor.ats_*`
|
||||
- Load power, grid/inverter voltage, energy totals
|
||||
|
||||
#### 5. Security / Alarm (Paradox EVOHD+)
|
||||
- 3 alarm partitions: Apartment, Garage, Valchedram
|
||||
- PIR zones, door contacts, tamper sensors, PGMs for garage doors/doorbells
|
||||
|
||||
#### 6. Cameras / NVR / Frigate
|
||||
- Hikvision NVR (DS-7632NXI) with 9 cameras
|
||||
- Frigate NVR with object detection:
|
||||
- **Vermont** (home): cameras 10, 15, 16 — car/plate recognition
|
||||
- **Valchedram** (country): cameras 1, 2 — person detection
|
||||
- Object tracking: vehicles (Emo Skoda), cats (Мичка)
|
||||
|
||||
#### 7. Smart Appliances (Home Connect / Bosch-Siemens)
|
||||
| Appliance | Entity prefix | Bulgarian |
|-----------|--------------|-----------|
| Dishwasher | `*.miialna_mashina_*` | Миялна машина |
| Washing machine | `*.peralnia_*` | Пералня (with i-Dos) |
| Dryer | `*.sushilnia_*` | Сушилня |
|
||||
|
||||
#### 8. LED Strip Controllers (6-channel each)
|
||||
- Kitchen upper/lower: `light.kukhnia_*_socket_1-6`
|
||||
- Children's wardrobe: `light.led_detska_garderob_socket_1-6`
|
||||
- Hall wardrobe: `light.led_garderob_khol_socket_1-6`
|
||||
- Corridor wardrobe: `light.led_garderob_koridor_socket_1-6` (offline)
|
||||
- Master bedroom wardrobe: `light.led_garderob_rod_spalnia_socket_1-6` (offline)
|
||||
|
||||
#### 9. Media
|
||||
- Sony BRAVIA XR-65A80L (AirPlay + DLNA)
|
||||
- Marantz ND8006 (AirPlay + DLNA)
|
||||
|
||||
#### 10. Networking
|
||||
- TP-Link Archer AX6000 (main router)
|
||||
- TP-Link Archer MR200 (LTE backup)
|
||||
|
||||
#### 11. UPS
|
||||
- `sensor.ups_*` — battery, load, voltage, remaining time
|
||||
|
||||
#### 12. Ventilation (Pax BLE)
|
||||
- `sensor.ventilator_mokro_2_*` — bathroom fan with humidity/light sensors
|
||||
|
||||
#### 13. Synology NAS
|
||||
- **NAS_Barzini**: CPU 2%, Memory 26%, 2 drives (39C/41C)
|
||||
- Volume 1: 87.2% used (5.02 TB), status "attention"
|
||||
- DSM update available
|
||||
|
||||
#### 14. Printer
|
||||
- **HP ColorLaserJet M253-M254**: Black 49%, Cyan 88%, Magenta 91%, Yellow 90%
|
||||
|
||||
#### 15. Dell R730 Server (via iDRAC)
|
||||
- CPU temp 57C, Power 192W, Inlet 24C, Exhaust 29C
|
||||
- Tesla T4 GPU: 41C, 4% util, 4183MB VRAM, 32W
|
||||
|
||||
#### 16. Other Devices
|
||||
- **Dehumidifier** (Tuya): `humidifier.arete_*`
|
||||
- **Robot vacuum** (Rumi): `vacuum.rumi` — docked, 100% battery, 227 missions
|
||||
- **Tuya lights**: `light.krushka_*` (4 bulbs, currently offline)
|
||||
- **AC unit** (MELCloud): `climate.klimatik` — off, 23C
|
||||
- **Mistral AI**: Conversation integration (Devstral 2)
|
||||
|
||||
### Integrations
|
||||
HACS, ESPHome, Frigate, Home Connect, Paradox (PAI), Solarman, Pax BLE, Hikvision, InfluxDB, Mosquitto MQTT, Node-RED, Music Assistant, Zigbee2MQTT, Spook, Xtend Tuya, MELCloud, Synology DSM, HP Printer (IPP)
|
||||
|
||||
### Add-ons
|
||||
Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Frigate, PAI, Music Assistant, ESPHome, Ookla Speedtest, HA USB/IP Client, **Home Assistant Version Control**
|
||||
|
||||
### Version Control (Git Config Tracking)
|
||||
- **Add-on**: Home Assistant Version Control v1.2.0 (slug: `4ab554b2_home-assistant-version-control`)
|
||||
- **Add-on repo**: `https://github.com/saihgupr/ha-addons`
|
||||
- **What it does**: Auto-tracks every config file change via git. File watcher (inotify) detects changes, debounces (5s default), commits automatically.
|
||||
- **Tracked files**: `.yaml`, `.yml`, `.json`, `.conf`, `.sh`, `.py` + `.storage/` (lovelace dashboards, entity/device registries, config entries)
|
||||
- **Excluded**: `secrets.yaml`, database files (`.db`), logs, `__pycache__`, binary files
|
||||
- **Git repo**: `/homeassistant/.git` (owned by root; SSH user needs `git config --global --add safe.directory /homeassistant`)
|
||||
- **GitHub remote**: `https://github.com/ViktorBarzin/ha-sofia-config` (private). Auth token from Vault `secret/viktor` key `github_pat`. Cloud sync pushes hourly.
|
||||
- **Web UI**: Sidebar → "Version Control", or Settings → Add-ons → HA Version Control → Open Web UI. Ingress URL: `/api/hassio_ingress/PYR_EdVzPtzZdRnGjrhI3qbGogCVJ18FrtOg6oaBf-w/`
|
||||
- **Features**: Browse commit history with diffs, restore individual files or full config to any point, delete recovery, smart reloads after restore
|
||||
- **API**: `POST /api/git/add-all-and-commit` (manual backup), `GET /api/git/history` (commit log), `POST /api/restore-file` (restore single file), `POST /api/restore-commit` (full rollback)
|
||||
- **SSH git access**: `ssh vbarzin@192.168.1.8 'git -C /homeassistant log --oneline -10'`
|
||||
|
||||
### Music Assistant (MASS)
|
||||
- **Addon slug**: `d5369777_music_assistant`
|
||||
- **Version**: 2.7.8
|
||||
- **Web UI**: `http://192.168.1.8:8095`
|
||||
- **Container name**: `addon_d5369777_music_assistant`
|
||||
- **Providers**: Spotify (OAuth PKCE + librespot), TuneIn Radio, RadioBrowser, BBC Sounds, Radio Paradise, Filesystem (remote share)
|
||||
- **Player providers**: UPnP/DLNA, AirPlay, Sendspin (port 8927)
|
||||
- **Registered players**: Marantz ND8006 (DLNA + AirPlay), Sony BRAVIA XR-65A80L (AirPlay), Web (Chrome)
|
||||
- **Librespot cache**: `/data/.cache/spotify--5s3mSP8y/credentials.json` (inside addon container)
|
||||
- **Troubleshooting**: See skill `music-assistant-librespot-wrong-account` for Spotify playback failures
|
||||
- **SSH addon access to container**: `sudo curl -s --unix-socket /run/docker.sock http://localhost/containers/<id>/exec` (requires sudo)
|
||||
|
||||
### Zones
|
||||
- **Вермонт** (Vermont) — Home
|
||||
- **Вълчедръм** (Valchedram) — Country house
|
||||
|
||||
### Bulgarian ↔ English Room Names
|
||||
| Bulgarian | English | Entity prefix |
|-----------|---------|---------------|
| Детска | Children's room | `detska` |
| Кабинет | Office | `kabinet` |
| Хол | Living room | `khol` |
| Спалня / род. Спалня | Master bedroom | `rod_spalnia` |
| Кухня | Kitchen | `kukhnia` |
| Коридор | Corridor | `koridor` |
| Баня | Bathroom | `bania` |
| Гараж | Garage | `garaj` |
| Мазе | Basement | `maze` |
|
||||
|
||||
---
|
||||
|
||||
## ha-london Knowledge Map
|
||||
|
||||
### Overview
|
||||
- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi)
|
||||
- **Location**: London, UK
|
||||
- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone)
|
||||
- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access)
|
||||
- **Config path**: `/config/` (requires `sudo` for file access)
|
||||
- **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea
|
||||
- **Zone**: London (home)
|
||||
|
||||
### Key Systems
|
||||
|
||||
#### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring
|
||||
Named plugs with power/energy tracking:
|
||||
|
||||
| Name | Entity | Usage/month | Purpose |
|------|--------|-------------|---------|
| Thor | `switch.thor` | 6.4 kWh | Server/NAS |
| Pikkachu | `switch.pikkachu` | 4.8 kWh | Water cooler |
| Michelle | `switch.emeter_plug` | 0.3 kWh | — |
| Livia | `switch.livia` | 0.07 kWh | — |
| Jinx | `switch.jinx` | 0.02 kWh | — |
| Projector plug | `switch.tapo_p100` | unavailable | Tapo P100 |
|
||||
|
||||
#### 2. Air Quality (Apollo AIR-1 via ESPHome)
|
||||
- `sensor.apollo_air_1_fa2d34_co2`: CO2 level
|
||||
- `sensor.apollo_air_1_fa2d34_sen55_temperature`: Temperature
|
||||
- `sensor.apollo_air_1_fa2d34_sen55_humidity`: Humidity
|
||||
- PM1.0/2.5/4.0/10 particulate sensors
|
||||
- VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors
|
||||
|
||||
#### 3. Cowboy E-Bike
|
||||
- `sensor.bike_state_of_charge`: Battery %
|
||||
- `sensor.bike_total_distance`: Total km
|
||||
- `sensor.bike_total_co2_saved`: CO2 saved (grams)
|
||||
|
||||
#### 4. Uptime Monitoring (UptimeRobot)
|
||||
- `sensor.blog`: blog uptime
|
||||
- `sensor.valchedrym`: Valchedram site uptime
|
||||
- `switch.blog`, `switch.valchedrym`: monitoring toggles
|
||||
|
||||
#### 5. Oral-B Toothbrush (BLE)
|
||||
- `sensor.smart_series_6000_83d3_*`: mode, pressure, sector, time
|
||||
|
||||
#### 6. Network Device Tracking (~100 devices)
|
||||
- Router-based MAC tracking (many unnamed)
|
||||
- Named: Viktor's iPhone15Pro, Anca's iPhone13Pro, Apple Watch, Amazon Fire, iRobot, Portal, Living-Room TV
|
||||
|
||||
#### 7. Media & Entertainment
|
||||
- Projector + debug bridge: unavailable (Tapo plug off)
|
||||
- Scripts: `script.start_netflix`, `script.start_stremio`
|
||||
- Scene: `scene.night` (turns off Livia + Michelle plugs)
|
||||
|
||||
### Custom Components
|
||||
- **cowboy**: Cowboy e-bike integration (HACS)
|
||||
- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS)
|
||||
|
||||
### Integrations
|
||||
ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB
|
||||
|
||||
### AI / Voice Assistants
|
||||
- 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air
|
||||
- Local voice: Piper (TTS) + Whisper (STT)
|
||||
- Google Translate TTS
|
||||
|
||||
### Automations (10)
|
||||
- Water cooler on/off scheduling (07:00 on, 00:30 off)
|
||||
- Michelle plug auto-off when idle (<70W)
|
||||
- Apollo AIR-1 RGB LED: CO2 indicator (on in morning, off at 22:00)
|
||||
- Cowboy e-bike low battery notification (ntfy + iPhone push)
|
||||
- Anca arrival/departure notifications
|
||||
- Night scene: turns off Livia + Michelle
|
||||
|
||||
### Docker Setup
|
||||
```bash
docker run -d --name homeassistant --privileged \
  -e TZ=Europe/London \
  -v /home/pi/docker/homeAssistant:/config \
  -v /run/dbus:/run/dbus:ro \
  --network=host --restart=unless-stopped \
  homeassistant/home-assistant:2025.9
```
|
||||
|
||||
### SSH Access
|
||||
```bash
# Read config
ssh hassio@192.168.8.103 "sudo cat /config/configuration.yaml"

# Check logs
ssh hassio@192.168.8.103 "sudo docker logs homeassistant --tail 50"

# Restart HA via API (preferred)
curl -s -X POST "http://192.168.8.103:8123/api/services/homeassistant/restart" \
  -H "Authorization: Bearer ${HOME_ASSISTANT_LONDON_TOKEN}"
```
|
||||
151
.claude/skills/k8s-ndots-search-domain-nxdomain-flood/SKILL.md
Normal file
|
|
@ -0,0 +1,151 @@
|
|||
---
|
||||
name: k8s-ndots-search-domain-nxdomain-flood
|
||||
description: |
|
||||
Fix for massive NxDomain query floods to external DNS servers caused by Kubernetes
|
||||
ndots:5 search domain expansion. Use when: (1) DNS server shows low cache hit rate
|
||||
with 60%+ NxDomain responses, (2) DNS logs show queries like
|
||||
"service.namespace.svc.cluster.local.yourdomain.lan", (3) external DNS receives
|
||||
thousands of junk queries per hour for non-existent names ending in your search
|
||||
domain, (4) DNS cache hit ratio is unexpectedly low despite stable workloads.
|
||||
Applies to any Kubernetes cluster using CoreDNS with a custom DNS search domain.
|
||||
author: Claude Code
|
||||
version: 1.1.0
|
||||
date: 2026-02-17
|
||||
---
|
||||
|
||||
# Kubernetes ndots:5 Search Domain NxDomain Flood
|
||||
|
||||
## Problem
|
||||
Kubernetes pods have `ndots:5` and a custom search domain (e.g., `viktorbarzin.lan`)
|
||||
in their `/etc/resolv.conf`. When resolving internal service names like
|
||||
`redis.redis.svc.cluster.local` (4 dots < ndots:5), glibc tries all search domain
|
||||
suffixes before the absolute name. This generates queries like:
|
||||
|
||||
1. `redis.redis.svc.cluster.local.namespace.svc.cluster.local` (CoreDNS handles, NxDomain)
|
||||
2. `redis.redis.svc.cluster.local.svc.cluster.local` (CoreDNS handles, NxDomain)
|
||||
3. `redis.redis.svc.cluster.local.cluster.local` (CoreDNS handles, NxDomain)
|
||||
4. `redis.redis.svc.cluster.local.yourdomain.lan` (CoreDNS **forwards to external DNS**, NxDomain)
|
||||
5. `redis.redis.svc.cluster.local` (finally resolves)
|
||||
|
||||
Step 4 is the problem: CoreDNS forwards `*.yourdomain.lan` queries to the external DNS
|
||||
server, flooding it with junk NxDomain requests. With hundreds of pods making DNS lookups,
|
||||
this generates tens of thousands of useless queries per day.
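The expansion order described above can be sketched in a few lines of Python. This is a minimal model of glibc-style search-list handling, not the resolver's actual code; the function name `candidate_queries` is hypothetical, and the `search` list is copied from the numbered example above.

```python
def candidate_queries(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Return the fully qualified names a glibc-style resolver tries, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search expansion.
        return [name]
    expanded = [f"{name}.{suffix}." for suffix in search]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [absolute] + expanded
    # Fewer dots than ndots: search list first, absolute name last.
    return expanded + [absolute]


# Search list assumed from the example in this section.
search = [
    "namespace.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "yourdomain.lan",
]
for query in candidate_queries("redis.redis.svc.cluster.local", search):
    print(query)
```

Only the fourth candidate ends in `yourdomain.lan.`, which is the one CoreDNS forwards to the external server. Passing a name with a trailing dot returns a single candidate, which is why the trailing-dot workaround avoids the flood for that lookup.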
|
||||
|
||||
## Context / Trigger Conditions
|
||||
- DNS server (e.g., Technitium, Pi-hole, BIND) shows high NxDomain percentage (50%+)
|
||||
- DNS cache hit rate is unexpectedly low
|
||||
- DNS logs show queries ending in `*.svc.cluster.local.yourdomain.lan`
|
||||
- CoreDNS Corefile has a server block forwarding `yourdomain.lan` to an external DNS
|
||||
- Node resolv.conf has `search yourdomain.lan` (set by DHCP)
|
||||
- Top DNS clients by query volume are Kubernetes node IPs (not pod IPs), because
|
||||
CoreDNS forwards via NodePort and the source IP becomes the node IP
|
||||
|
||||
## Solution
|
||||
|
||||
### Step 1: Confirm the problem
|
||||
Check DNS query logs for the pattern:
|
||||
```bash
# Enable Technitium query logging temporarily
# API: /api/settings/set?token=TOKEN&enableLogging=true&logQueries=true&loggingType=File

# Check for junk queries (run through sh so the glob expands inside the pod,
# not in the local shell)
kubectl exec -n technitium PODNAME -- sh -c 'grep "cluster.local.yourdomain" /etc/dns/logs/*.log'
```
|
||||
|
||||
### Step 2: Add generic CoreDNS template regex (RECOMMENDED)
|
||||
|
||||
Instead of creating specific catch-all blocks for each junk suffix pattern, add a single
|
||||
`template` directive with a regex inside the `yourdomain.lan` server block. This catches
|
||||
ALL multi-label junk queries (e.g., `*.cluster.local.yourdomain.lan`,
|
||||
`*.yourdomain.lan.yourdomain.lan`, `www.cloudflare.com.yourdomain.lan`) in one rule:
|
||||
|
||||
```
yourdomain.lan:53 {
    errors
    template ANY ANY yourdomain.lan {
        match ".*\..*\.yourdomain\.lan\.$"
        rcode NXDOMAIN
        fallthrough
    }
    forward . <your-dns-server-ip>
    cache {
        success 10000 300 6
        denial 10000 300 60
    }
}
```
|
||||
|
||||
**How it works**: The regex `.*\..*\.yourdomain\.lan\.$` matches any query with 2+ labels
|
||||
before `.yourdomain.lan` — meaning only single-label queries like `idrac.yourdomain.lan`
|
||||
fall through to the real DNS server. All junk multi-label queries get instant NXDOMAIN.
|
||||
|
||||
**Important**: The `fallthrough` directive is required so that legitimate single-label
|
||||
queries (which don't match the regex) continue to the `forward` plugin.
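To check which names the template would short-circuit before touching the Corefile, the pattern can be exercised offline. The sketch below is an illustration only: CoreDNS evaluates the regex with Go's RE2 engine, and this particular pattern happens to behave the same under Python's `re`. The helper name `is_junk` and the sample names are assumptions.

```python
import re

# Same pattern as the `match` line in the Corefile above.
JUNK = re.compile(r".*\..*\.yourdomain\.lan\.$")


def is_junk(qname: str) -> bool:
    """True if the template would answer NXDOMAIN for this fully qualified name."""
    return JUNK.match(qname) is not None


samples = [
    "idrac.yourdomain.lan.",
    "redis.redis.svc.cluster.local.yourdomain.lan.",
    "www.cloudflare.com.yourdomain.lan.",
    "host.sub.yourdomain.lan.",
]
for qname in samples:
    print(qname, "NXDOMAIN" if is_junk(qname) else "forwarded")
```

One consequence worth keeping in mind: any name with two or more labels in front of `yourdomain.lan`, including a legitimate one such as the hypothetical `host.sub.yourdomain.lan`, is also answered with NXDOMAIN. The rule is only safe when every real record in the zone is a single label.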
|
||||
|
||||
#### Alternative: Specific catch-all blocks (DEPRECATED)
|
||||
|
||||
The older approach used separate server blocks per junk suffix pattern:
|
||||
|
||||
```
cluster.local.yourdomain.lan:53 {
    errors
    template ANY ANY {
        rcode NXDOMAIN
    }
    cache {
        denial 10000 3600
    }
}
```
|
||||
|
||||
This requires adding a new block for each pattern and doesn't catch arbitrary junk queries
|
||||
like `www.cloudflare.com.yourdomain.lan`. The generic regex approach above is preferred.
|
||||
|
||||
### Step 3: Apply the CoreDNS ConfigMap
|
||||
```bash
|
||||
kubectl apply -f coredns-configmap.yaml
|
||||
# CoreDNS auto-reloads via the `reload` plugin (default 30s)
|
||||
```
|
||||
|
||||
### Step 4: Manage in Terraform (this cluster)
|
||||
The CoreDNS ConfigMap is managed in `modules/kubernetes/technitium/main.tf` as
|
||||
`kubernetes_config_map.coredns`. To import an existing ConfigMap:
|
||||
```bash
|
||||
terraform import 'module.kubernetes_cluster.module.technitium["technitium"].kubernetes_config_map.coredns' 'kube-system/coredns'
|
||||
```
|
||||
|
||||
## Verification
|
||||
1. Test that the template returns NXDOMAIN instantly:
|
||||
```bash
kubectl run dns-test --rm -i --restart=Never --image=busybox -- \
  nslookup redis.redis.svc.cluster.local.yourdomain.lan 10.96.0.10
# Should return NXDOMAIN immediately
```
|
||||
|
||||
2. Check DNS logs - no more `*.cluster.local.yourdomain.lan` queries to external DNS
|
||||
3. NxDomain percentage on external DNS should drop significantly within an hour
|
||||
|
||||
## Additional Fix: Enable DNS Cache Persistence
|
||||
If the DNS server (Technitium) loses its cache on pod restart, enable `saveCache`:
|
||||
```
|
||||
/api/settings/set?token=TOKEN&saveCache=true
|
||||
```
|
||||
This prevents the cache hit rate from resetting to zero after every restart.
|
||||
|
||||
## Notes
|
||||
- The same `ndots:5` issue also causes `*.yourdomain.lan.yourdomain.lan` (double suffix)
|
||||
and `*.yourdomain.me.yourdomain.lan` patterns — the generic regex catches all of these
|
||||
- The top DNS client IPs will be the **node IPs** (not pod IPs) because CoreDNS forwards
|
||||
via NodePort, and the source becomes the node's IP
|
||||
- `ndots:5` is the Kubernetes default and shouldn't be changed cluster-wide as it breaks
|
||||
short-name service resolution
|
||||
- Individual pods can set `dnsConfig.options: [{name: ndots, value: "2"}]` to reduce
|
||||
search domain lookups, but this is a per-pod opt-in
|
||||
- Prometheus scrape targets using `.yourdomain.lan` hostnames should add a trailing dot
|
||||
(e.g., `idrac.yourdomain.lan.:161`) to bypass ndots expansion entirely
|
||||
- ExternalName services don't need trailing dots — the generic template regex handles them
|
||||
|
||||
## See also
|
||||
- `pfsense-dnsmasq-interface-binding` — Related: preserve client IPs for DNS port forwarding
|
||||
- `crowdsec-agent-registration-failure` — another common K8s DNS-adjacent issue
|
||||
- `loki-helm-deployment-pitfalls` — Loki deployment patterns
|
||||
194
.claude/skills/pfsense/SKILL.md
Normal file
|
|
@ -0,0 +1,194 @@
|
|||
---
|
||||
name: pfsense
|
||||
description: |
|
||||
Manage the pfSense firewall at 10.0.20.1 via SSH. Use when:
|
||||
(1) User asks about firewall rules, NAT, port forwarding,
|
||||
(2) User asks about network diagnostics (ARP, routing, DNS, ping),
|
||||
(3) User asks about DHCP leases or static mappings,
|
||||
(4) User asks about VPN status (WireGuard, Tailscale),
|
||||
(5) User asks about pfSense services (Snort, FRR/BGP/OSPF, etc.),
|
||||
(6) User asks about firewall states, connections, or traffic,
|
||||
(7) User mentions "pfsense", "firewall", "gateway", or network troubleshooting,
|
||||
(8) User wants to check system health (CPU, memory, disk, temp) of pfSense.
|
||||
pfSense CE 2.7.2 on FreeBSD 14.0, VMID 101 on Proxmox.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-14
|
||||
---
|
||||
|
||||
# pfSense Firewall Management
|
||||
|
||||
## Overview
|
||||
- **Host**: `10.0.20.1` (Kubernetes VLAN gateway)
|
||||
- **SSH**: `ssh admin@10.0.20.1`
|
||||
- **Version**: pfSense CE 2.7.2, FreeBSD 14.0
|
||||
- **Proxmox VMID**: 101 (8 CPU, 16GB RAM, 32G disk)
|
||||
- **Web UI**: `https://pfsense.viktorbarzin.me` (via reverse proxy) or `https://10.0.20.1`
|
||||
- **Installed packages**: FRR (BGP/OSPF), Tailscale, Snort, WireGuard, REST API, FreeRADIUS
|
||||
|
||||
## Interfaces
|
||||
|
||||
| Name | Description | Physical | IP | Network |
|------|-------------|----------|-----|---------|
| wan | WAN | vtnet0 | 192.168.1.2/24 | Physical network |
| lan | Management VMs | vtnet1 | 10.0.10.1/24 | VLAN 10 |
| opt1 | Kubernetes | vtnet2 | 10.0.20.1/24 | VLAN 20 |
| opt2 | WireGuard | tun_wg0 | 10.3.2.1/24 | VPN tunnel |
| tailscale0 | Tailscale | tailscale0 | 100.64.0.x | Headscale mesh |
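When reading ARP tables or firewall states, it can help to map an address back to a row of the interface table. The following is a small sketch using Python's standard `ipaddress` module; the function name `interface_for` is hypothetical, and the Tailscale row is left out because the table only gives `100.64.0.x` rather than a prefix.

```python
import ipaddress
from typing import Optional

# Interface addresses copied from the table above (Tailscale omitted).
INTERFACES = {
    "wan": ipaddress.ip_interface("192.168.1.2/24"),
    "lan": ipaddress.ip_interface("10.0.10.1/24"),
    "opt1": ipaddress.ip_interface("10.0.20.1/24"),
    "opt2": ipaddress.ip_interface("10.3.2.1/24"),
}


def interface_for(ip: str) -> Optional[str]:
    """Return the pfSense interface whose subnet contains `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    for name, iface in INTERFACES.items():
        if addr in iface.network:
            return name
    return None


for ip in ("10.0.20.15", "10.3.2.7", "8.8.8.8"):
    print(ip, "->", interface_for(ip))
```

An address that matches no row (such as a public IP) returns `None`, which usually means the traffic leaves through the WAN gateway.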
|
||||
|
||||
## CLI Script
|
||||
|
||||
**Script**: `.claude/pfsense.py`
|
||||
|
||||
### Execution Pattern
|
||||
```bash
|
||||
cd ~/code/infra && python3 .claude/pfsense.py <command> [options]
|
||||
```
|
||||
|
||||
### Available Commands
|
||||
|
||||
#### System Information
|
||||
```bash
|
||||
python3 .claude/pfsense.py status # Full system overview
|
||||
python3 .claude/pfsense.py uptime # Uptime
|
||||
python3 .claude/pfsense.py cpu # CPU info and load
|
||||
python3 .claude/pfsense.py memory # Memory breakdown
|
||||
python3 .claude/pfsense.py disk # Disk usage
|
||||
python3 .claude/pfsense.py temp # CPU temperature
|
||||
python3 .claude/pfsense.py pkg-list # Installed packages
|
||||
```
|
||||
|
||||
#### Network & Interfaces
|
||||
```bash
|
||||
python3 .claude/pfsense.py interfaces # Interface list with IPs
|
||||
python3 .claude/pfsense.py gateways # Gateway status
|
||||
python3 .claude/pfsense.py arp # ARP table
|
||||
python3 .claude/pfsense.py routes # Routing table
|
||||
python3 .claude/pfsense.py dns-resolve <host> # DNS lookup via pfSense
|
||||
python3 .claude/pfsense.py diag <host> # Ping test
|
||||
```
|
||||
|
||||
#### Firewall
|
||||
```bash
|
||||
python3 .claude/pfsense.py rules # All firewall rules
|
||||
python3 .claude/pfsense.py rules opt1 # Rules for Kubernetes interface
|
||||
python3 .claude/pfsense.py nat # NAT / port forwarding rules
|
||||
python3 .claude/pfsense.py aliases # List all aliases
|
||||
python3 .claude/pfsense.py alias <name> # Show alias members
|
||||
python3 .claude/pfsense.py states # State table summary
|
||||
python3 .claude/pfsense.py states-top 20 # Top 20 IPs by connection count
|
||||
```
|
||||
|
||||
#### DHCP
|
||||
```bash
|
||||
python3 .claude/pfsense.py dhcp-leases # All DHCP leases
|
||||
python3 .claude/pfsense.py dhcp-leases opt1 # Kubernetes network leases only
|
||||
```
|
||||
|
||||
#### Services
|
||||
```bash
|
||||
python3 .claude/pfsense.py services # List all services + status
|
||||
python3 .claude/pfsense.py service restart snort # Restart a service
|
||||
python3 .claude/pfsense.py service stop wireguard # Stop a service
|
||||
python3 .claude/pfsense.py service start wireguard # Start a service
|
||||
```
|
||||
|
||||
#### VPN & Routing
|
||||
```bash
|
||||
python3 .claude/pfsense.py wireguard # WireGuard tunnel status
|
||||
python3 .claude/pfsense.py tailscale # Tailscale/Headscale status
|
||||
python3 .claude/pfsense.py bgp # BGP summary (FRR)
|
||||
python3 .claude/pfsense.py ospf # OSPF neighbors (FRR)
|
||||
```
|
||||
|
||||
#### Security
|
||||
```bash
|
||||
python3 .claude/pfsense.py snort # Snort IDS status + recent alerts
|
||||
python3 .claude/pfsense.py logs # Last 50 firewall log entries
|
||||
python3 .claude/pfsense.py logs 200 # Last 200 entries
|
||||
python3 .claude/pfsense.py logs-filter "blocked" # Search logs
|
||||
```
|
||||
|
||||
#### Advanced
|
||||
```bash
|
||||
python3 .claude/pfsense.py pfctl "-sr" # Raw pfctl command
|
||||
python3 .claude/pfsense.py php "echo phpversion();" # Run PHP on pfSense
|
||||
python3 .claude/pfsense.py raw "ls /tmp" # Run arbitrary shell command
|
||||
python3 .claude/pfsense.py backup # Dump config.xml to stdout
|
||||
```
|
||||
|
||||
## Direct SSH Access
|
||||
|
||||
For tasks not covered by the script, SSH directly:
|
||||
```bash
|
||||
ssh admin@10.0.20.1 "<command>"
|
||||
```
|
||||
|
||||
### Useful Direct Commands
|
||||
```bash
|
||||
# pfSense PHP shell (interactive config access)
|
||||
ssh admin@10.0.20.1 "php -r 'require_once(\"config.inc\"); \$cfg = parse_config(true); echo json_encode(\$cfg[\"nat\"], JSON_PRETTY_PRINT);'"
|
||||
|
||||
# pfSsh.php playback commands
|
||||
ssh admin@10.0.20.1 "pfSsh.php playback gatewaystatus"
|
||||
ssh admin@10.0.20.1 "pfSsh.php playback svc restart snort"
|
||||
ssh admin@10.0.20.1 "pfSsh.php playback listpkg"
|
||||
|
||||
# Config sections via PHP
|
||||
ssh admin@10.0.20.1 "php -r 'require_once(\"config.inc\"); \$cfg = parse_config(true); print_r(\$cfg[\"filter\"][\"rule\"][0]);'"
|
||||
|
||||
# FRR/vtysh for routing
|
||||
ssh admin@10.0.20.1 "/usr/local/bin/vtysh -c 'show ip route'"
|
||||
ssh admin@10.0.20.1 "/usr/local/bin/vtysh -c 'show bgp ipv4 unicast'"
|
||||
```
|
||||
|
||||
## REST API (pfSense-pkg-RESTAPI v2.2)
|
||||
|
||||
The REST API package is installed but **no API keys are configured**. To use it:
|
||||
1. Create an API key in pfSense Web UI: System > REST API > Settings > Keys
|
||||
2. Send the key in the `X-API-Key` header (Bearer tokens are for JWT auth, not API keys): `curl -sk https://10.0.20.1/api/v2/status/system -H 'X-API-Key: <key>'`
|
||||
|
||||
Until API keys are set up, use SSH for all operations.
|
||||
|
||||
## Key Services
|
||||
|
||||
| Service | Status | Notes |
|---------|--------|-------|
| FRR (BGP/OSPF) | Running | Routing daemon |
| Snort | Running | IDS/IPS |
| WireGuard | Running | VPN tunnel (10.3.2.0/24) |
| Tailscale | Running | Mesh VPN via Headscale |
| FreeRADIUS | Running | RADIUS auth |
| DHCP (Kea) | Running | kea-dhcp4 |
| SSH | Running | Admin access |
| NTP | Running | Time sync |
|
||||
|
||||
## Firewall Stats
|
||||
- **167 firewall rules** (pfctl -sr)
|
||||
- **154 NAT rules** (pfctl -sn)
|
||||
- **~784 active states** (varies)
|
||||
- **10 aliases** (LAN, OPT1, OPT2, WAN networks + custom)
|
||||
|
||||
## NFS Backup
|
||||
Config backups stored at NFS: `/mnt/main/pfsense-backup`
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Issue | Command |
|-------|---------|
| Can't reach internet from K8s | `python3 .claude/pfsense.py gateways` + `python3 .claude/pfsense.py diag 8.8.8.8` |
| K8s pod can't reach external | `python3 .claude/pfsense.py rules opt1` + check NAT |
| DHCP not working | `python3 .claude/pfsense.py dhcp-leases opt1` + `python3 .claude/pfsense.py service restart kea-dhcp4` |
| High connection count | `python3 .claude/pfsense.py states-top 20` |
| Snort blocking traffic | `python3 .claude/pfsense.py snort` + check alerts |
| DNS resolution failing | `python3 .claude/pfsense.py dns-resolve <host>` |
| BGP/OSPF routes missing | `python3 .claude/pfsense.py bgp` or `python3 .claude/pfsense.py ospf` |
| WireGuard tunnel down | `python3 .claude/pfsense.py wireguard` |
|
||||
|
||||
## Notes
|
||||
1. **FreeBSD-based**: Commands differ from Linux (no `ip`, use `ifconfig`, `netstat`, `arp`)
|
||||
2. **pfctl is the firewall**: Rules loaded from config.xml via PHP, managed by pfctl
|
||||
3. **Config file**: `/cf/conf/config.xml` — all pfSense config in one XML file
|
||||
4. **PHP shell**: pfSense uses PHP for all config management; `config.inc` loads the config
|
||||
5. **Do NOT edit config.xml directly** — use the Web UI or PHP functions that properly reload services
|
||||
6. **Logs**: Plain-text files under `/var/log/` (pfSense 2.5 and later replaced the old binary circular `clog` format); read with `tail -f /var/log/<logfile>`
|
||||
78
.claude/skills/post-mortem/skill.md
Normal file
|
|
@ -0,0 +1,78 @@
|
|||
# Post-Mortem Writer
|
||||
|
||||
Generate a structured post-mortem document after an incident mitigation session.
|
||||
|
||||
## When to use
|
||||
- After `/post-mortem` command
|
||||
- Auto-suggested when cluster health transitions from UNHEALTHY → HEALTHY
|
||||
|
||||
## Instructions
|
||||
|
||||
1. **Gather context**:
|
||||
- Run `.claude/scripts/sev-context.sh` to capture current cluster state
|
||||
- Review the conversation history for: what broke, timeline, root cause, what was fixed
|
||||
- Check existing post-mortems at `docs/post-mortems/` for format reference
|
||||
|
||||
2. **Generate the post-mortem**:
|
||||
- Use the template at `.claude/skills/post-mortem/template.md`
|
||||
- Fill in all sections from the investigation context
|
||||
- **Critical**: In the Prevention Plan tables, set the `Type` column correctly:
|
||||
- `Alert` — add/modify Prometheus alerting rules (auto-implementable)
|
||||
- `Config` — change Terraform config, NFS options, etc. (auto-implementable)
|
||||
- `Monitor` — add Uptime Kuma monitors (auto-implementable)
|
||||
- `Architecture` — storage migration, stack redesign (human-only)
|
||||
- `Investigation` — needs further research (human-only)
|
||||
- `Runbook` — document a procedure (human-only)
|
||||
- `Migration` — data or service migration (human-only)
|
||||
- Items already fixed during the session should have Status = `Done`
|
||||
- Items not yet done should have Status = `TODO`
|
||||
|
||||
3. **File naming**: `docs/post-mortems/<YYYY-MM-DD>-<slug>.md`
|
||||
- Slug: lowercase, hyphenated, max 5 words describing the incident
|
||||
|
||||
4. **Update index**: Add an entry to `docs/post-mortems/index.html`
|
||||
- Add a new card in the incidents grid with date, severity tag, title, description
|
||||
|
||||
5. **Link to GitHub Issue** (if an issue exists for this incident):
|
||||
- Fill in the `Issue` field in the template metadata table with `[#N](https://github.com/ViktorBarzin/infra/issues/N)`
|
||||
- Add a comment to the GitHub Issue linking the postmortem:
|
||||
```bash
GITHUB_TOKEN=$(vault kv get -field=github_pat secret/viktor)
curl -s -X POST \
  -H "Authorization: token $GITHUB_TOKEN" \
  -H "Accept: application/vnd.github.v3+json" \
  "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/comments" \
  -d '{"body": "**Postmortem:** [View postmortem](https://viktorbarzin.github.io/infra/post-mortems/<YYYY-MM-DD>-<slug>)"}'
```
|
||||
- Add the `postmortem-done` label and remove `postmortem-required`:
|
||||
```bash
curl -s -X POST \
  -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels" \
  -d '{"labels": ["postmortem-done"]}'
curl -s -X DELETE \
  -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/ViktorBarzin/infra/issues/<N>/labels/postmortem-required"
```
|
||||
- If no issue exists, create one with labels `incident`, `sev<N>`, `postmortem-done`
|
||||
|
||||
6. **Commit and push**:
|
||||
```bash
git add docs/post-mortems/<file>.md docs/post-mortems/index.html
git commit -m "docs: post-mortem for <date> <title> [ci skip]"
git push origin master
```
|
||||
- Use `[ci skip]` to avoid triggering app-stacks pipeline
|
||||
- NOTE: The postmortem-todos Woodpecker pipeline WILL trigger (it has its own path filter)
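The file-naming rule from step 3 (date prefix, lowercase hyphenated slug, at most five words) can be sketched as a small helper. This is an illustration under assumptions: the function name `postmortem_path` and the sample title are hypothetical, and how punctuation is handled is a choice made here, not something the skill specifies.

```python
import re
from datetime import date


def postmortem_path(incident_date: date, title: str, max_words: int = 5) -> str:
    """Build docs/post-mortems/<YYYY-MM-DD>-<slug>.md from a free-form title."""
    # Keep only lowercase alphanumeric runs; punctuation acts as a separator.
    words = re.findall(r"[a-z0-9]+", title.lower())
    slug = "-".join(words[:max_words])
    return f"docs/post-mortems/{incident_date.isoformat()}-{slug}.md"


# Hypothetical incident title, used only to exercise the helper.
print(postmortem_path(date(2026, 2, 17), "NFS mount stall took down Postgres and Redis"))
```

Truncating to the first five words keeps file names short; a human may still want to pick more descriptive words than a blind truncation does.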
|
||||
|
||||
## Type Reference for Prevention Plan
|
||||
|
||||
| Type | Auto-implementable? | Examples |
|------|---------------------|----------|
| Alert | Yes | Add PrometheusRule, modify alert thresholds |
| Config | Yes | Change Terraform variables, mount options, CronJob schedules |
| Monitor | Yes | Add Uptime Kuma HTTP/TCP monitor |
| Architecture | No | Migrate storage class, redesign HA topology |
| Investigation | No | Research kernel bug, check Proxmox forum |
| Runbook | No | Document recovery procedure |
| Migration | No | Move data between storage backends |
|
||||
86
.claude/skills/post-mortem/template.md
Normal file
|
|
@ -0,0 +1,86 @@
|
|||
# Post-Mortem: <TITLE>
|
||||
|
||||
| Field | Value |
|-------|-------|
| **Date** | <DATE> |
| **Duration** | <DURATION> |
| **Severity** | <SEV1/SEV2/SEV3> |
| **Affected Services** | <COUNT> pods across <COUNT> namespaces |
| **Issue** | [#N](https://github.com/ViktorBarzin/infra/issues/N) |
| **Status** | Draft |
|
||||
|
||||
## Summary
|
||||
|
||||
<1-2 sentence summary of the incident.>
|
||||
|
||||
## Impact
|
||||
|
||||
- **User-facing**: <What users experienced>
|
||||
- **Blast radius**: <How many services/pods/namespaces affected>
|
||||
- **Duration**: <How long the outage lasted>
|
||||
- **Data loss**: <None/details>
|
||||
- **Monitoring gap**: <Any blind spots in alerting>
|
||||
|
||||
## Timeline (UTC)
|
||||
|
||||
| Time | Event |
|------|-------|
| **HH:MM** | <First sign of trouble> |
| **HH:MM** | <Detection / user report> |
| **HH:MM** | <Investigation begins> |
| **HH:MM** | <Root cause identified> |
| **HH:MM** | <Fix applied> |
| **HH:MM** | <Service restored> |
|
||||
|
||||
## Root Cause
|
||||
|
||||
<Narrative description of what went wrong and why.>
|
||||
|
||||
## Contributing Factors
|
||||
|
||||
1. <Factor that made the incident worse or harder to detect>
|
||||
2. <Factor...>
|
||||
|
||||
## Detection Gaps
|
||||
|
||||
| Gap | Impact | Fix |
|-----|--------|-----|
| <What wasn't monitored> | <How it delayed detection> | <What to add> |
|
||||
|
||||
## Prevention Plan
|
||||
|
||||
### P0 — Prevent this exact failure
|
||||
|
||||
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P0 | <action> | Config | <details> | TODO |
|
||||
|
||||
### P1 — Reduce blast radius
|
||||
|
||||
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P1 | <action> | Alert | <details> | TODO |
|
||||
|
||||
### P2 — Detect faster
|
||||
|
||||
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P2 | <action> | Monitor | <details> | TODO |
|
||||
|
||||
### P3 — Improve resilience
|
||||
|
||||
| Priority | Action | Type | Details | Status |
|----------|--------|------|---------|--------|
| P3 | <action> | Architecture | <details> | TODO |
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
1. <Key takeaway>
|
||||
2. <Key takeaway>
|
||||
|
||||
## Follow-up Implementation
|
||||
|
||||
_This section is auto-populated by the postmortem-todo-resolver agent._
|
||||
|
||||
| Date | Action | Priority | Type | Commit | Implemented By |
|------|--------|----------|------|--------|----------------|
|
||||
522
.claude/skills/setup-project/SKILL.md
Normal file
|
|
@ -0,0 +1,522 @@
|
|||
---
|
||||
name: setup-project
|
||||
description: |
|
||||
Deploy a new self-hosted service to the Kubernetes cluster from a GitHub repository.
|
||||
Use when: (1) User provides a GitHub URL or project name and wants to deploy it,
|
||||
(2) User says "deploy [service]" or "set up [service]",
|
||||
(3) User wants to add a new service to the cluster.
|
||||
Automated workflow: Docker image → Terraform module → Deploy.
|
||||
Handles database setup, ingress, DNS configuration.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2025-01-01
|
||||
---
|
||||
|
||||
# Setup Project Skill
|
||||
|
||||
**Purpose**: Deploy a new self-hosted service to the Kubernetes cluster from a GitHub repository.
|
||||
|
||||
**When to use**: User provides a GitHub URL or project name and wants to deploy it to the cluster.
|
||||
|
||||
## Workflow
|
||||
|
||||
### 1. Research Phase
|
||||
|
||||
**Input**: GitHub repository URL or project name
|
||||
|
||||
**Actions**:
|
||||
- Visit the GitHub repository
- Check the README for:
  - Official Docker image (Docker Hub, ghcr.io, etc.)
  - docker-compose.yml file
  - Self-hosting documentation
  - Required dependencies (PostgreSQL, MySQL, Redis, etc.)
  - Environment variables needed
  - Default ports
  - Storage requirements
|
||||
|
||||
**Find Docker Image Priority**:
|
||||
1. Check official documentation for recommended image
|
||||
2. Look in docker-compose.yml for `image:` directive
|
||||
3. Check GitHub Container Registry: `ghcr.io/<org>/<repo>`
|
||||
4. Check Docker Hub: `<org>/<repo>`
|
||||
5. Check releases page for container images
|
||||
6. Last resort: Build from Dockerfile (avoid if possible)
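
Priorities 3 and 4 in the list above can be turned into candidate references mechanically. The following is a minimal sketch, assuming the upstream repository is given as `owner/name`; the function name `candidate_images` is hypothetical and not part of this skill's scripts. Repository names in image references must be lowercase, so the sketch lowercases its input. It only prints candidates and does not check that any image actually exists.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print candidate image references for an upstream
# repository given as "owner/name", in the priority order listed above
# (GitHub Container Registry first, then Docker Hub).
candidate_images() {
  local repo
  # Image repository names must be lowercase.
  repo=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$repo" in
    */*) ;;
    *) echo "expected owner/name, got: $1" >&2; return 1 ;;
  esac
  printf 'ghcr.io/%s\n' "$repo"
  printf '%s\n' "$repo"
}

candidate_images "Owner/Repo"
```

Each printed candidate still has to be verified by hand (or with a pull) before it is used in the Terraform module.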
|
||||
|
||||
**Classify Dockerfile State** (drives whether we contribute a PR back upstream later):
|
||||
|
||||
| State | When | Action on deploy success |
|---|---|---|
| `image-used` | An official/community image worked (priority 1-5). | No upstream PR. Default case. |
| `used-as-is` | Upstream ships a Dockerfile; it built and ran fine. | No upstream PR. |
| `fixed-broken-upstream` | Upstream Dockerfile exists but fails to build / run; we patched it. | Open a `fix-dockerfile` PR after stability gate. |
| `written-from-scratch` | Upstream has no Dockerfile at all; we authored one. | Open an `add-dockerfile` PR after stability gate. |
|
||||
|
||||
Record the chosen state and supporting metadata in `modules/kubernetes/<service>/.contribution-state.json`. When we author or fix a Dockerfile, also write `modules/kubernetes/<service>/files/Dockerfile`, `.dockerignore`, and `BUILD.md` (from `templates/Dockerfile.README.md`) — these travel with the upstream PR.
|
||||
|
||||
```json
{
  "upstream_repo": "owner/name",
  "dockerfile_state": "written-from-scratch",
  "dockerfile_path_in_infra": "modules/kubernetes/<service>/files/Dockerfile",
  "deploy_target_url": "https://<service>.viktorbarzin.me",
  "image_tag": "registry.viktorbarzin.me/<service>:<sha>",
  "image_size": "<MB>",
  "base_image": "<e.g. python:3.12-slim>",
  "dockerfile_shape": "multi-stage, non-root, linux/amd64",
  "deploy_verified_at": null,
  "contribution_pr_url": null
}
```
|
||||
|
||||
**Dockerfile quality bar** (when writing one ourselves — enforced before PR):
|
||||
- Multi-stage build where it makes sense (Node, Go, Rust, Python with compiled deps).
|
||||
- Explicit non-root `USER`.
|
||||
- `HEALTHCHECK` when the app exposes a known endpoint.
|
||||
- Minimal base image (alpine / distroless preferred; `-slim` otherwise).
|
||||
- No secrets baked in; runtime config via `ENV`.
|
||||
- `.dockerignore` that excludes `.git`, `node_modules`, test artifacts.
|
||||
|
||||
**Extract Configuration**:
|
||||
- Container port (default port the app listens on)
|
||||
- Environment variables (DATABASE_URL, REDIS_HOST, SMTP, etc.)
|
||||
- Volume mounts (what data needs persistence)
|
||||
- Dependencies (database type, cache, etc.)
|
||||
|
||||
### 2. Database Setup (if needed)
|
||||
|
||||
**If project requires PostgreSQL**:
|
||||
- Ask the user for database credentials; the usual pattern is a `<service>` user with a secure password
|
||||
- Database will be created in shared `postgresql.dbaas.svc.cluster.local`
|
||||
- Connection string format: `postgresql://<user>:<password>@postgresql.dbaas.svc.cluster.local:5432/<dbname>`
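
Passwords that contain characters such as `@` or `/` break the connection string format above unless they are percent-encoded. The following is a minimal sketch of building the URL safely, assuming ASCII passwords; `urlencode` and `pg_url` are hypothetical helper names, not part of this skill's scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of the skill's scripts.

# Percent-encode a string (ASCII input assumed) so it is safe inside a URL.
urlencode() {
  local s="$1" out="" c rest
  while [ -n "$s" ]; do
    rest="${s#?}"        # everything after the first character
    c="${s%"$rest"}"     # the first character itself
    case "$c" in
      [a-zA-Z0-9.~_-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;
    esac
    s="$rest"
  done
  printf '%s' "$out"
}

# Build the PostgreSQL connection string for the shared dbaas instance.
pg_url() {
  local user="$1" password="$2" db="$3"
  printf 'postgresql://%s:%s@postgresql.dbaas.svc.cluster.local:5432/%s\n' \
    "$user" "$(urlencode "$password")" "$db"
}

pg_url myapp 'p@ss/word' myapp
```

The same encoding step applies to the MySQL format; only the scheme, host, and port differ.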
|
||||
|
||||
**If project requires MySQL**:
|
||||
- User provides database credentials
|
||||
- Database in shared `mysql.dbaas.svc.cluster.local`
|
||||
- Connection string format: `mysql://<user>:<password>@mysql.dbaas.svc.cluster.local:3306/<dbname>`
|
||||
|
||||
**If project requires Redis**:
|
||||
- Use shared Redis: `redis.redis.svc.cluster.local:6379`
|
||||
- No password required
|
||||
|
||||
**IMPORTANT**: Never create databases yourself - always ask user for credentials to use.
|
||||
|
||||
### 3. NFS Storage Setup (if service needs persistent data)
|
||||
|
||||
**IMPORTANT**: NFS directories must exist and be exported on the NFS server BEFORE deploying the service. If the directory doesn't exist, the pod will fail to mount the volume and get stuck in `ContainerCreating`.
|
||||
|
||||
**Steps**:
|
||||
|
||||
1. **Create the directory on the NFS server**:
|
||||
```bash
|
||||
ssh root@10.0.10.15 'mkdir -p /mnt/main/<service> && chmod 777 /mnt/main/<service>'
|
||||
```
|
||||
|
||||
2. **Export the directory via TrueNAS**:
|
||||
- The NFS export must be configured in TrueNAS so Kubernetes nodes can mount it
|
||||
- Create the export via TrueNAS WebUI or API, allowing access from the Kubernetes network (10.0.20.0/24)
|
||||
- Verify the export is accessible:
|
||||
```bash
|
||||
# From a k8s node or the dev VM
|
||||
showmount -e 10.0.10.15 | grep <service>
|
||||
```
|
||||
|
||||
3. **Verify the mount works before proceeding**:
|
||||
```bash
|
||||
# Quick test from a k8s node
|
||||
ssh root@10.0.20.100 'mount -t nfs 10.0.10.15:/mnt/main/<service> /tmp/test-mount && ls /tmp/test-mount && umount /tmp/test-mount'
|
||||
```
|
||||
|
||||
**Only proceed to Terraform module creation after confirming the NFS export is accessible.**
|
||||
|
||||
### 4. Terraform Module Creation
|
||||
|
||||
**Create module directory**:
|
||||
```bash
|
||||
mkdir -p modules/kubernetes/<service-name>/
|
||||
```
|
||||
|
||||
**Create `modules/kubernetes/<service-name>/main.tf`**:
|
||||
|
||||
```hcl
|
||||
variable "tls_secret_name" {}
|
||||
variable "tier" { type = string }
|
||||
variable "postgresql_password" {} # Only if needed
|
||||
# Add other variables as needed (smtp_password, api_keys, etc.)
|
||||
|
||||
resource "kubernetes_namespace" "<service>" {
|
||||
metadata {
|
||||
name = "<service>"
|
||||
}
|
||||
}
|
||||
|
||||
module "tls_secret" {
|
||||
source = "../setup_tls_secret"
|
||||
namespace = kubernetes_namespace.<service>.metadata[0].name
|
||||
tls_secret_name = var.tls_secret_name
|
||||
}
|
||||
|
||||
# If database migrations needed, add init_container
|
||||
resource "kubernetes_deployment" "<service>" {
|
||||
metadata {
|
||||
name = "<service>"
|
||||
namespace = kubernetes_namespace.<service>.metadata[0].name
|
||||
labels = {
|
||||
app = "<service>"
|
||||
tier = var.tier
|
||||
}
|
||||
}
|
||||
spec {
|
||||
replicas = 1
|
||||
selector {
|
||||
match_labels = {
|
||||
app = "<service>"
|
||||
}
|
||||
}
|
||||
template {
|
||||
metadata {
|
||||
labels = {
|
||||
app = "<service>"
|
||||
}
|
||||
}
|
||||
spec {
|
||||
# Init container for migrations (if needed)
|
||||
# init_container { ... }
|
||||
|
||||
container {
|
||||
name = "<service>"
|
||||
image = "<docker-image>:<tag>"
|
||||
|
||||
port {
|
||||
container_port = <port>
|
||||
}
|
||||
|
||||
# Environment variables
|
||||
env {
|
||||
name = "DATABASE_URL"
|
||||
value = "postgresql://<service>:${var.postgresql_password}@postgresql.dbaas.svc.cluster.local:5432/<service>"
|
||||
}
|
||||
# Add other env vars as needed
|
||||
|
||||
# Volume mounts for persistent data
|
||||
volume_mount {
|
||||
name = "data"
|
||||
mount_path = "<mount-path>"
|
||||
sub_path = "<optional-subpath>"
|
||||
}
|
||||
|
||||
resources {
|
||||
requests = {
|
||||
memory = "256Mi"
|
||||
cpu = "100m"
|
||||
}
|
||||
limits = {
|
||||
memory = "2Gi"
|
||||
cpu = "1"
|
||||
}
|
||||
}
|
||||
|
||||
# Health checks (if endpoints exist)
|
||||
liveness_probe {
|
||||
http_get {
|
||||
path = "/health" # or /healthz, /, etc.
|
||||
port = <port>
|
||||
}
|
||||
initial_delay_seconds = 60
|
||||
period_seconds = 30
|
||||
}
|
||||
}
|
||||
|
||||
# NFS volume for persistence
|
||||
volume {
|
||||
name = "data"
|
||||
nfs {
|
||||
server = "10.0.10.15"
|
||||
path = "/mnt/main/<service>"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_service" "<service>" {
|
||||
metadata {
|
||||
name = "<service>"
|
||||
namespace = kubernetes_namespace.<service>.metadata[0].name
|
||||
labels = {
|
||||
app = "<service>"
|
||||
}
|
||||
}
|
||||
|
||||
spec {
|
||||
selector = {
|
||||
app = "<service>"
|
||||
}
|
||||
port {
|
||||
name = "http"
|
||||
port = 80
|
||||
target_port = <container-port>
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
module "ingress" {
|
||||
source = "../ingress_factory"
|
||||
namespace = kubernetes_namespace.<service>.metadata[0].name
|
||||
name = "<service>"
|
||||
tls_secret_name = var.tls_secret_name
|
||||
# Add extra_annotations if needed (proxy-body-size, timeouts, etc.)
|
||||
}
|
||||
```
|
||||
|
||||
### 5. Update Main Terraform Files
|
||||
|
||||
**Add to `modules/kubernetes/main.tf`**:
|
||||
|
||||
1. Add variable declarations at top:
|
||||
```hcl
|
||||
variable "<service>_postgresql_password" { type = string }
|
||||
```
|
||||
|
||||
2. Add to appropriate DEFCON level (ask user which level, default to 5):
|
||||
```hcl
|
||||
5 : [
|
||||
...,
|
||||
"<service>"
|
||||
]
|
||||
```
|
||||
|
||||
3. Add module block at bottom:
|
||||
```hcl
|
||||
module "<service>" {
|
||||
source = "./<service>"
|
||||
for_each = contains(local.active_modules, "<service>") ? { <service> = true } : {}
|
||||
tls_secret_name = var.tls_secret_name
|
||||
postgresql_password = var.<service>_postgresql_password
|
||||
tier = local.tiers.aux # or appropriate tier
|
||||
|
||||
depends_on = [null_resource.core_services]
|
||||
}
|
||||
```
|
||||
|
||||
**Add to `main.tf`**:
|
||||
|
||||
1. Add variable:
|
||||
```hcl
|
||||
variable "<service>_postgresql_password" { type = string }
|
||||
```
|
||||
|
||||
2. Pass to kubernetes_cluster module:
|
||||
```hcl
|
||||
module "kubernetes_cluster" {
|
||||
...
|
||||
<service>_postgresql_password = var.<service>_postgresql_password
|
||||
}
|
||||
```
|
||||
|
||||
**Update `terraform.tfvars`**:
|
||||
|
||||
1. Add password/credentials:
|
||||
```hcl
|
||||
<service>_postgresql_password = "<secure-password>"
|
||||
```
|
||||
|
||||
2. Add to Cloudflare DNS (ask user if proxied or non-proxied):
|
||||
```hcl
|
||||
cloudflare_non_proxied_names = [
|
||||
...,
|
||||
"<service>"
|
||||
]
|
||||
```
|
||||
|
||||
### 6. Email/SMTP Configuration (if needed)
|
||||
|
||||
If service needs to send emails:
|
||||
```hcl
|
||||
env {
|
||||
name = "MAILER_HOST"
|
||||
value = "mailserver.viktorbarzin.me" # Public hostname for TLS
|
||||
}
|
||||
env {
|
||||
name = "MAILER_PORT"
|
||||
value = "587"
|
||||
}
|
||||
env {
|
||||
name = "MAILER_USER"
|
||||
value = "info@viktorbarzin.me"
|
||||
}
|
||||
env {
|
||||
name = "MAILER_PASSWORD"
|
||||
value = var.smtp_password # Declared as a module variable; set by the module call
|
||||
}
|
||||
```
|
||||
|
||||
Add to module call:
|
||||
```hcl
|
||||
smtp_password = var.mailserver_accounts["info@viktorbarzin.me"]
|
||||
```
|
||||
|
||||
### 7. Apply Terraform
|
||||
|
||||
```bash
|
||||
terraform init
|
||||
terraform apply -target=module.kubernetes_cluster.module.<service> -var="kube_config_path=$(pwd)/config" -auto-approve
|
||||
```
|
||||
|
||||
**IMPORTANT: Also apply the cloudflared module to create the Cloudflare DNS record:**
|
||||
```bash
|
||||
terraform apply -target=module.kubernetes_cluster.module.cloudflared -var="kube_config_path=$(pwd)/config" -auto-approve
|
||||
```
|
||||
Without this step, the DNS record won't be created even though it's defined in `terraform.tfvars`.
|
||||
|
||||
### 8. Verification
|
||||
|
||||
```bash
|
||||
kubectl get pods -n <service>
|
||||
kubectl logs -n <service> -l app=<service> --tail=50
|
||||
```
|
||||
|
||||
Test URL: `https://<service>.viktorbarzin.me`
|
||||
|
||||
### 8b. Stability Gate (required when `dockerfile_state ∈ {written-from-scratch, fixed-broken-upstream}`)
|
||||
|
||||
Before committing — and before any upstream PR in §10 — run a 10-minute stability check to catch pods that crash-loop a few minutes after Ready.
|
||||
|
||||
```bash
|
||||
.claude/skills/setup-project/scripts/stability-gate.sh <service> <service> https://<service>.viktorbarzin.me
|
||||
```
|
||||
|
||||
Polls pod readiness + `curl` 200 every 30s × 20 iterations. Requires 18/20 successes (tolerates 2 blips).
|
||||
|
||||
- **Pass** → update the state file (run from `modules/kubernetes/<service>/`; `sponge` comes from moreutils): `jq '.deploy_verified_at = (now | todate)' .contribution-state.json | sponge .contribution-state.json` → proceed to §9 and §10.
|
||||
- **Fail** → stop. Investigate via `kubectl logs`, `kubectl describe`. Do NOT commit. Do NOT fire §10. Re-run the gate after fixes.
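
The pass/fail rule itself can be looked at separately from `kubectl` and `curl`. The following is a minimal sketch, assuming probe outcomes have already been recorded as `ok` or `fail`; `gate_verdict` is a hypothetical name, and the real gate script tallies results inline rather than through a helper like this.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide the gate outcome from recorded probe results.
# Each argument is "ok" or "fail"; the threshold mirrors the 18/20 rule above.
gate_verdict() {
  local min_successes=18 ok_count=0 result
  for result in "$@"; do
    if [ "$result" = "ok" ]; then
      ok_count=$((ok_count + 1))
    fi
  done
  if [ "$ok_count" -ge "$min_successes" ]; then
    echo "stable ($ok_count/$# ok)"
  else
    echo "unstable ($ok_count/$# ok)"
    return 1
  fi
}

# 18 successes and 2 blips still pass.
gate_verdict ok ok ok ok ok ok ok ok ok ok ok ok ok ok ok ok ok ok fail fail
```

A third failed probe drops the count to 17 and the function returns non-zero, which corresponds to the **Fail** branch.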
|
||||
|
||||
For `image-used` / `used-as-is` states, the gate is optional (app is already running a known-good image).
|
||||
|
||||
### 9. Commit Changes
|
||||
|
||||
```bash
|
||||
git add modules/kubernetes/<service>/ main.tf modules/kubernetes/main.tf terraform.tfvars
|
||||
git commit -m "Add <service> deployment
|
||||
|
||||
- Deploy <service> as <description>
|
||||
- Uses <dependencies>
|
||||
- Ingress at <service>.viktorbarzin.me
|
||||
|
||||
[ci skip]"
|
||||
```
|
||||
|
||||
### 10. Contribute Dockerfile Upstream (only when `dockerfile_state ∈ {written-from-scratch, fixed-broken-upstream}`)
|
||||
|
||||
Goal: give the community the working Dockerfile we just validated in production.
|
||||
|
||||
**Preconditions** (script enforces):
|
||||
- `.contribution-state.json` present with a trigger state and `deploy_verified_at` set.
|
||||
- `files/Dockerfile`, `files/.dockerignore`, `files/BUILD.md` exist next to the module.
|
||||
- `GITHUB_TOKEN` in env — or `vault kv get -field=github_pat secret/viktor` is reachable.
|
||||
|
||||
**Run**:
|
||||
```bash
|
||||
.claude/skills/setup-project/scripts/contribute-dockerfile.sh modules/kubernetes/<service>
|
||||
```
|
||||
|
||||
**What the script does** (all via GitHub REST — `gh` CLI is sandbox-blocked):
|
||||
1. Reads `.contribution-state.json`; skips unless state is `written-from-scratch` or `fixed-broken-upstream` and no `contribution_pr_url` is already recorded.
|
||||
2. Upstream sanity checks: repo exists, public, not archived; default branch discoverable; for `written-from-scratch`, verifies a `Dockerfile` didn't land upstream while we were deploying; bails cleanly if an open PR from our fork already exists.
|
||||
3. `POST /repos/<owner>/<name>/forks` — idempotent; waits up to 30s for the fork to be ready at `ViktorBarzin/<name>`.
|
||||
4. `POST /repos/ViktorBarzin/<name>/merge-upstream` — keeps fork current with upstream default branch.
|
||||
5. Creates branch `add-dockerfile` (or `fix-dockerfile`), timestamp-suffixed if that branch already exists with unrelated commits.
|
||||
6. Commits `Dockerfile`, `.dockerignore`, `BUILD.md` via Contents API. Each commit message carries `Signed-off-by:` for DCO-enforcing repos.
|
||||
7. Opens PR against upstream with body rendered from `templates/PR_BODY.md`.
|
||||
8. Writes `contribution_pr_url` back into `.contribution-state.json` and echoes the URL.
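
Step 5 is the only step whose result depends on existing state in the fork. The following is a minimal sketch of the naming rule, assuming the caller already knows whether the plain branch name is taken; `branch_for_state` is a hypothetical name and the `yes`/`no` flag is an assumption made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: pick the contribution branch name from dockerfile_state,
# adding a timestamp suffix when the plain name is already taken.
branch_for_state() {
  local state="$1" taken="${2:-no}" name
  case "$state" in
    written-from-scratch)  name="add-dockerfile" ;;
    fixed-broken-upstream) name="fix-dockerfile" ;;
    *) return 1 ;;   # image-used / used-as-is: nothing to contribute
  esac
  if [ "$taken" = "yes" ]; then
    name="$name-$(date +%s)"
  fi
  printf '%s\n' "$name"
}

branch_for_state written-from-scratch
branch_for_state fixed-broken-upstream yes
```

States that do not trigger a contribution return non-zero, which a caller can treat as a skip rather than an error.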
|
||||
|
||||
**Failure handling**:
|
||||
- Upstream archived / private / deleted → logged as SKIP, deploy success stands.
|
||||
- Fork/branch/PR already exists → treated as idempotent success; existing URL recorded.
|
||||
- GitHub 5xx → 3× exponential backoff, then hard fail with a clear message — safe to re-run the script.
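
The retry behaviour for transient errors follows a common pattern that can be sketched on its own. This is a minimal sketch, not the script's actual implementation: `retry_backoff` is a hypothetical name, and `SLEEP_CMD` is an assumption added so the wait can be stubbed out when trying the example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command up to $1 times, doubling the wait
# between attempts (2s, 4s, ...). No wait happens after the final failure.
retry_backoff() {
  local max_attempts="$1" attempt=1 delay=2
  shift
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    ${SLEEP_CMD:-sleep} "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Example: a command that fails twice, then succeeds on the third call.
calls=0
flaky() { calls=$((calls + 1)); [ "$calls" -ge 3 ]; }
SLEEP_CMD=true retry_backoff 3 flaky && echo "succeeded after $calls calls"
```

Diagnostics go to stderr so that a caller capturing the command's stdout does not get log lines mixed into the captured value.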
|
||||
|
||||
**After the PR opens**: the URL is in `.contribution-state.json`. Share it with the user. No automated follow-up on merge/reject — that's a manual check for now.
|
||||
|
||||
## Common Patterns
|
||||
|
||||
### Init Container for Migrations
|
||||
```hcl
|
||||
init_container {
|
||||
name = "migration"
|
||||
image = "<same-image>"
|
||||
command = ["sh", "-c", "<migration-command>"]
|
||||
|
||||
# Same env vars and volumes as main container
|
||||
}
|
||||
```
|
||||
|
||||
### Dynamic Environment Variables
|
||||
```hcl
|
||||
locals {
|
||||
common_env = [
|
||||
{ name = "VAR1", value = "value1" },
|
||||
{ name = "VAR2", value = "value2" },
|
||||
]
|
||||
}
|
||||
|
||||
dynamic "env" {
|
||||
for_each = local.common_env
|
||||
content {
|
||||
name = env.value.name
|
||||
value = env.value.value
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### External URL Configuration
|
||||
Many apps need their public URL configured:
|
||||
```hcl
|
||||
env {
|
||||
name = "APP_URL" # or PUBLIC_URL, EXTERNAL_URL, etc.
|
||||
value = "https://<service>.viktorbarzin.me"
|
||||
}
|
||||
env {
|
||||
name = "HTTPS" # or ENABLE_HTTPS, etc.
|
||||
value = "true"
|
||||
}
|
||||
```
|
||||
|
||||
## Checklist
|
||||
|
||||
- [ ] Find official Docker image or docker-compose
|
||||
- [ ] Identify dependencies (DB, Redis, etc.)
|
||||
- [ ] Ask user for database credentials (never create yourself)
|
||||
- [ ] Create NFS directory and export on TrueNAS (if persistent storage needed)
|
||||
- [ ] Verify NFS mount is accessible from k8s nodes
|
||||
- [ ] Create `modules/kubernetes/<service>/main.tf`
|
||||
- [ ] Classify `dockerfile_state` and write `.contribution-state.json`
|
||||
- [ ] If writing/fixing Dockerfile: satisfy the quality bar (multi-stage, non-root, `.dockerignore`, `BUILD.md`)
|
||||
- [ ] Update `modules/kubernetes/main.tf` (variables, DEFCON level, module block)
|
||||
- [ ] Update `main.tf` (variable, pass to module)
|
||||
- [ ] Update `terraform.tfvars` (password, Cloudflare DNS)
|
||||
- [ ] Run `terraform init` and `terraform apply`
|
||||
- [ ] Verify pods are running
|
||||
- [ ] Test the URL
|
||||
- [ ] Run stability-gate.sh — needed for contribution, optional otherwise
|
||||
- [ ] Commit changes with `[ci skip]`
|
||||
- [ ] Run contribute-dockerfile.sh if state triggers an upstream PR
|
||||
|
||||
## Questions to Ask User
|
||||
|
||||
1. What DEFCON level should this service be in? (Default: 5)
|
||||
2. Should Cloudflare proxy this domain? (Default: no, add to non_proxied_names)
|
||||
3. Does this need email/SMTP? (Configure if yes)
|
||||
4. What database credentials should I use? (Never create yourself)
|
||||
5. What tier? (core/cluster/gpu/edge/aux - default: aux)
|
||||
|
||||
## Notes
|
||||
|
||||
- **Always create NFS directories and exports BEFORE deploying** - pods will get stuck in `ContainerCreating` if the NFS path doesn't exist or isn't exported
|
||||
- **Always use official documentation** as the source of truth
|
||||
- **Prefer stable/latest tags** over specific versions for self-hosted
|
||||
- **Use shared infrastructure**: PostgreSQL at `postgresql.dbaas.svc.cluster.local`, Redis at `redis.redis.svc.cluster.local`
|
||||
- **NFS storage**: Always at `10.0.10.15:/mnt/main/<service>`
|
||||
- **Email**: Use `mailserver.viktorbarzin.me` (public hostname) not internal service name
|
||||
- **Resource limits**: Start conservative, can increase if needed
|
||||
- **Health checks**: Only add if the app has health endpoints
|
||||
270
.claude/skills/setup-project/scripts/contribute-dockerfile.sh
Executable file
|
|
@ -0,0 +1,270 @@
|
|||
#!/usr/bin/env bash
|
||||
# Contribute a working Dockerfile back to an upstream GitHub repo.
|
||||
#
|
||||
# Reads state from <service-module-dir>/.contribution-state.json and:
|
||||
# 1. Validates triggers (dockerfile_state ∈ {written-from-scratch, fixed-broken-upstream})
|
||||
# 2. Confirms upstream is public, not archived, no concurrent Dockerfile landed
|
||||
# 3. Forks upstream to ViktorBarzin (idempotent)
|
||||
# 4. Syncs fork with upstream default branch
|
||||
# 5. Creates branch (add-dockerfile or fix-dockerfile), appends -<ts> on collision
|
||||
# 6. Commits Dockerfile + .dockerignore + BUILD.md via Contents API
|
||||
# 7. Opens PR against upstream with body rendered from PR_BODY.md
|
||||
# 8. Writes contribution_pr_url back into state file
|
||||
#
|
||||
# Usage:
|
||||
# contribute-dockerfile.sh <service-module-dir>
|
||||
#
|
||||
# Example:
|
||||
# contribute-dockerfile.sh /home/wizard/code/infra/modules/kubernetes/myapp
|
||||
#
|
||||
# Requires: jq, curl, vault CLI (logged in).
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
TEMPLATES_DIR="$(cd "$SCRIPT_DIR/../templates" && pwd)"
|
||||
|
||||
FORK_OWNER="ViktorBarzin"
|
||||
|
||||
log() { echo "contribute-dockerfile: $*"; }
|
||||
die() { echo "contribute-dockerfile: ERROR: $*" >&2; exit 1; }
|
||||
skip() { echo "contribute-dockerfile: SKIP: $*"; exit 0; }
|
||||
|
||||
if [ "$#" -ne 1 ]; then
|
||||
die "usage: $0 <service-module-dir>"
|
||||
fi
|
||||
|
||||
MODULE_DIR="$1"
|
||||
STATE_FILE="$MODULE_DIR/.contribution-state.json"
|
||||
|
||||
[ -f "$STATE_FILE" ] || die "state file not found: $STATE_FILE"
|
||||
|
||||
# --- Read + validate state ---
|
||||
dockerfile_state=$(jq -r '.dockerfile_state // ""' "$STATE_FILE")
|
||||
upstream_repo=$(jq -r '.upstream_repo // ""' "$STATE_FILE")
|
||||
dockerfile_path=$(jq -r '.dockerfile_path_in_infra // ""' "$STATE_FILE")
|
||||
deploy_verified_at=$(jq -r '.deploy_verified_at // ""' "$STATE_FILE")
|
||||
existing_pr_url=$(jq -r '.contribution_pr_url // ""' "$STATE_FILE")
|
||||
|
||||
if [ -n "$existing_pr_url" ] && [ "$existing_pr_url" != "null" ]; then
|
||||
skip "PR already exists: $existing_pr_url"
|
||||
fi
|
||||
|
||||
case "$dockerfile_state" in
|
||||
written-from-scratch) BRANCH_NAME="add-dockerfile"; reason_type="none" ;;
|
||||
fixed-broken-upstream) BRANCH_NAME="fix-dockerfile"; reason_type="broken" ;;
|
||||
*) skip "dockerfile_state='$dockerfile_state' — nothing to contribute" ;;
|
||||
esac
|
||||
|
||||
if [ -z "$deploy_verified_at" ] || [ "$deploy_verified_at" = "null" ]; then
  die "deploy not verified yet (deploy_verified_at empty); run stability-gate first"
fi
|
||||
|
||||
[ -z "$upstream_repo" ] && die "upstream_repo empty in state file"
|
||||
[[ "$upstream_repo" == */* ]] || die "upstream_repo must be owner/name, got: $upstream_repo"
|
||||
|
||||
UP_OWNER="${upstream_repo%/*}"
|
||||
UP_NAME="${upstream_repo#*/}"
|
||||
|
||||
abs_dockerfile="$MODULE_DIR/$(basename "$dockerfile_path")"
|
||||
if [ ! -f "$MODULE_DIR/files/Dockerfile" ]; then
|
||||
die "Dockerfile not found at $MODULE_DIR/files/Dockerfile"
|
||||
fi
|
||||
DOCKERFILE_SRC="$MODULE_DIR/files/Dockerfile"
|
||||
DOCKERIGNORE_SRC="$MODULE_DIR/files/.dockerignore"
|
||||
BUILDMD_SRC="$MODULE_DIR/files/BUILD.md"
|
||||
for f in "$DOCKERIGNORE_SRC" "$BUILDMD_SRC"; do
|
||||
[ -f "$f" ] || die "required file missing: $f"
|
||||
done
|
||||
|
||||
# --- GitHub auth ---
|
||||
GITHUB_TOKEN="${GITHUB_TOKEN:-$(vault kv get -field=github_pat secret/viktor 2>/dev/null || true)}"
|
||||
[ -n "$GITHUB_TOKEN" ] || die "GITHUB_TOKEN not set and vault lookup failed (vault login -method=oidc first)"
|
||||
|
||||
gh_api() {
|
||||
local method="$1"; local path="$2"; local data="${3:-}"
|
||||
local url="https://api.github.com${path}"
|
||||
local curl_args=(-sS -w "\n%{http_code}" -X "$method"
|
||||
-H "Authorization: token $GITHUB_TOKEN"
|
||||
-H "Accept: application/vnd.github+json"
|
||||
-H "X-GitHub-Api-Version: 2022-11-28")
|
||||
[ -n "$data" ] && curl_args+=(-d "$data")
|
||||
curl "${curl_args[@]}" "$url"
|
||||
}
|
||||
|
||||
gh_api_retry() {
  local method="$1"; local path="$2"; local data="${3:-}"
  local attempt=1
  local max_attempts=3
  local out http body
  while [ "$attempt" -le "$max_attempts" ]; do
    out=$(gh_api "$method" "$path" "$data")
    http=$(printf '%s' "$out" | tail -n1)
    body=$(printf '%s' "$out" | sed '$d')
    if [ "$http" -ge 500 ] || [ "$http" = "000" ]; then
      # Log to stderr: callers capture stdout, so a log line there would
      # be mixed into the response body.
      log "retry $attempt/$max_attempts on $method $path (http=$http)" >&2
      # Back off only when another attempt will follow.
      if [ "$attempt" -lt "$max_attempts" ]; then
        sleep $((2 ** attempt))
      fi
      attempt=$((attempt + 1))
      continue
    fi
    printf '%s\n%s' "$body" "$http"
    return 0
  done
  die "GitHub API 5xx after $max_attempts attempts on $method $path"
}
|
||||
|
||||
# Helpers that parse the combined body+http form.
|
||||
gh_http() { printf '%s' "$1" | tail -n1; }
|
||||
gh_body() { printf '%s' "$1" | sed '$d'; }
|
||||
|
||||
# --- Upstream sanity checks ---
|
||||
log "checking upstream $upstream_repo"
|
||||
resp=$(gh_api_retry GET "/repos/$UP_OWNER/$UP_NAME")
|
||||
http=$(gh_http "$resp"); body=$(gh_body "$resp")
|
||||
if [ "$http" = "404" ]; then skip "upstream repo not found (may be private or deleted): $upstream_repo"; fi
|
||||
[ "$http" = "200" ] || die "GET upstream failed http=$http body=$body"
|
||||
|
||||
archived=$(printf '%s' "$body" | jq -r '.archived')
|
||||
default_branch=$(printf '%s' "$body" | jq -r '.default_branch // ""')
|
||||
[ "$archived" = "true" ] && skip "upstream is archived — not opening PR"
|
||||
[ -n "$default_branch" ] || die "could not determine upstream default branch"
|
||||
log "upstream default branch: $default_branch"
|
||||
|
||||
# If we wrote the Dockerfile from scratch, make sure one didn't land upstream meanwhile.
|
||||
if [ "$dockerfile_state" = "written-from-scratch" ]; then
|
||||
resp=$(gh_api_retry GET "/repos/$UP_OWNER/$UP_NAME/contents/Dockerfile?ref=$default_branch")
|
||||
http=$(gh_http "$resp")
|
||||
if [ "$http" = "200" ]; then
|
||||
skip "a Dockerfile landed upstream since we started — aborting to avoid clobbering"
|
||||
fi
|
||||
fi
|
||||
|
||||
# Check for an existing open PR from our fork.
|
||||
resp=$(gh_api_retry GET "/repos/$UP_OWNER/$UP_NAME/pulls?state=open&head=${FORK_OWNER}:${BRANCH_NAME}")
|
||||
http=$(gh_http "$resp"); body=$(gh_body "$resp")
|
||||
if [ "$http" = "200" ]; then
|
||||
existing=$(printf '%s' "$body" | jq -r '.[0].html_url // ""')
|
||||
if [ -n "$existing" ]; then
|
||||
log "existing open PR found: $existing — recording and skipping"
|
||||
jq --arg url "$existing" '.contribution_pr_url = $url' "$STATE_FILE" > "$STATE_FILE.tmp" && mv "$STATE_FILE.tmp" "$STATE_FILE"
|
||||
exit 0
|
||||
fi
|
||||
fi
|
||||
|
||||
# --- Fork ---
|
||||
log "ensuring fork exists at $FORK_OWNER/$UP_NAME"
|
||||
resp=$(gh_api_retry POST "/repos/$UP_OWNER/$UP_NAME/forks" '{}')
|
||||
http=$(gh_http "$resp")
|
||||
if [ "$http" != "202" ] && [ "$http" != "200" ]; then
|
||||
die "fork call failed http=$http"
|
||||
fi
|
||||
|
||||
# Wait for fork to be ready (GitHub can take up to ~30s).
|
||||
for i in $(seq 1 15); do
|
||||
resp=$(gh_api_retry GET "/repos/$FORK_OWNER/$UP_NAME")
|
||||
if [ "$(gh_http "$resp")" = "200" ]; then break; fi
|
||||
sleep 2
|
||||
done
|
||||
[ "$(gh_http "$resp")" = "200" ] || die "fork $FORK_OWNER/$UP_NAME did not become ready"
|
||||
|
||||
# --- Sync fork with upstream default branch ---
|
||||
log "syncing fork with upstream/$default_branch"
|
||||
resp=$(gh_api_retry POST "/repos/$FORK_OWNER/$UP_NAME/merge-upstream" "$(jq -n --arg b "$default_branch" '{branch:$b}')")
|
||||
http=$(gh_http "$resp")
|
||||
[ "$http" = "200" ] || [ "$http" = "409" ] || log "merge-upstream returned http=$http (continuing)"
|
||||
|
||||
# --- Determine base SHA for new branch ---
|
||||
resp=$(gh_api_retry GET "/repos/$FORK_OWNER/$UP_NAME/git/ref/heads/$default_branch")
|
||||
http=$(gh_http "$resp"); body=$(gh_body "$resp")
|
||||
[ "$http" = "200" ] || die "could not read default branch ref on fork (http=$http)"
|
||||
base_sha=$(printf '%s' "$body" | jq -r '.object.sha')
|
||||
|
||||
# --- Create branch (or append timestamp on collision) ---
|
||||
attempt_branch="$BRANCH_NAME"
|
||||
resp=$(gh_api_retry GET "/repos/$FORK_OWNER/$UP_NAME/git/ref/heads/$attempt_branch")
|
||||
if [ "$(gh_http "$resp")" = "200" ]; then
|
||||
attempt_branch="${BRANCH_NAME}-$(date +%s | tail -c 9)"
|
||||
log "branch existed; using $attempt_branch"
|
||||
fi
|
||||
|
||||
log "creating branch $attempt_branch off $base_sha"
|
||||
payload=$(jq -n --arg r "refs/heads/$attempt_branch" --arg s "$base_sha" '{ref:$r,sha:$s}')
|
||||
resp=$(gh_api_retry POST "/repos/$FORK_OWNER/$UP_NAME/git/refs" "$payload")
|
||||
[ "$(gh_http "$resp")" = "201" ] || die "could not create branch: $(gh_body "$resp")"
|
||||
|
||||
# --- Helper to PUT a file via Contents API ---
|
||||
put_file() {
|
||||
local src="$1"; local dst="$2"; local message="$3"
|
||||
local b64 payload exists_resp http existing_sha=""
|
||||
b64=$(base64 -w0 < "$src")
|
||||
|
||||
exists_resp=$(gh_api_retry GET "/repos/$FORK_OWNER/$UP_NAME/contents/$dst?ref=$attempt_branch")
|
||||
if [ "$(gh_http "$exists_resp")" = "200" ]; then
|
||||
existing_sha=$(gh_body "$exists_resp" | jq -r '.sha')
|
||||
fi
|
||||
|
||||
if [ -n "$existing_sha" ]; then
|
||||
payload=$(jq -n --arg m "$message" --arg c "$b64" --arg b "$attempt_branch" --arg sha "$existing_sha" \
|
||||
'{message:$m, content:$c, branch:$b, sha:$sha}')
|
||||
else
|
||||
payload=$(jq -n --arg m "$message" --arg c "$b64" --arg b "$attempt_branch" \
|
||||
'{message:$m, content:$c, branch:$b}')
|
||||
fi
|
||||
|
||||
resp=$(gh_api_retry PUT "/repos/$FORK_OWNER/$UP_NAME/contents/$dst" "$payload")
|
||||
http=$(gh_http "$resp")
|
||||
[ "$http" = "200" ] || [ "$http" = "201" ] || die "PUT $dst failed http=$http body=$(gh_body "$resp")"
|
||||
}
|
||||
|
||||
commit_msg_prefix="Add Dockerfile"
|
||||
[ "$dockerfile_state" = "fixed-broken-upstream" ] && commit_msg_prefix="Fix Dockerfile"
|
||||
|
||||
log "committing Dockerfile, .dockerignore, BUILD.md"
|
||||
put_file "$DOCKERFILE_SRC" "Dockerfile" "$commit_msg_prefix
|
||||
|
||||
Signed-off-by: Viktor Barzin <viktorbarzin@meta.com>"
|
||||
put_file "$DOCKERIGNORE_SRC" ".dockerignore" "Add .dockerignore
|
||||
|
||||
Signed-off-by: Viktor Barzin <viktorbarzin@meta.com>"
|
||||
put_file "$BUILDMD_SRC" "BUILD.md" "Add BUILD.md
|
||||
|
||||
Signed-off-by: Viktor Barzin <viktorbarzin@meta.com>"
|
||||
|
||||
# --- Render PR body ---
|
||||
reason_paragraph="This project currently has no Dockerfile, making it harder for the self-hosting community to run this. I put together a working one while deploying this app to my home Kubernetes cluster and wanted to upstream it."
|
||||
if [ "$reason_type" = "broken" ]; then
|
||||
reason_paragraph="The existing Dockerfile in this repo does not build cleanly for \`linux/amd64\`. I tracked down the fixes while deploying this app to my home Kubernetes cluster and wanted to upstream them."
|
||||
fi
|
||||
|
||||
IMAGE_SIZE=$(jq -r '.image_size // "unknown"' "$STATE_FILE")
|
||||
BASE_IMAGE=$(jq -r '.base_image // "unknown"' "$STATE_FILE")
|
||||
IMAGE_TAG=$(jq -r '.image_tag // "myapp:latest"' "$STATE_FILE")
|
||||
DOCKERFILE_SHAPE=$(jq -r '.dockerfile_shape // "multi-stage, non-root, linux/amd64"' "$STATE_FILE")
|
||||
|
||||
pr_body=$(cat "$TEMPLATES_DIR/PR_BODY.md")
|
||||
pr_body="${pr_body//\{\{REASON_PARAGRAPH\}\}/$reason_paragraph}"
|
||||
pr_body="${pr_body//\{\{DOCKERFILE_SHAPE\}\}/$DOCKERFILE_SHAPE}"
|
||||
pr_body="${pr_body//\{\{IMAGE_SIZE\}\}/$IMAGE_SIZE}"
|
||||
pr_body="${pr_body//\{\{BASE_IMAGE\}\}/$BASE_IMAGE}"
|
||||
pr_body="${pr_body//\{\{IMAGE_TAG\}\}/$IMAGE_TAG}"
|
||||
|
||||
pr_title="$commit_msg_prefix"
|
||||
|
||||
# --- Open PR ---
|
||||
log "opening PR against $UP_OWNER/$UP_NAME:$default_branch"
|
||||
payload=$(jq -n \
|
||||
--arg t "$pr_title" \
|
||||
--arg h "${FORK_OWNER}:${attempt_branch}" \
|
||||
--arg b "$default_branch" \
|
||||
--arg body "$pr_body" \
|
||||
'{title:$t, head:$h, base:$b, body:$body, maintainer_can_modify:true}')
|
||||
resp=$(gh_api_retry POST "/repos/$UP_OWNER/$UP_NAME/pulls" "$payload")
|
||||
http=$(gh_http "$resp"); body=$(gh_body "$resp")
|
||||
if [ "$http" != "201" ]; then
|
||||
die "PR creation failed http=$http body=$body"
|
||||
fi
|
||||
|
||||
pr_url=$(printf '%s' "$body" | jq -r '.html_url')
|
||||
log "PR opened: $pr_url"
|
||||
|
||||
# --- Record PR URL in state file ---
|
||||
jq --arg url "$pr_url" '.contribution_pr_url = $url' "$STATE_FILE" > "$STATE_FILE.tmp" && mv "$STATE_FILE.tmp" "$STATE_FILE"
|
||||
log "state file updated with PR URL"
|
||||
71
.claude/skills/setup-project/scripts/stability-gate.sh
Executable file
|
|
@ -0,0 +1,71 @@
|
|||
#!/usr/bin/env bash
|
||||
# 10-minute deploy stability gate for setup-project skill.
|
||||
# Polls pod readiness + HTTP 200 on target URL every 30s for 20 iterations.
|
||||
# Requires 18/20 probes to succeed (tolerates 2 blips for restarts/DNS propagation).
|
||||
#
|
||||
# Usage:
|
||||
# stability-gate.sh <namespace> <app-label> <url>
|
||||
#
|
||||
# Example:
|
||||
# stability-gate.sh myapp myapp https://myapp.viktorbarzin.me
|
||||
#
|
||||
# Exit codes:
|
||||
# 0 - Stable (>=18/20 probes OK)
|
||||
# 1 - Unstable (<18/20 probes OK)
|
||||
# 2 - Usage error
|
||||
|
||||
set -u
|
||||
|
||||
if [ "$#" -ne 3 ]; then
|
||||
echo "Usage: $0 <namespace> <app-label> <url>" >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
NS="$1"
|
||||
APP="$2"
|
||||
URL="$3"
|
||||
|
||||
TOTAL_PROBES=20
|
||||
MIN_SUCCESSES=18
|
||||
INTERVAL_SECONDS=30
|
||||
|
||||
ok_count=0
|
||||
fail_count=0
|
||||
|
||||
echo "stability-gate: ns=$NS app=$APP url=$URL"
|
||||
echo "stability-gate: $TOTAL_PROBES probes x ${INTERVAL_SECONDS}s (need $MIN_SUCCESSES/$TOTAL_PROBES)"
|
||||
|
||||
for i in $(seq 1 "$TOTAL_PROBES"); do
|
||||
probe_ok=true
|
||||
|
||||
if ! kubectl wait --for=condition=Ready pod -l "app=$APP" -n "$NS" --timeout=25s >/dev/null 2>&1; then
|
||||
probe_ok=false
|
||||
fi
|
||||
|
||||
status=$(curl -sS -o /dev/null -w "%{http_code}" --max-time 10 "$URL" || echo "000")
|
||||
if [ "$status" != "200" ]; then
|
||||
probe_ok=false
|
||||
fi
|
||||
|
||||
if [ "$probe_ok" = "true" ]; then
|
||||
ok_count=$((ok_count + 1))
|
||||
printf " probe %2d/%d: OK (http=%s)\n" "$i" "$TOTAL_PROBES" "$status"
|
||||
else
|
||||
fail_count=$((fail_count + 1))
|
||||
printf " probe %2d/%d: FAIL (http=%s)\n" "$i" "$TOTAL_PROBES" "$status"
|
||||
fi
|
||||
|
||||
if [ "$i" -lt "$TOTAL_PROBES" ]; then
|
||||
sleep "$INTERVAL_SECONDS"
|
||||
fi
|
||||
done
|
||||
|
||||
echo "stability-gate: results ok=$ok_count fail=$fail_count"
|
||||
|
||||
if [ "$ok_count" -ge "$MIN_SUCCESSES" ]; then
|
||||
echo "stability-gate: PASS"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
echo "stability-gate: FAIL (need $MIN_SUCCESSES, got $ok_count)" >&2
|
||||
exit 1
|
||||
24
.claude/skills/setup-project/templates/Dockerfile.README.md
Normal file
|
|
@ -0,0 +1,24 @@
|
|||
# Build notes
|
||||
|
||||
## Build
|
||||
|
||||
```
|
||||
docker build --platform linux/amd64 -t {{IMAGE_NAME}}:{{TAG}} .
|
||||
```
|
||||
|
||||
## Run
|
||||
|
||||
```
|
||||
docker run --rm -p {{CONTAINER_PORT}}:{{CONTAINER_PORT}} {{IMAGE_NAME}}:{{TAG}}
|
||||
```
|
||||
|
||||
## Configuration
|
||||
|
||||
{{ENV_VARS_TABLE}}
|
||||
|
||||
## Notes
|
||||
|
||||
- Built for `linux/amd64`; multi-arch not tested.
|
||||
- Image size: `{{IMAGE_SIZE}}`, base: `{{BASE_IMAGE}}`.
|
||||
- Runs as a non-root user.
|
||||
{{EXTRA_NOTES}}
|
||||
25
.claude/skills/setup-project/templates/PR_BODY.md
Normal file
|
|
@ -0,0 +1,25 @@
|
|||
## Add a working Dockerfile
|
||||
|
||||
### Why
|
||||
{{REASON_PARAGRAPH}}
|
||||
|
||||
### What this adds
|
||||
- `Dockerfile` — {{DOCKERFILE_SHAPE}}
|
||||
- `.dockerignore`
|
||||
- `BUILD.md` with the build command and notes
|
||||
|
||||
### Tested
|
||||
- Built and pushed to a private registry, deployed to a Kubernetes cluster.
|
||||
- Pod has been Ready and serving HTTP 200 at the ingress for 10+ minutes of continuous probing before this PR was opened.
|
||||
- Image size: {{IMAGE_SIZE}}, base: {{BASE_IMAGE}}
|
||||
- Platform tested: `linux/amd64`
|
||||
|
||||
### Build command
|
||||
```
|
||||
docker build --platform linux/amd64 -t {{IMAGE_TAG}} .
|
||||
```
|
||||
|
||||
Happy to iterate on base image, build args, or multi-arch support if you'd prefer a different shape. Thanks for the project!
|
||||
|
||||
---
|
||||
<sub>Contributed after self-hosting this project. Filed by the repo owner's deployment workflow; feel free to mention me (@ViktorBarzin) with any follow-ups.</sub>
|
||||
199
.claude/skills/upgrade-state/SKILL.md
Normal file
|
|
@ -0,0 +1,199 @@
|
|||
---
|
||||
name: upgrade-state
|
||||
description: |
|
||||
Audit the three autonomous-upgrade pipelines (apps via Keel, OS via
|
||||
unattended-upgrades+kured, K8s components via the version-check chain).
|
||||
Use when:
|
||||
(1) User asks "/upgrade-state" or "are we current",
|
||||
(2) User asks "what's pending upgrade" or "what's the upgrade state",
|
||||
(3) User asks if Keel / kured / k8s-version-check is healthy,
|
||||
(4) User asks about kept-back / held packages or pending reboots,
|
||||
(5) Periodic survey before the next `k8s-version-check` daily run.
|
||||
Read-only — no `--fix`. Exits 0 healthy / 1 attention / 2 stalled.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-05-18
|
||||
---
|
||||
|
||||
# Upgrade-state
|
||||
|
||||
## MANDATORY: Run the script first
|
||||
|
||||
When this skill is invoked, your **first action** must be to run
|
||||
`upgrade_state.sh` and reason over its output before doing anything
|
||||
else. Do NOT improvise individual `kubectl` / `ssh` calls — the script
|
||||
is the authoritative surface.
|
||||
|
||||
```bash
|
||||
bash /home/wizard/code/infra/scripts/upgrade_state.sh
|
||||
```
|
||||
|
||||
For programmatic use:
|
||||
|
||||
```bash
|
||||
bash /home/wizard/code/infra/scripts/upgrade_state.sh --json | tee /tmp/upgrade-state.json
|
||||
```
|
||||
|
||||
Then:
|
||||
|
||||
1. Report the rendered table verbatim — it answers the user's
|
||||
"are we current" question in three lines.
|
||||
2. For every `⚠` or `✗` row, surface the relevant drill-down lines
|
||||
underneath and propose a next action (links in the table below).
|
||||
3. Only reach for ad-hoc commands when investigating beyond what the
|
||||
script reported.
|
||||
|
||||
Exit codes: `0` healthy, `1` attention warranted, `2` stalled / broken.
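
A wrapper that consumes the script programmatically could branch on the exit code before looking at the JSON. The sketch below assumes only the three exit codes documented above; the helper names `classify_exit` and `run_upgrade_state` are hypothetical, and the JSON payload is returned unparsed because its schema is not described here.

```python
import subprocess

# Exit-code meanings as documented for upgrade_state.sh.
EXIT_MEANINGS = {0: "healthy", 1: "attention", 2: "stalled"}


def classify_exit(code: int) -> str:
    """Map an exit code to its documented meaning ("unknown" otherwise)."""
    return EXIT_MEANINGS.get(code, "unknown")


def run_upgrade_state(
    script: str = "/home/wizard/code/infra/scripts/upgrade_state.sh",
) -> tuple[str, str]:
    """Run the script with --json; return (meaning, raw stdout)."""
    proc = subprocess.run(
        ["bash", script, "--json"], capture_output=True, text=True
    )
    return classify_exit(proc.returncode), proc.stdout
```

A non-zero exit here is an expected signal rather than a failure, which is why the sketch does not pass `check=True`.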
|
||||
|
||||
## What it covers (3 pipelines)
|
||||
|
||||
| Layer | What runs | Cadence | Data sources |
|---|---|---|---|
| **Apps** | Keel polls every watched Deployment's container registry; rolls on new digest | hourly | Prom (`pending_approvals`, `registries_scanned_total`), Keel pod logs |
| **OS** | `unattended-upgrades` in-release patching; `kured` reboots when `/var/run/reboot-required` is set | daily 02:00-06:00 London | SSH fan-out to all 5 nodes |
| **K8s** | `k8s-version-check` CronJob detects new kubeadm patch/minor; spawns the Job-chain that drains+upgrades node-by-node | daily 12:00 UTC | Pushgateway (`k8s_upgrade_*`), `kubectl get nodes` |
|
||||
|
||||
The K8s pipeline pushes a small set of gauges to the Prometheus
|
||||
Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`):
|
||||
|
||||
- `k8s_upgrade_available{kind="patch"|"minor",target=…}` — 1 if newer release detected
|
||||
- `k8s_version_check_last_run_timestamp` — when detection last ran
|
||||
- `k8s_upgrade_in_flight` — 0/1
|
||||
- `k8s_upgrade_started_timestamp` — when the current chain started (0 when idle)
|
||||
|
||||
`K8sUpgradeStalled` alert fires when `in_flight=1` and the chain has
|
||||
been running >90 minutes. The script raises `✗` in the same window.
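
The stall condition described above can be written as a small predicate over the two Pushgateway gauges. This is a sketch of the condition as stated in the text, not the alert rule or the script's actual implementation; treating a zero start timestamp as "idle" follows the gauge description, and the function name is hypothetical.

```python
import time

# Documented threshold: chain in flight for more than 90 minutes.
STALL_THRESHOLD_SECONDS = 90 * 60


def is_stalled(in_flight, started_timestamp, now=None):
    """Return True when the chain is in flight and older than the threshold.

    in_flight         -- value of k8s_upgrade_in_flight (0 or 1)
    started_timestamp -- value of k8s_upgrade_started_timestamp (0 when idle)
    now               -- current Unix time; defaults to time.time()
    """
    now = time.time() if now is None else now
    if in_flight != 1 or started_timestamp <= 0:
        return False
    return (now - started_timestamp) > STALL_THRESHOLD_SECONDS
```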
|
||||
|
||||
## Status-icon legend
|
||||
|
||||
| Icon | Meaning |
|---|---|
| `✓` | Healthy, fully current |
| `→` | Update available, not yet applied (K8s patch/minor) |
| `…` | In flight — chain currently running |
| `⚠` | Attention: held-with-bumps, recent errors, pending approvals |
| `✗` | Broken: pod down, alert firing, chain stalled |
|
||||
|
||||
## Drill-down — when a row trips, what to do
|
||||
|
||||
### Apps `⚠` — pending approvals or errors
|
||||
|
||||
```bash
|
||||
# Read recent Keel log lines
|
||||
kubectl -n keel logs deploy/keel --since=24h --tail=200
|
||||
|
||||
# What is Keel currently tracking?
|
||||
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
|
||||
wget -qO- 'http://localhost:9090/api/v1/query?query=count%20by%20(image)%20(registries_scanned_total)'
|
||||
|
||||
# Is the scrape live?
|
||||
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
|
||||
wget -qO- 'http://localhost:9090/api/v1/query?query=up{job="kubernetes-pods",app="keel"}'
|
||||
```
|
||||
|
||||
Common Keel errors:
|
||||
- `failed to add image watch job` — image annotation mistyped (rare; Kyverno auto-injects)
|
||||
- `registry authentication required` — bad imagePullSecret on the watched Deployment
|
||||
- `bad tag pattern` — Keel can't parse the watched image's tag against its policy
|
||||
|
||||
### OS `⚠` — held packages with bumps
|
||||
|
||||
The script flags any package held via `apt-mark hold` that ALSO appears
|
||||
in `apt list --upgradable` — excluding k8s components (the K8s pipeline
|
||||
owns those) and the kernel (kured handles the reboot half).
|
||||
|
||||
Typical cause: a major-version bump (e.g. containerd 1.7 → 2.2,
|
||||
runc 1.1 → 1.4). These are held because they need cluster-wide
|
||||
coordination, not silent in-release patching.
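
The "held with bumps" check amounts to a set intersection with an exclusion filter. The sketch below illustrates that logic only; the exclusion prefixes (`kubeadm`, `kubelet`, `kubectl`, `linux-`) are assumptions standing in for "k8s components and the kernel", and the script's real list is not shown here.

```python
def held_with_bumps(
    held,
    upgradable,
    excluded_prefixes=("kubeadm", "kubelet", "kubectl", "linux-"),
):
    """Return held packages that also have an upgrade available.

    held       -- package names from `apt-mark showhold`
    upgradable -- package names from `apt list --upgradable`
    Packages matching an excluded prefix (assumed list) are skipped.
    """
    flagged = set(held) & set(upgradable)
    prefixes = tuple(excluded_prefixes)
    return sorted(p for p in flagged if not p.startswith(prefixes))
```

For example, with `containerd`, `kubelet` and `runc` held and `containerd`, `kubelet` and `curl` upgradable, only `containerd` is flagged.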
|
||||
|
||||
```bash
|
||||
# Inspect the situation on the flagged node
|
||||
ssh wizard@10.0.20.10X 'apt-mark showhold; apt list --upgradable 2>/dev/null'
|
||||
|
||||
# Unhold + upgrade a specific package
|
||||
ssh wizard@10.0.20.10X 'sudo apt-mark unhold containerd && sudo apt-get install -y containerd'
|
||||
```
|
||||
|
||||
Node IPs: master=`100`, node1=`101`, node2=`102`, node3=`103`, node4=`104`.
|
||||
|
||||
### OS `⚠` — pending reboot
|
||||
|
||||
A node has `/var/run/reboot-required`. Kured will reboot it inside the
|
||||
next 02:00-06:00 London window (any day of the week).
|
||||
|
||||
```bash
|
||||
# Force a manual reboot inside the window (rare)
kubectl drain k8s-nodeX --delete-emptydir-data --ignore-daemonsets
ssh wizard@10.0.20.10X sudo systemctl reboot
# A manually drained node stays cordoned; uncordon it once it is back and Ready
kubectl uncordon k8s-nodeX
|
||||
```
|
||||
|
||||
### OS `✗` — kured not Running
|
||||
|
||||
```bash
|
||||
kubectl -n kured get pods
|
||||
kubectl -n kured logs daemonset/kured --tail=100
|
||||
# Verify sentinel gate (kured-sentinel-gate DaemonSet writes /var/run/gated-reboot-required)
|
||||
kubectl -n kured get pods -l name=kured-sentinel-gate
|
||||
```
|
||||
|
||||
### K8s `→` — patch/minor available
|
||||
|
||||
Detection ran, target identified, chain NOT started. The chain spawns
|
||||
on the same daily detection cycle — typically within ~24h of the
|
||||
target first being detected.
|
||||
|
||||
```bash
|
||||
# Inspect Pushgateway state
|
||||
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- \
|
||||
wget -qO- 'http://prometheus-prometheus-pushgateway:9091/metrics' | grep ^k8s_upgrade
|
||||
|
||||
# Trigger a manual run of the detection CronJob
|
||||
kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check manual-detect-$(date +%s)
|
||||
```
|
||||
|
||||
### K8s `…` — in flight
|
||||
|
||||
The Job chain is running. Watch its progress:
|
||||
|
||||
```bash
|
||||
kubectl -n k8s-upgrade get jobs --sort-by=.metadata.creationTimestamp
|
||||
kubectl -n k8s-upgrade logs -l app=k8s-version-upgrade --tail=200 --prefix
|
||||
```
|
||||
|
||||
### K8s `✗ stalled` — `K8sUpgradeStalled` would fire
|
||||
|
||||
Chain in-flight >90m. The Job is most likely stuck on drain or a
|
||||
pre-flight check.
|
||||
|
||||
```bash
|
||||
kubectl -n k8s-upgrade get jobs
|
||||
kubectl -n k8s-upgrade describe job <stuck-job>
|
||||
kubectl -n k8s-upgrade logs job/<stuck-job> --tail=300
|
||||
|
||||
# If you need to clear the in-flight flag (after diagnosing):
|
||||
kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- sh -c \
|
||||
"printf 'k8s_upgrade_in_flight 0\nk8s_upgrade_started_timestamp 0\n' | \
|
||||
wget -qO- --post-file=- 'http://prometheus-prometheus-pushgateway:9091/metrics/job/k8s-version-upgrade' \
|
||||
--header='Content-Type: text/plain'"
|
||||
```
|
||||
|
||||
### K8s `✗ detection stale` — last detection >9 days
|
||||
|
||||
```bash
|
||||
kubectl -n k8s-upgrade get cronjob k8s-version-check
|
||||
kubectl -n k8s-upgrade get jobs --sort-by=.metadata.creationTimestamp | tail -5
|
||||
```
|
||||
|
||||
If the CronJob hasn't fired on time, suspect:
|
||||
- `suspend=true` on the CronJob (`var.enabled=false` in the
|
||||
`k8s-version-upgrade` Terraform stack)
|
||||
- Image-pull failure on the version-check pod
|
||||
- Pushgateway scrape gone stale
|
||||
|
||||
## Companion command-line flags
|
||||
|
||||
```bash
|
||||
bash infra/scripts/upgrade_state.sh # rendered table (default)
|
||||
bash infra/scripts/upgrade_state.sh --json # machine output
|
||||
bash infra/scripts/upgrade_state.sh --kubeconfig X # override kubeconfig
|
||||
```
|
||||
173
.claude/skills/uptime-kuma/SKILL.md
Normal file
|
|
@ -0,0 +1,173 @@
|
|||
---
|
||||
name: uptime-kuma
|
||||
description: |
|
||||
Manage Uptime Kuma monitoring via the Python API. Use when:
|
||||
(1) User asks to add, remove, or list monitors,
|
||||
(2) User asks about service uptime or monitoring status,
|
||||
(3) User asks to check what's being monitored,
|
||||
(4) User deploys a new service and needs monitoring added,
|
||||
(5) User mentions "uptime", "monitoring", "health check", or "uptime kuma".
|
||||
Uptime Kuma v2 running in Kubernetes, managed via uptime-kuma-api Python library.
|
||||
author: Claude Code
|
||||
version: 1.0.0
|
||||
date: 2026-02-14
|
||||
---
|
||||
|
||||
# Uptime Kuma Monitoring Management
|
||||
|
||||
## Overview
|
||||
- **URL**: `https://uptime.viktorbarzin.me`
|
||||
- **Internal**: `uptime-kuma.uptime-kuma.svc.cluster.local:80`
|
||||
- **Image**: `louislam/uptime-kuma:2`
|
||||
- **Storage**: NFS at `/mnt/main/uptime-kuma` -> `/app/data`
|
||||
- **API Library**: `uptime-kuma-api` (pip, available via PYTHONPATH)
|
||||
- **Credentials**: admin / (from `UPTIME_KUMA_PASSWORD` env var)
|
||||
|
||||
## Python API Access
|
||||
|
||||
### Connection Pattern
|
||||
```python
|
||||
import os
|
||||
from uptime_kuma_api import UptimeKumaApi, MonitorType
|
||||
|
||||
api = UptimeKumaApi('https://uptime.viktorbarzin.me')
|
||||
api.login('admin', os.environ.get('UPTIME_KUMA_PASSWORD', ''))
|
||||
|
||||
# ... operations ...
|
||||
|
||||
api.disconnect()
|
||||
```
|
||||
|
||||
### Execution
|
||||
```bash
|
||||
python3 -c "
|
||||
import os
|
||||
from uptime_kuma_api import UptimeKumaApi, MonitorType
|
||||
api = UptimeKumaApi('https://uptime.viktorbarzin.me')
|
||||
api.login('admin', os.environ.get('UPTIME_KUMA_PASSWORD', ''))
|
||||
# ... your code ...
|
||||
api.disconnect()
|
||||
"
|
||||
```
|
||||
|
||||
### Common Operations
|
||||
|
||||
#### List All Monitors
|
||||
```python
|
||||
monitors = api.get_monitors()
|
||||
for m in monitors:
|
||||
print(f'{m["id"]:3d} | {m["name"]:30s} | {m["type"]:15s} | interval={m["interval"]}s')
|
||||
```
|
||||
|
||||
#### Add HTTP Monitor
|
||||
```python
|
||||
api.add_monitor(
|
||||
type=MonitorType.HTTP,
|
||||
name="Service Name",
|
||||
url="http://service.namespace.svc.cluster.local",
|
||||
interval=120,
|
||||
maxretries=2,
|
||||
)
|
||||
```
|
||||
|
||||
#### Add PING Monitor
|
||||
```python
|
||||
api.add_monitor(
|
||||
type=MonitorType.PING,
|
||||
name="Host Name",
|
||||
hostname="10.0.20.1",
|
||||
interval=30,
|
||||
maxretries=3,
|
||||
)
|
||||
```
|
||||
|
||||
#### Add PORT Monitor
|
||||
```python
|
||||
api.add_monitor(
|
||||
type=MonitorType.PORT,
|
||||
name="Service Port",
|
||||
hostname="service.namespace.svc.cluster.local",
|
||||
port=8080,
|
||||
interval=120,
|
||||
maxretries=2,
|
||||
)
|
||||
```
|
||||
|
||||
#### Edit Monitor
|
||||
```python
|
||||
api.edit_monitor(monitor_id, interval=120, maxretries=2)
|
||||
```
|
||||
|
||||
#### Delete Monitor
|
||||
```python
|
||||
api.delete_monitor(monitor_id)
|
||||
```
|
||||
|
||||
#### Pause/Resume Monitor
|
||||
```python
|
||||
api.pause_monitor(monitor_id)
|
||||
api.resume_monitor(monitor_id)
|
||||
```
|
||||
|
||||
## Monitor Types
|
||||
- `MonitorType.HTTP` — HTTP(S) endpoint check
|
||||
- `MonitorType.PING` — ICMP ping
|
||||
- `MonitorType.PORT` — TCP port check
|
||||
- `MonitorType.POSTGRES` — PostgreSQL connection
|
||||
- `MonitorType.REDIS` — Redis connection
|
||||
- `MonitorType.DNS` — DNS resolution check
|
||||
|
||||
## Tiered Monitoring System
|
||||
|
||||
Monitors use tiered intervals to balance responsiveness with resource usage:
|
||||
|
||||
| Tier | Interval | Retries | Use For |
|------|----------|---------|---------|
| **1 - Critical** | 30s | 3 | Core infra (DNS, gateway, ingress, NFS, K8s API, auth, mail) |
| **2 - Important** | 120s | 2 | Actively used services (Nextcloud, Immich, Vaultwarden, etc.) |
| **3 - Standard** | 300s | 1 | Auxiliary/optional services (blog, games, tools) |
|
||||
|
||||
### Tier Assignment Guidelines
|
||||
- **Tier 1**: If it goes down, multiple other services fail or the cluster is unreachable
|
||||
- **Tier 2**: User-facing services that are actively used daily
|
||||
- **Tier 3**: Nice-to-have services, tools, dashboards
|
||||
|
||||
### When Adding a New Service
|
||||
Match the tier to the service's DEFCON level from CLAUDE.md:
|
||||
- DEFCON 1-2 → Tier 1 (30s)
|
||||
- DEFCON 3-4 → Tier 2 (120s)
|
||||
- DEFCON 5 → Tier 3 (300s)
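
The DEFCON-to-tier mapping above can be captured in a small lookup so that `interval` and `maxretries` are never typed by hand. This is a minimal sketch using the values from the tier table; the names `Tier`, `TIERS` and `tier_for_defcon` are hypothetical.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    number: int
    interval: int    # seconds between checks
    maxretries: int  # retries before the monitor is marked down


# Values taken from the tier table.
TIERS = {1: Tier(1, 30, 3), 2: Tier(2, 120, 2), 3: Tier(3, 300, 1)}


def tier_for_defcon(defcon: int) -> Tier:
    """Map a DEFCON level (1-5) to its monitoring tier."""
    if defcon in (1, 2):
        return TIERS[1]
    if defcon in (3, 4):
        return TIERS[2]
    if defcon == 5:
        return TIERS[3]
    raise ValueError(f"DEFCON level must be 1-5, got {defcon!r}")
```

The result can be passed straight to `api.add_monitor(..., interval=tier.interval, maxretries=tier.maxretries)`.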
|
||||
|
||||
## Internal Service URL Pattern
|
||||
Most K8s services follow: `http://<service-name>.<namespace>.svc.cluster.local:<port>`
|
||||
|
||||
Common port is 80. Exceptions:
|
||||
- Homepage: port 3000
|
||||
- Ollama: port 11434
|
||||
- Loki: port 3100 (use `/ready` endpoint)
|
||||
- Traefik dashboard: port 8080 (use `/dashboard/` path)
|
||||
- K8s API: `https://10.0.20.100:6443`
|
||||
- Immich: port 2283 (use `/api/server/ping`)
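
The URL pattern above is easy to get wrong by hand, so a small builder can help. This is an illustrative sketch; `internal_url` is a hypothetical helper, and omitting the port when it is 80 is a stylistic choice, not a requirement.

```python
def internal_url(service: str, namespace: str, port: int = 80, path: str = "") -> str:
    """Build http://<service>.<namespace>.svc.cluster.local[:port][/path]."""
    base = f"http://{service}.{namespace}.svc.cluster.local"
    if port != 80:
        base += f":{port}"
    if path and not path.startswith("/"):
        path = "/" + path
    return base + path
```

For a service on a non-default port with a health path, `internal_url("myservice", "mynamespace", 3100, "/ready")` yields the URL to use in an HTTP monitor (the service and namespace names here are placeholders).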
|
||||
|
||||
## Notes
|
||||
1. Uptime Kuma uses Socket.IO (WebSocket) for its API, not REST
|
||||
2. The `uptime-kuma-api` Python library wraps Socket.IO
|
||||
3. Add `time.sleep(0.3)` between bulk operations to avoid overloading
|
||||
4. Homepage dashboard widget slug: `cluster-internal`
|
||||
5. Cloudflare-proxied at `uptime.viktorbarzin.me`
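
The throttling advice in note 3 can be sketched as a thin loop around `add_monitor`. The helper name `add_monitors_throttled` is hypothetical; `api` is expected to be a logged-in `UptimeKumaApi` instance (or anything exposing `add_monitor(**kwargs)`), and the `sleep` parameter exists only so the pause can be swapped out.

```python
import time


def add_monitors_throttled(api, specs, delay=0.3, sleep=time.sleep):
    """Add monitors one by one, pausing `delay` seconds between calls.

    api   -- object exposing add_monitor(**kwargs)
    specs -- iterable of keyword-argument dicts, one per monitor
    """
    results = []
    for index, spec in enumerate(specs):
        if index > 0:
            sleep(delay)  # pause between operations, not before the first
        results.append(api.add_monitor(**spec))
    return results
```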
|
||||
|
||||
## Terraform-Managed Monitors
|
||||
|
||||
There is NO `louislam/uptime-kuma` Terraform provider. Two patterns exist for
|
||||
declarative monitor management in this stack:
|
||||
|
||||
- **External HTTPS monitors** — auto-discovered from ingress annotations by the
|
||||
`external-monitor-sync` CronJob (`*/10 * * * *`). Opt-out via
|
||||
`uptime.viktorbarzin.me/external-monitor: "false"` on the ingress.
|
||||
- **Internal monitors (DBs, non-HTTP)** — declared in the
|
||||
`local.internal_monitors` list in `stacks/uptime-kuma/modules/uptime-kuma/main.tf`
|
||||
and synced by the `internal-monitor-sync` CronJob. To add one, append to the
|
||||
list (provide `name`, `type`, `database_connection_string`,
|
||||
`database_password_vault_key`, `interval`, `retry_interval`, `max_retries`)
|
||||
and `scripts/tg apply`. The sync is idempotent — looks up by name, creates
|
||||
if missing, patches if drifted. Existing monitors keep their id and history.
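
The idempotent behaviour described for the internal sync (look up by name, create if missing, patch if drifted) can be sketched as a pure planning function. This illustrates the pattern only and is not the CronJob's actual code; the function name and the compared fields are assumptions.

```python
def plan_sync(desired, existing, compared_fields=("type", "interval")):
    """Return a list of (action, name) pairs needed to reach `desired`.

    desired  -- list of monitor dicts declared in configuration
    existing -- list of monitor dicts currently present
    A monitor is matched by name, created if absent, and patched if any
    compared field differs. Matching monitors produce no action, so a
    second run over a converged state returns an empty plan.
    """
    existing_by_name = {m["name"]: m for m in existing}
    actions = []
    for want in desired:
        have = existing_by_name.get(want["name"])
        if have is None:
            actions.append(("create", want["name"]))
        elif any(have.get(f) != want.get(f) for f in compared_fields):
            actions.append(("patch", want["name"]))
    return actions
```

Because patching reuses the matched monitor rather than recreating it, the existing id and history are preserved.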
|
||||