diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index d2e581f4..9c873a07 100755
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -24,8 +24,8 @@
 Violations cause state drift, which causes future applies to break or silently revert changes.
 
 ## Instructions
-- **"remember X"**: Use `memory-tool store "content" --category facts --tags "tag1,tag2"` (via exec) for persistent cross-session memory. Also update this file + `AGENTS.md` (if shared knowledge), commit with `[ci skip]`. To recall: `memory-tool recall "query"`. To list: `memory-tool list`. To delete: `memory-tool delete `. The native `memory_search` and `memory_get` tools are also available for searching indexed memory files. For **storing** new memories, always use the `memory-tool` CLI via exec.
-- **Apply**: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred — handles state decrypt/encrypt) or `terragrunt` directly. `scripts/tg` adds `-auto-approve` for `--non-interactive` applies.
+- **"remember X"**: store to the remote claude-memory store via the **`homelab memory` CLI**: `homelab memory store "content" --category facts --tags "tag1,tag2"` (also `recall "query"` / `update ` / `list` / `delete `). For shared knowledge, also update the relevant CLAUDE.md / `AGENTS.md`. (Supersedes the old `memory-tool` CLI **and** the claude-memory MCP — both retired 2026-06-21; the homelab CLI hits the same remote HTTP API. Recall also runs automatically each turn via a UserPromptSubmit hook.)
+- **Apply**: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred — handles state decrypt/encrypt) or `terragrunt` directly. `scripts/tg` adds `-auto-approve` for `--non-interactive` applies, and `-lock-timeout` (default `5m`, override via `TG_LOCK_TIMEOUT`) on every state-locking verb (`plan`/`apply`/`destroy`/`refresh`) so a contended state lock **waits** instead of failing instantly with `Error acquiring the state lock`.
 - **New services need CI/CD** and **monitoring** (Prometheus/Uptime Kuma). CI = a GHA workflow on the repo's GitHub mirror (build + tests off-infra, ADR-0002); Woodpecker gets a deploy-only pipeline — never an in-cluster build.
 - **New service**: Use `setup-project` skill for full workflow
 - **Ingress**: `ingress_factory` module. **Auth** (`auth` string enum, default `"required"` — fail-closed). Pick by asking "what gates the app?":
@@ -47,7 +47,7 @@ Violations cause state drift, which causes future applies to break or silently r
 
 ## Terraform State — Two-Tier Backend
 - **Tier 0 (bootstrap)**: Local state, SOPS-encrypted in git. Stacks: `infra`, `platform`, `cnpg`, `vault`, `dbaas`, `external-secrets`. These must exist before PG is reachable.
-- **Tier 1 (everything else)**: PostgreSQL backend (`pg`) on CNPG cluster at `pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`. Native `pg_advisory_lock` for concurrent safety. Each stack gets its own PG schema.
+- **Tier 1 (everything else)**: PostgreSQL backend (`pg`) on CNPG cluster at `pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`. Native `pg_advisory_lock` for concurrent safety. Each stack gets its own PG schema. **Lock contention is non-fatal**: `scripts/tg` passes `-lock-timeout` (default `5m`) so a contended lock waits rather than hard-failing — this was the #1 cause of infra CI failures (a Woodpecker-killed run's unreaped PG lock, a concurrent local apply, or the daily drift `plan`; Tier-1 stacks have no Vault advisory-lock skip to fall back on, unlike Tier-0).
 - **Auth**: `scripts/tg` auto-fetches PG credentials from Vault (`database/static-creds/pg-terraform-state`). Humans use `vault login -method=oidc`, agents use K8s auth (role: `terraform-state`, namespace: `claude-agent`).
 - **Tier 0 workflow** (unchanged): `git pull` → `scripts/tg plan` → `scripts/tg apply` → `git push`. State sync via SOPS is transparent.
 - **Tier 1 workflow**: `vault login -method=oidc` → `scripts/tg plan` → `scripts/tg apply`. No git commit needed — PG is authoritative.
@@ -63,7 +63,7 @@ Violations cause state drift, which causes future applies to break or silently r
 - **`secret/viktor`** — go-to path for ALL personal secrets (135 keys). Contains every API key, token, password, SSH key, and config from the old terraform.tfvars. Check here first: `vault kv get -field=KEY secret/viktor`.
 - **Auth**: `vault login -method=oidc` (Authentik SSO) → `~/.vault-token` → read by Vault TF provider.
 - **Vault stack self-reads**: `data "vault_kv_secret_v2" "vault"` reads its own OIDC creds from `secret/vault`.
-- **ESO (External Secrets Operator)**: `stacks/external-secrets/` — 43 ExternalSecrets + 9 DB-creds ExternalSecrets. API version `v1beta1`. Two ClusterSecretStores: `vault-kv` and `vault-database`.
+- **ESO (External Secrets Operator)**: `stacks/external-secrets/` — chart **2.6.0 / app v2.6.0** (migrated 0.12.1→2.6.0 on 2026-06-22, one minor at a time; helm_release has `atomic=true`). **~104 ExternalSecrets across 73 files**, all on **API version `v1`** (migrated v1beta1→v1 on 2026-06-22 — there is NO v1beta1→v1 conversion webhook, so all CRs were rewritten to v1 on chart 0.16.2 before 0.17 removed v1beta1; see `docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md`). Two ClusterSecretStores: `vault-kv` and `vault-database`. (2 pre-existing dead ESs — instagram-poster, payslip-ingest — fail "cannot find secret data" on missing Vault keys, unrelated.)
 - **Plan-time pattern**: Former plan-time stacks use `data "kubernetes_secret"` to read ESO-created K8s Secrets at plan time (no Vault dependency). First-apply gotcha: must `terragrunt apply -target=kubernetes_manifest.external_secret` first, then full apply. `count` on resources using secret values fails — remove conditional counts.
- **14 hybrid stacks** still keep `data "vault_kv_secret_v2"` for plan-time needs (job commands, Helm templatefile, module inputs). Platform has 48 plan-time refs — no migration possible without restructuring modules. - **Database rotation**: Vault DB engine rotates passwords every 7 days (604800s). MySQL: speedtest, wrongmove, codimd, nextcloud, shlink, grafana, phpipam. PostgreSQL: health, linkwarden, affine, woodpecker, claude_memory, crowdsec, technitium. Excluded: authentik (PgBouncer), root users. **Apps that read a rotated secret only at startup** (env var / initContainer, not a hot-reloaded mount) MUST carry a Reloader annotation (`secret.reloader.stakater.com/reload: `) or they keep the stale password and silently fail DB auth on each rotation until manually restarted — matrix's Synapse `inject-db-password` initContainer hit exactly this (found via Loki 2026-06-05, ~12.9k auth-fail lines/hr); matrix has since migrated to tuwunel (RocksDB, no Postgres) on 2026-06-08 and is no longer in the rotation list above. Technitium uses a password-sync CronJob (every 6h) to push rotated password to the Technitium app config via API, disable SQLite + MySQL logging, check PG plugin is loaded, configure PG query logging (90-day retention), and disable SQLite on secondary/tertiary instances. @@ -130,7 +130,7 @@ ghcr, NOT DockerHub), kms-website, Freedify, instagram-poster, payslip-ingest, broker-sync (image `wealthfolio-sync`), fire-planner, recruiter-responder, x402-gateway — plus tripit. Earlier public-repo apps already on GHA (Website, apple-health-data, audiblez-web, plotting-book, insta2spotify, -audiobook-search, council-complaints) now also land on ghcr. +audiobook-search) now also land on ghcr. - **PUBLIC ghcr packages:** beadboard, nextcloud-todos, claude-agent-service, claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, chrome-service-novnc, android-emulator. @@ -202,7 +202,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`). 
 - **Critical path services scaled to 3**: Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared.
 - **PDBs**: minAvailable=2 on Traefik and Authentik.
 - **Fallback proxies**: basicAuth when Authentik is down, fail-open when poison-fountain is down.
-- **CrowdSec bouncer**: graceful degradation mode (fail-open on error).
+- **CrowdSec enforcement is out-of-band** (no Traefik plugin/middleware — the dead Yaegi `crowdsec-bouncer-traefik-plugin` was removed on Traefik 3.7.5): banned IPs are dropped **in-kernel via nftables** by the `cs-firewall-bouncer` DaemonSet on **direct** hosts (drops in BOTH the `input` and `forward` hooks — Traefik is ETP=Local so client traffic is DNAT'd to the pod via `forward`; pulls ALL decisions incl. the ~31k CAPI blocklist), and **blocked at the Cloudflare edge** for **proxied** hosts (one `crowdsec_ban` Rules List + a zone WAF block rule, fed by the `crowdsec-cf-sync` CronJob in `rybbit` ns every 2 min — excludes CAPI). Zero per-request latency; **fails open** (LAPI down → no new bans, existing drops persist, legit traffic never blocked). Whitelist covers RFC1918 + tailnet + internal CIDRs. Full as-built: `docs/architecture/security.md`.
 - **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations).
 - **Retry middleware**: 2 attempts, 100ms — in default ingress chain.
 - **Entrypoint transport timeouts** (`websecure` `respondingTimeouts`): `writeTimeout=0` (unlimited download duration), `readTimeout=3600s` (uploads ≤1h), `idleTimeout=600s`. These are **HARD total-duration caps**, not nginx-style per-read idle timeouts — a finite `writeTimeout` truncates *any* large download at that wall-clock mark (a prior `writeTimeout=60s` silently cut Immich videos at 60s). **Do NOT re-tighten `writeTimeout`**; keep `readTimeout` finite (slow-loris backstop) but ≥ longest expected upload.
Full rationale: `docs/architecture/networking.md` → "Entrypoint Transport Timeouts". @@ -216,7 +216,7 @@ the workflow's built-in `GITHUB_TOKEN` (`packages: write`). |---------|--------------------------| | Nextcloud | MaxRequestWorkers=150, needs 8Gi limit (Apache transient memory spikes, see commit eb94144), very generous startup probe | | Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). 
The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. 
To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. | -| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob | +| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob. 
**Enforcement is out-of-band, NOT a Traefik plugin** (the Yaegi `crowdsec-bouncer-traefik-plugin` was dead on Traefik 3.7.5 and removed): `cs-firewall-bouncer` DaemonSet drops in-kernel via nftables on direct hosts (bouncer key `firewall`, v0.0.34 binary fetched at runtime, hostNetwork+NET_ADMIN, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`); `crowdsec-cf-sync` CronJob blocks at the CF edge for proxied hosts (bouncer key `kvsync`, `stacks/rybbit/crowdsec_edge.tf`). Both fail open. See `docs/architecture/security.md` |
 | Frigate | GPU stall detection in liveness probe (inference speed check), high CPU |
 | Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control. |
 | Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version |
diff --git a/.claude/home-assistant-sofia.py b/.claude/home-assistant-sofia.py
index b0ccdca7..d8121f6c 100644
--- a/.claude/home-assistant-sofia.py
+++ b/.claude/home-assistant-sofia.py
@@ -7,6 +7,7 @@ Control and query Home Assistant entities on ha-sofia.viktorbarzin.me.
 import argparse
 import json
 import os
+import subprocess
 import sys
 from urllib.parse import urljoin
 
@@ -17,13 +18,29 @@ except ImportError:
     print(" pip install requests")
     sys.exit(1)
 
-# Configuration from environment variables (ha-sofia specific)
-HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/")
-HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN")
 
-if not HA_URL or not HA_TOKEN:
-    print("ERROR: HOME_ASSISTANT_SOFIA_URL and HOME_ASSISTANT_SOFIA_TOKEN environment variables must be set.")
-    print("These should be set when activating the Claude venv (~/.venvs/claude)")
+def _token_from_homelab():
+    """Resolve the token via the homelab CLI when the env var isn't set, so the
+    script works from any directory / unprovisioned session (see ADR-0012)."""
+    try:
+        out = subprocess.run(
+            ["homelab", "ha", "token", "--instance", "sofia"],
+            capture_output=True, text=True, timeout=30)
+        if out.returncode == 0 and out.stdout.strip():
+            return out.stdout.strip()
+    except Exception:
+        pass
+    return None
+
+
+# Configuration: prefer env vars (set by the Claude venv); otherwise fall back to
+# defaults + the homelab CLI so the script is not cwd/env dependent (ADR-0012).
+HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/") or "https://ha-sofia.viktorbarzin.me"
+HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN") or _token_from_homelab()
+
+if not HA_TOKEN:
+    print("ERROR: no ha-sofia API token available.")
+    print("Set HOME_ASSISTANT_SOFIA_TOKEN, or ensure `homelab ha token` works (kubeconfig reachable).")
     sys.exit(1)
 
 HEADERS = {
diff --git a/.claude/reference/authentik-state.md b/.claude/reference/authentik-state.md
index 125d1a71..2ff86141 100644
--- a/.claude/reference/authentik-state.md
+++ b/.claude/reference/authentik-state.md
@@ -166,7 +166,8 @@ Pinned via Terraform in `stacks/authentik/`:
 
 | Knob | Value | Surface | Effect |
 |------|-------|---------|--------|
-| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. |
+| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. Used by password login (`default-authentication-flow`) AND passkey login (`webauthn` flow — both terminate on this stage). |
+| `UserLoginStage.session_duration` on `default-source-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_source_login` in `authentik_provider.tf` (imported 2026-06-20, id `4c6977d2-…`) | **Social logins** (Google/GitHub/Facebook, via `default-source-authentication-flow`). Was the provider default `seconds=0`, which fell back to `UNAUTHENTICATED_AGE=hours=2` — so social logins expired every **2h** while password/passkey lasted 4 weeks. Pinned `weeks=4` on 2026-06-20 to make all login paths consistent.
(Surfaced when the 2026-06-18 passkey wipe forced fallback to Google login → "re-login multiple times daily".) |
 | `ProxyProvider.access_token_validity` on `Provider for Domain wide catch all` | `weeks=4` | `authentik_provider_proxy.catchall.access_token_validity` in `authentik_provider.tf` | Cookie `Max-Age` on `authentik_proxy_*` and `expires` on rows in `authentik_providers_proxy_proxysession`. Bumped 2026-05-10 from `hours=168`. **Bumping requires `kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost`** — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs `"reusing existing session store"` and skips rebuild. |
 | `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. |
 
@@ -177,6 +178,13 @@ Notes:
 - The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts).
 - `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`.
 - The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds.
+
+## WebAuthn / Passkeys (2026-06-20)
+
+- **Passkey devices live in the DB, NOT Terraform** (`WebAuthnDevice` model).
They are user-owned; no TF resource or blueprint manages them. Re-enroll via the user settings UI (Authentik → Settings → MFA Devices → register a security key / passkey).
+- **2026-06-18 wipe (root cause of the "WebAuthn broke" incident):** all 6 of Viktor's passkeys were deleted (`WebAuthnDevice.objects.count()` → 0) at 19:27 by an **ad-hoc tripit passkey E2E test** run from the devvm (`python-httpx/0.28.1`, as `akadmin`). The test cleanup did `GET /core/users/?search={demo}` (a **fuzzy** search) then `DELETE /api/v3/authenticators/admin/webauthn/{pk}/` for each device of `users[0]` — but `users[0]` resolved to the **real** account, not the intended demo user. **Lesson:** any future passkey-test cleanup MUST exact-match the demo user (`username == demo`), never `users[0]` of a fuzzy `?search=`. It was a one-off ad-hoc script (no committed/scheduled copy), so nothing auto-re-deletes — re-enrollment is safe.
+- **Passkey login path itself is intact:** the identification stage's `passwordless_flow` → `webauthn` flow (UI-managed, in `ignore_changes`); the break was purely the missing device records.
+- **Provider-schema gotcha:** the pinned authentik TF provider's `authentik_stage_identification` resource exposes **no** `webauthn_stage` or `enable_remember_me` attribute (they exist on the app *model*, not in the provider schema). Do NOT add them to `ignore_changes` — `tg plan` errors `Unsupported attribute`. They are purely UI/app-managed. (Commit `4e882989` removed them for exactly this reason; re-adding breaks every apply.)
 - ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead.
Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns).
 - **Outpost (2026-06-10):** `log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config.
 - **Image tag is PINNED in values (`global.image.tag`), 2026-06-10:** Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion — an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see `docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md`). Before touching this chart, check the live image tag and refresh the pin.
diff --git a/.claude/reference/service-catalog.md b/.claude/reference/service-catalog.md
index 516cd63f..cd7b5274 100644
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
@@ -32,7 +32,7 @@
 |---------|-------------|-------|
 | k8s-dashboard | Kubernetes dashboard at `k8s.viktorbarzin.me`. **Forward-auth + auto-injected SA token** (apiserver OIDC blocked, see design §12).
nginx token-injector (`dashboard_injector.tf`) maps `X-authentik-username` → the user's `dashboard-` SA token (ns admin + read-only on namespace-list/nodes only via `dashboard-nav-readonly` — no cross-tenant reads, `rbac/.../dashboard-sa.tf`; admins → cluster-admin SA) and sets `Authorization: Bearer` → no token-paste, dashboard auto-authenticates per user. Forward-auth admits `kubernetes-*` groups for this host (`stacks/authentik/admin-services-restriction.tf`). oauth2-proxy + `k8s-dashboard` OIDC app built but idle. | k8s-dashboard | | reverse-proxy | Generic reverse proxy | reverse-proxy | -| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. 
Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. **t3 AUTO-TRACKS the `nightly` npm dist-tag** (Viktor 2026-06-16, reversing the post-2026-06-09 pin; churn risk accepted) — `t3-autoupdate` is a daily GATED tracker that follows `t3@nightly` but gates every bump so a bad build self-heals: downgrade-guard → pre-bump `VACUUM INTO` backup → health-check that SEEDS a copy of a real POPULATED `state.sqlite` to exercise the forward migration + the real mint→exchange→`t3_session` pairing handshake → canary-restart idle instances ONE AT A TIME with per-instance dispatch pairing verify → auto-rollback to last-good + self-freeze on failure (active-agent instances deferred, never killed; last-good in `/var/lib/t3-autoupdate/last-good`). The 2026-06-09 outage was the SAME nightly channel WITHOUT these gates. Freeze/revert now: `sudo touch /etc/t3-autoupdate.freeze` (or set `T3_PIN=` to hard-pin); preview a build with `T3_DRY_RUN=1`. Channel via `T3_TRACK` in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). Full ops + manual rollback: `docs/runbooks/t3-version-bump.md`. `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. 
Design: `docs/plans/2026-06-01-t3-auto-provision-*`. **Drop attribution (2026-06-10):** `t3-probe` Deployment (same ns) holds differential legs — `cloudflare` (full public path via DoH-pinned DNS), `internal` (Traefik LB only), `t3serve` (devvm:3773 direct) — against dispatch's unauthenticated `/probe` carve-out (walloff-guarded); Prometheus job `t3-probe`, alerts `T3ProbeLegDown`/`T3ProbeDropBurst`, runbook `docs/runbooks/t3-drop-attribution.md`. `t3-serve@` units carry memory containment (`MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0/OOMPolicy=continue`) so a runaway agent OOMs alone instead of freezing devvm. **Connection logs (2026-06-11):** `t3-dispatch` logs every `/ws` open/close with `dur_ms` + `cause` (`downstream_closed`=client/CF/Traefik hung up → last-mile; `upstream_closed`=t3-serve closed; `graceful`); devvm journald now ships to Loki via `scripts/devvm-promtail.*` (`{job="devvm-journal"}` + `{job="sshd-devvm"}`), joining Traefik `/ws`-duration + cloudflared close events already in Loki for full per-drop attribution without a repro. **Empirical (2026-06-11):** direct-to-t3-serve held one WS 40 min (0 drops) while a real tunnel session cycled 5×/90s → drop originates above t3-serve on the public path, NOT in t3-serve itself; `t3 auth pairing create`+`/api/auth/browser-session` works, and dispatch **auto-pair was re-verified healthy on the live pin 2026-06-16** (cookieless `X-authentik-username` → 302 + `t3_session`) — the earlier transient 401 note no longer reproduces, and the new dispatch pairing logs + `T3PairingBroken`/`T3PairFallbackHigh` Loki alerts now watch pairing continuously. | t3code | +| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. 
`auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`, `t3-safe-restart.sh`, `t3-migrate-idle.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. 
**t3 AUTO-TRACKS the `nightly` npm dist-tag** (Viktor 2026-06-16, reversing the post-2026-06-09 pin; churn risk accepted) — `t3-autoupdate` is a daily GATED tracker that follows `t3@nightly` but gates every bump so a bad build self-heals: downgrade-guard → pre-bump `VACUUM INTO` backup → health-check that SEEDS a copy of a real POPULATED `state.sqlite` to exercise the forward migration + the real mint→exchange→`t3_session` pairing handshake → canary-restart idle instances ONE AT A TIME with per-instance dispatch pairing verify → auto-rollback to last-good + self-freeze on failure (active-agent instances deferred, never killed; last-good in `/var/lib/t3-autoupdate/last-good`). **Deferred instances are drained overnight by `t3-migrate-idle.timer`** (every 20 min 01:00–05:40): it restarts a still-stale `t3-serve@` onto the current binary only when that user's `state.sqlite` shows no in-flight turn (`active_turn_id`) + ≥15 min quiet (`T3_MIGRATE_QUIET_SECONDS`), via the shared `t3-safe-restart.sh` (the same backup→restart→verify→recover helper the canary uses) — fixing the chronic skew where a user busy at every 04:00 window never migrated and saw "Client and server versions differ". The 2026-06-09 outage was the SAME nightly channel WITHOUT these gates. Freeze/revert now: `sudo touch /etc/t3-autoupdate.freeze` (or set `T3_PIN=` to hard-pin); preview a build with `T3_DRY_RUN=1`. Channel via `T3_TRACK` in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). Full ops + manual rollback: `docs/runbooks/t3-version-bump.md`. `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. 
**Drop attribution (2026-06-10):** `t3-probe` Deployment (same ns) holds differential legs — `cloudflare` (full public path via DoH-pinned DNS), `internal` (Traefik LB only), `t3serve` (devvm:3773 direct) — against dispatch's unauthenticated `/probe` carve-out (walloff-guarded); Prometheus job `t3-probe`, alerts `T3ProbeLegDown`/`T3ProbeDropBurst`, runbook `docs/runbooks/t3-drop-attribution.md`. `t3-serve@` units carry memory containment (`MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0/OOMPolicy=continue`) so a runaway agent OOMs alone instead of freezing devvm. **Connection logs (2026-06-11):** `t3-dispatch` logs every `/ws` open/close with `dur_ms` + `cause` (`downstream_closed`=client/CF/Traefik hung up → last-mile; `upstream_closed`=t3-serve closed; `graceful`); devvm journald now ships to Loki via `scripts/devvm-promtail.*` (`{job="devvm-journal"}` + `{job="sshd-devvm"}`), joining Traefik `/ws`-duration + cloudflared close events already in Loki for full per-drop attribution without a repro. **Empirical (2026-06-11):** direct-to-t3-serve held one WS 40 min (0 drops) while a real tunnel session cycled 5×/90s → drop originates above t3-serve on the public path, NOT in t3-serve itself; `t3 auth pairing create`+`/api/auth/browser-session` works, and dispatch **auto-pair was re-verified healthy on the live pin 2026-06-16** (cookieless `X-authentik-username` → 302 + `t3_session`) — the earlier transient 401 note no longer reproduces, and the new dispatch pairing logs + `T3PairingBroken`/`T3PairFallbackHigh` Loki alerts now watch pairing continuously. | t3code | ## Active Use | Service | Description | Stack | @@ -57,7 +57,7 @@ | trading-bot | Event-driven trading with sentiment analysis | trading-bot | | claude-memory | Persistent memory MCP server | claude-memory | | paperless-mcp | Paperless-ngx document search MCP (barryw/PaperlessMCP). Traefik bearer auth via Aetherinox api-token-middleware. 
`auth=none` at ingress; gateway-level bearer enforced by `paperless-mcp/bearer-auth` Middleware CRD. Tokens + paperless API token in Vault `secret/paperless-mcp`. | paperless-mcp | -| council-complaints | Islington civic reporting pilot | council-complaints | +| paperless-ai | AI layer over Paperless-ngx (clusterzx/paperless-ai): semantic/RAG document search (Chat) + auto-tagging. Local embeddings (sentence-transformers MiniLM) + ChromaDB on the PVC — search is GPU-free. LLM (chat answers + tagging) via in-cluster llama-swap `qwen3-8b` (`SYSTEM_PROMPT=/no_think` to keep Qwen3 output parseable). `auth=required` (Authentik) at `paperless-ai.viktorbarzin.me`. Reads Paperless over the internal svc as a dedicated `paperless-ai` superuser. **Runtime config + app-admin live in the PVC `.env`/SQLite (written once via the app's setup flow), NOT TF env — its dotenv loader does not override `process.env`, so container env shadows the `.env`.** Vault `secret/paperless-ai` (paperless_api_token, api_key, custom_api_key, app_admin_*). | paperless-ai | ## Optional | Service | Description | Stack | @@ -95,7 +95,7 @@ | n8n | Workflow automation | n8n | | real-estate-crawler | Property crawler | real-estate-crawler | | tor-proxy | Tor proxy | tor-proxy | -| forgejo | Git forge | forgejo | +| forgejo | Git forge. Open native self-signup (Turnstile captcha + email confirm) + Authentik & GitHub OAuth sign-in; see `docs/runbooks/forgejo-open-signups.md` | forgejo | | freshrss | RSS reader | freshrss | | navidrome | Music streaming | navidrome | | networking-toolbox | Network tools | networking-toolbox | diff --git a/.claude/skills/home-assistant/SKILL.md b/.claude/skills/home-assistant/SKILL.md index fe761f8c..ab07a27f 100644 --- a/.claude/skills/home-assistant/SKILL.md +++ b/.claude/skills/home-assistant/SKILL.md @@ -11,8 +11,8 @@ description: | There are TWO Home Assistant deployments: ha-london (default) and ha-sofia. Always use Home Assistant for smart home control. 
author: Claude Code -version: 2.0.0 -date: 2026-02-07 +version: 2.1.0 +date: 2026-06-24 --- # Home Assistant Control @@ -44,6 +44,12 @@ There are **two** Home Assistant instances: - Environment variables for each instance: - **ha-london**: `HOME_ASSISTANT_URL` and `HOME_ASSISTANT_TOKEN` - **ha-sofia**: `HOME_ASSISTANT_SOFIA_URL` and `HOME_ASSISTANT_SOFIA_TOKEN` + - If those env vars aren't set (e.g. you're not in the infra repo / Claude venv), don't hand-roll a `kubectl | base64 | jq` token pipeline — use the global **`homelab` CLI** instead (on `$PATH` in any directory): + +## homelab CLI (preferred — works from any directory) +- **Token**: `homelab ha token [--instance sofia|london]` resolves the long-lived API token live from the cluster. Use it directly in curl: `curl -H "Authorization: Bearer $(homelab ha token)" https://ha-sofia.viktorbarzin.me/api/states`. (The `home-assistant-sofia.py` script also auto-falls-back to this when its env var is unset.) +- **Host shell** (ha-sofia): `homelab ha ssh -- ` runs a command on the HA host with deterministic non-interactive ssh (no host-key prompt) — e.g. `homelab ha ssh -- "sudo docker ps"`, `homelab ha ssh -- "cat /config/configuration.yaml"`. Replaces bespoke `ssh -o StrictHostKeyChecking=no …` invocations. +- **Cluster metrics/logs** (not HA-specific): prefer `homelab metrics query ""` / `homelab logs query ""` over hand-rolled `curl …/api/v1/query`, and `homelab claim`/`release` over calling `scripts/presence` directly. ## API Control @@ -389,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr ## ha-london Knowledge Map ### Overview -- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi) +- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied). 
- **Location**: London, UK -- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone) -- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access) -- **Config path**: `/config/` (requires `sudo` for file access) +- **Platform**: Raspberry Pi 4, HA OS +- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs. +- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access) +- **Config path**: `/config/` - **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea - **Zone**: London (home) +### Dashboards (redesigned 2026-06-24) +**Glossary** (HA terms — keep distinct): +- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config. +- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config). +- **Card** = a widget inside a view. + +- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card. + - **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night). + - **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*. +- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.) +- Built via the WS `lovelace/config/save` API (london is remote — no SSH path). + ### Key Systems #### 1. 
Smart Plugs (TP-Link Kasa) — Energy Monitoring @@ -418,10 +437,15 @@ Named plugs with power/energy tracking: - PM1.0/2.5/4.0/10 particulate sensors - VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors -#### 3. Cowboy E-Bike -- `sensor.bike_state_of_charge`: Battery % -- `sensor.bike_total_distance`: Total km -- `sensor.bike_total_co2_saved`: CO2 saved (grams) +#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`) +Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration). +- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`) +- `sensor.classic_performance_remaining_range`: Range km +- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`) +- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`) +- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc. +- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless. +- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`). #### 4. Uptime Monitoring (UptimeRobot) - `sensor.blog`: blog uptime @@ -440,12 +464,17 @@ Named plugs with power/energy tracking: - Scripts: `script.start_netflix`, `script.start_stremio` - Scene: `scene.night` (turns off Livia + Michelle plugs) -### Custom Components -- **cowboy**: Cowboy e-bike integration (HACS) -- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS) +### Custom Components (HACS integrations) +- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. 
The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it. +- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken. + +### HACS frontend cards (plugins) +- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode. ### Integrations -ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB +ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB. +- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy). +- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is. 
### AI / Voice Assistants - 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air @@ -460,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL - Anca arrival/departure notifications - Night scene: turns off Livia + Michelle -### Docker Setup -```bash -docker run -d --name homeassistant --privileged \ - -e TZ=Europe/London \ - -v /home/pi/docker/homeAssistant:/config \ - -v /run/dbus:/run/dbus:ro \ - --network=host --restart=unless-stopped \ - homeassistant/home-assistant:2025.9 -``` +### Platform (HAOS — ignore any legacy `docker run` snippet) +ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker). ### SSH Access ```bash diff --git a/.github/workflows/build-chrome-service-browser.yml b/.github/workflows/build-chrome-service-browser.yml new file mode 100644 index 00000000..9d2129c8 --- /dev/null +++ b/.github/workflows/build-chrome-service-browser.yml @@ -0,0 +1,39 @@ +name: Build chrome-service-browser + +# ADR-0002: infra-owned image built off-infra on GHA → ghcr. Playwright base + +# real Google Chrome (proprietary H.264/AAC codecs) for the chrome-service +# browser container, so the noVNC view can play H.264 video (Reels). Rebuilds +# are rare → dispatch + path trigger. NOTE: after the first push, set the ghcr +# package `chrome-service-browser` to PUBLIC (same as chrome-service-novnc) so +# the pod pulls it without credentials. 
+on:
+  push:
+    branches: [master]
+    paths:
+      - 'stacks/chrome-service/files/chrome/**'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+  packages: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - uses: docker/build-push-action@v6
+        with:
+          context: stacks/chrome-service/files/chrome
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/chrome-service-browser:latest
+            ghcr.io/viktorbarzin/chrome-service-browser:${{ github.sha }}
diff --git a/.gitignore b/.gitignore
index 620d5a97..b288aed5 100755
--- a/.gitignore
+++ b/.gitignore
@@ -110,3 +110,9 @@ terraform.tfstate.backup
 # Timestamped terraform state backups (terraform.tfstate..backup) — plaintext Tier-0
 # secrets; created by terraform state ops. The patterns above miss the timestamped form.
 terraform.tfstate.*.backup
+
+# Python test artifacts (pytest bytecode cache) — e.g. from
+# stacks/k8s-version-upgrade/scripts/test_compat_gate.py
+__pycache__/
+*.pyc
+.pytest_cache/
diff --git a/.woodpecker/default.yml b/.woodpecker/default.yml
index 95fcbd80..ef94ccee 100644
--- a/.woodpecker/default.yml
+++ b/.woodpecker/default.yml
@@ -19,6 +19,7 @@ clone:
   git:
     image: woodpeckerci/plugin-git
     settings:
+      partial: false
       depth: 2
       attempts: 5
       backoff: 10s
diff --git a/.woodpecker/drift-detection.yml b/.woodpecker/drift-detection.yml
index 5851bc16..b2e303ff 100644
--- a/.woodpecker/drift-detection.yml
+++ b/.woodpecker/drift-detection.yml
@@ -9,6 +9,7 @@ clone:
   git:
     image: woodpeckerci/plugin-git
     settings:
+      partial: false
       depth: 1
       attempts: 3
 
diff --git a/.woodpecker/issue-automation.yml b/.woodpecker/issue-automation.yml
index ece97dab..2bb46661 100644
--- a/.woodpecker/issue-automation.yml
+++ b/.woodpecker/issue-automation.yml
@@ -5,6 +5,7 @@ clone:
   git:
     image: woodpeckerci/plugin-git
     settings:
+      partial: false
       depth: 2
 
 steps:
diff --git a/.woodpecker/postmortem-todos.yml b/.woodpecker/postmortem-todos.yml
index 729e9a85..68330272 100644
--- a/.woodpecker/postmortem-todos.yml
+++ b/.woodpecker/postmortem-todos.yml
@@ -11,6 +11,7 @@ clone:
   git:
     image: woodpeckerci/plugin-git
     settings:
+      partial: false
       depth: 5
 
 steps:
diff --git a/.woodpecker/provision-user.yml b/.woodpecker/provision-user.yml
index 0f6d5dab..3ba7af7f 100644
--- a/.woodpecker/provision-user.yml
+++ b/.woodpecker/provision-user.yml
@@ -5,6 +5,7 @@ clone:
   git:
     image: woodpeckerci/plugin-git
     settings:
+      partial: false
       attempts: 5
       backoff: 10s
 
diff --git a/.woodpecker/pve-nfs-exports-sync.yml b/.woodpecker/pve-nfs-exports-sync.yml
index 2c26df45..54aea68a 100644
--- a/.woodpecker/pve-nfs-exports-sync.yml
+++ b/.woodpecker/pve-nfs-exports-sync.yml
@@ -23,6 +23,7 @@ clone:
   git:
     image: woodpeckerci/plugin-git
     settings:
+      partial: false
       depth: 1
       attempts: 3
 
diff --git a/.woodpecker/registry-config-sync.yml b/.woodpecker/registry-config-sync.yml
index a4f03185..aad59fbe 100644
--- a/.woodpecker/registry-config-sync.yml
+++ b/.woodpecker/registry-config-sync.yml
@@ -38,6 +38,7 @@ clone:
   git:
     image: woodpeckerci/plugin-git
     settings:
+      partial: false
       depth: 1
       attempts: 3
 
diff --git a/.woodpecker/renew-tls.yml b/.woodpecker/renew-tls.yml
index d2d8bf89..cd93fe7c 100644
--- a/.woodpecker/renew-tls.yml
+++ b/.woodpecker/renew-tls.yml
@@ -6,6 +6,7 @@ clone:
   git:
     image: woodpeckerci/plugin-git
     settings:
+      partial: false
       attempts: 5
       backoff: 10s
 
diff --git a/AGENTS.md b/AGENTS.md
index 797ed5df..7fbc838d 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -9,7 +9,7 @@
 - **Ask before `git push`** — always confirm with the user first
 
 ## Execution
-- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets)
+- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets; passes `-lock-timeout`, default `5m` / `TG_LOCK_TIMEOUT`, so a contended state lock waits instead of failing with `Error acquiring the state lock`)
 - **Legacy apply**: `cd stacks/ && terragrunt apply --non-interactive` (uses terraform.tfvars)
 - **kubectl**: `kubectl --kubeconfig $(pwd)/config`
 - **Health check**: `bash scripts/cluster_healthcheck.sh --quiet`
@@ -289,6 +289,7 @@ curl -X POST -H "Authorization: token $TOK" -H 'Content-Type: application/json'
 ```
 
 ## Common Operations
+- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply ` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release :`, `homelab work start|land|clean ` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`).
Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status ` (read; `` defaults to the namespace, target to `deploy/`), `homelab k8s db [--mysql] -- ""`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). Memory (v0.3): `homelab memory recall ""` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait / [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check [path]` (external-CF vs internal-LB reachability), `dns lookup ` (Technitium vs public diff), `metrics query ""` / `metrics alerts` (Prometheus via LB), `logs query "" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Usage telemetry (v0.6): every dispatched verb fire-and-forgets a Loki line (`{user,verb}` + exit only, NO args/secrets; opt-out `HOMELAB_TELEMETRY=0`); `homelab usage top [--since][--user]` ranks verb usage across all users — evidence for what to build next, queryable without reading anyone's home. Home Assistant (v0.7): `homelab ha token [--instance sofia|london]` (prints the long-lived API token, resolved live from k8s Secret `openclaw/openclaw-secrets` — use as `curl -H "Authorization: Bearer $(homelab ha token)"`), `homelab ha ssh [--instance sofia|london] -- ` (run a command on the HA host; deterministic non-interactive ssh, the invoking user's `~/.ssh/id_ed25519`, sofia=`vbarzin@192.168.1.8` default) — entity state/control stays with the `ha` MCP, these cover only what an API-only MCP can't (token + host shell). Full docs: `cli/README.md`. - **Deploy new service**: Use `stacks//` as template. Create stack, add DNS in tfvars, apply platform then service. 
- **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts. - **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n `. Increase `resources.limits.memory` in the stack's main.tf. diff --git a/CONTEXT.md b/CONTEXT.md index d700f9ab..2b9bb8b3 100644 --- a/CONTEXT.md +++ b/CONTEXT.md @@ -117,9 +117,17 @@ The bare-metal load-balancer that assigns external IPs to `type=LoadBalancer` Se _Avoid_: calling `.200` "the cluster IP" or assuming all ingress shares one LB IP. **Calico**: -The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred). +The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs; live flow observability via **Goldmane / Whisker**). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred). _Avoid_: "firewall" (it's pod-level policy, not a perimeter); conflating a Calico **NetworkPolicy** (enforced in the data path) with a **Kyverno policy** (enforced at admission) — different layers. +**Service identity**: +How a **Service** is named in flow/audit data — its **namespace** is the primary identity (Goldmane stamps it natively, and "one Service ≈ one namespace" holds for ~87 namespaces), refined by an explicit identity label (e.g. `service-identity`) only in the handful of genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). 
Deliberately NOT a per-Service **ServiceAccount** (deferred — 56% of pods share `default`; revisit only if principal-based enforcement or mTLS is adopted) and NOT a SPIFFE/mesh identity (rejected — attribution-grade audit on a trusted single-tenant cluster doesn't justify a mesh). +_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object. + +**Goldmane / Whisker**: +Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). Durable history requires emitting Goldmane flows to **Loki**; the in-memory buffer alone is not an audit trail. +_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh). + ### Storage **proxmox-lvm-encrypted**: diff --git a/cli/README.md b/cli/README.md index 48b83c93..186c1ee5 100644 --- a/cli/README.md +++ b/cli/README.md @@ -1,2 +1,224 @@ -# What is this? -This is a CLI to manipulate files in the terraform repo and commit and push them +# homelab + +`homelab` is the unified, agent-facing CLI for operating this homelab — one +composable, JSON-capable surface for the operations agents run over and over, +discovered progressively at runtime. It is grown **in place** from this +directory (the former `infra-cli`), and the legacy webhook use-cases still work +(see below). 
+
+It encodes *actions*, never *judgment*: methodology (debugging, TDD, review) and
+third-party/owned MCP servers (e.g. phpIPAM) are deliberately out of scope.
+
+## Usage
+
+```
+homelab [args]
+homelab manifest [--json] # list every verb + its read/write tier (discovery entrypoint)
+homelab version
+```
+
+### v0.1 verbs — the infra inner-loop
+
+| Command | Tier | What it does |
+|---|---|---|
+| `claim : --purpose "…"` | write | claim a shared resource on the presence board (wraps `scripts/presence`) |
+| `release :` | write | release a presence claim |
+| `tf plan ` | read | `scripts/tg plan` for a stack (resolved from cwd) |
+| `tf validate ` | read | `scripts/tg validate` |
+| `tf fmt ` | read | `terraform fmt -recursive` on the stack |
+| `tf force-unlock ` | write | release a stuck state lock |
+| `tf apply ` | write | `scripts/tg apply` — auto-claims `stack:`, always releases, warns it's out-of-band |
+| `work start ` | write | create `.worktrees/` on `/` off `/master`; enter with native `EnterWorktree` |
+| `work land [--verify-cmd "…"] [--no-verify]` | write | merge master in → verify → push `HEAD:master` (non-ff retry; PR fallback) |
+| `work clean ` | write | remove a task's worktree + branch (run from the main checkout) |
+
+### v0.2 verbs — Kubernetes
+
+Built on an **app→namespace→pod resolver**: `` defaults to the namespace
+(most namespaces hold one app); the target defaults to `deploy/` and lets
+kubectl resolve the pod. Override with `-n`/`--pod`/`-c`/`-l`/`--tty`. Uses the
+ambient kubeconfig.
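As a rough illustration of the resolver defaults described above, the sketch below maps an app name to a namespace and a kubectl target. It is written in Python for brevity and is not the CLI's actual code. The names `resolve_target` and `kubectl_logs_argv`, and the way a `--pod` override turns into `pod/<name>`, are assumptions made for the example.

```python
from typing import List, NamedTuple, Optional


class Target(NamedTuple):
    namespace: str
    resource: str  # what gets handed to kubectl, e.g. "deploy/forgejo"


def resolve_target(app: str, namespace: Optional[str] = None,
                   pod: Optional[str] = None) -> Target:
    """Apply the documented defaults; explicit overrides win."""
    ns = namespace or app  # the app name doubles as the namespace
    # Assumption: an explicit pod override replaces the deploy/<app> target.
    resource = f"pod/{pod}" if pod else f"deploy/{app}"
    return Target(ns, resource)


def kubectl_logs_argv(app: str, namespace: Optional[str] = None,
                      pod: Optional[str] = None, tail: int = 200) -> List[str]:
    """Build (but do not run) the kubectl command for a logs request."""
    target = resolve_target(app, namespace, pod)
    return ["kubectl", "-n", target.namespace, "logs",
            target.resource, f"--tail={tail}"]
```

With no overrides, `kubectl_logs_argv("forgejo")` yields a command against `deploy/forgejo` in the `forgejo` namespace, which is the "one app per namespace" default the text relies on.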
+
+| Command | Tier | What it does |
+|---|---|---|
+| `k8s status [ns]` | read | pods (wide) + recent non-Normal events (`-A` if no ns) |
+| `k8s get […]` | read | `kubectl -n get …` passthrough |
+| `k8s logs ` | read | logs for `deploy/` (`--tail` default 200; `-c`/`--previous`/`--since`/`-l`) |
+| `k8s describe [resource]` | read | describe the deployment (or an explicit resource) |
+| `k8s debug ` | read | one-shot triage: pods + workloads + describe + recent logs + events |
+| `k8s pf [target]` | read | port-forward to `svc/` (or an explicit target) |
+| `k8s rollout-status ` | read | `rollout status deploy/` |
+| `k8s db [--mysql] [--db N] -- ""` | write | exec into the dbaas DB (PG `pg-cluster-rw`, or MySQL with env-password wrapper) |
+| `k8s exec [--tty] -- ` | write | exec in the app's pod |
+| `k8s restart ` | write | `rollout restart deploy/` then wait for status |
+| `k8s rm-pod -n [--job] [--force]` | write | delete a stuck **pod/job only** |
+
+Config-mutation verbs (`apply`/`edit`/`patch`/`scale`/`create`) are intentionally
+**not** exposed — they stay raw `kubectl`, per the Terraform-only policy.
+
+`tf` resolves the stack dir by walking up from cwd to the infra root and
+delegates to `scripts/tg` (which owns state decrypt/encrypt, the Vault lock, and
+the ingress auth-comment check). git-crypt filter flags are auto-injected on git
+operations in the encrypted infra repo.
+
+**`work land` refuses to push when it cannot verify** (no `--verify-cmd` and no
+auto-detected suite) unless you pass `--no-verify` — landing to master unverified
+must be deliberate. After pushing it **watches CI to green** (`ci watch` on the
+landed commit) and fails if the pipeline does; pass `--no-ci-watch` to skip.
+
+Tiers are recorded per verb so a future PreToolUse classifier can auto-allow
+reads / prompt writes; v0.1 allows everything and relies on existing gates
+(permission mode, presence claims, plan approval).
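The tier-based classifier mentioned above does not exist yet, so the following is only a sketch of how such a gate might behave, written in Python. The `VERB_TIERS` table is hand-copied from a few rows of the tables above, and `decide` is a hypothetical name. A real classifier would presumably load the tiers from `homelab manifest --json`, whose schema is not documented here.

```python
# Hypothetical tier table: a handful of verbs taken from the tables above.
VERB_TIERS = {
    "tf plan": "read",
    "tf apply": "write",
    "k8s logs": "read",
    "k8s restart": "write",
}


def decide(verb: str, tiers: dict = VERB_TIERS) -> str:
    """Return "allow" for read-tier verbs and "prompt" for everything else."""
    if tiers.get(verb) == "read":
        return "allow"
    # Writes, and verbs missing from the table, fall through to a prompt,
    # so an unknown verb is never auto-allowed.
    return "prompt"
```

Treating unknown verbs as writes keeps the sketch fail-closed, which matches the cautious stance the section takes toward write operations.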
+
+### v0.3 verbs — memory
+
+A thin HTTP client over the **claude-memory** service (the same backend the
+memory MCP wraps), authed with `CLAUDE_MEMORY_API_KEY` against
+`CLAUDE_MEMORY_API_URL` (the env the hooks already set; defaults to the
+ingress). Because it hits the HTTP API directly, it **works even when the MCP
+frontend is down**.
+
+| Command | Tier | What it does |
+|---|---|---|
+| `memory recall "" [--query --category --sort --limit]` | read | semantic search (server-side ranking) — the navigate workhorse |
+| `memory list [--category --tag --limit]` | read | recent memories |
+| `memory categories` / `memory tags` / `memory stats` | read | enumerate the store |
+| `memory secret ` | read | reveal a sensitive memory's content |
+| `memory store "" [--category --tags --keywords --importance --sensitive]` | write | store a memory |
+| `memory update [--content --tags --importance]` | write | edit a memory |
+| `memory delete ` | write | delete a memory |
+
+All read/write paths are validated against the live API (incl. a
+store→recall→delete round-trip). This gives full data-plane parity with the MCP;
+the eventual deprecation (rewiring the per-prompt auto-recall + auto-learn hooks
+to the CLI, then uninstalling the MCP) is a **separate, deliberate follow-up** —
+see `docs/adr/0008`.
+
+### v0.4 verbs — ci / deploy
+
+Watch what you trigger, without hand-rolling Woodpecker/kubectl polling. `ci`
+talks to the Woodpecker API (token from `WOODPECKER_TOKEN` or Vault
+`secret/ci/global`) via the internal Traefik LB, resolving the repo from the cwd
+remote, with retries that ride Woodpecker's intermittent empty responses.
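The "retries that ride intermittent empty responses" idea can be sketched as a small wrapper that calls a fetch function until the body is non-empty. This is an illustrative Python sketch, not the CLI's implementation. `fetch_with_retry`, the attempt count, and the linear backoff are all assumptions, and the HTTP call itself is injected so the example makes no claims about the Woodpecker API.

```python
import time
from typing import Callable


def fetch_with_retry(fetch: Callable[[], str], attempts: int = 5,
                     delay: float = 0.5,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch() until it returns a non-empty body or attempts run out."""
    for attempt in range(1, attempts + 1):
        body = fetch()
        if body.strip():
            return body
        if attempt < attempts:
            # Linear backoff: wait a little longer after each empty reply.
            sleep(delay * attempt)
    raise RuntimeError(f"empty response after {attempts} attempts")
```

Passing `sleep` as a parameter keeps the wrapper testable without real waiting. A caller would supply its own `fetch` closure that performs the authenticated request.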
+ +| Command | Tier | What it does | +|---|---|---| +| `ci status [commit]` | read | pipeline status for HEAD (or a commit) | +| `ci watch [commit]` | read | poll the pipeline to terminal; exit non-zero on failure | +| `deploy wait / [--sha SHA]` | read | wait for the deployment image to match the sha, *then* rollout status (rollout status alone lies on the old ReplicaSet) | + +`work land` now calls `ci watch` on the landed commit automatically (skip with +`--no-ci-watch`), closing the v0.1 "doesn't wait for CI" gap. `ci logs` (failing +step) is deferred to v0.4.1 — Woodpecker's per-pipeline detail/log endpoints were +the least reliable; `status`/`watch` use the list endpoint that works. + +### v0.5 verbs — net / dns / metrics / logs + +Reachability + observability probes. Their value is *endpoint resolution* — the +non-obvious "which host, public or LB, what auth, what URL shape" reasoning you'd +otherwise re-derive every time — not the HTTP call itself. All reach internal +ingresses through the Traefik LB (the Go form of `curl --resolve host:443:10.0.20.203`). + +| Command | Tier | What it does | +|---|---|---| +| `net check [path]` | read | probes the host two ways — external (public DNS → Cloudflare) vs internal (Traefik LB) — with status + latency, so you can tell *where* a break is (CF? app? the LB path?) | +| `dns lookup [type]` | read | resolves via Technitium (`10.0.20.201`) and public (`1.1.1.1`), diffed — surfaces split-horizon vs propagation gaps | +| `metrics query ""` | read | Prometheus instant query (`prometheus-query.viktorbarzin.lan`); prints `value {labels}` or `--json` | +| `metrics alerts` | read | currently-firing alerts (via the synthetic `ALERTS` series — the query frontend has no `/api/v1/alerts`) | +| `logs query "" [--since 1h] [--limit N]` | read | Loki range query (`loki.viktorbarzin.lan`); prints log lines or `--json` | + +Quote the PromQL/LogQL. These hit auth-free internal ingresses — no port-forward, +no kubectl. 
(In-cluster-only endpoints like Alertmanager stay out of scope; the +firing set is reachable via `ALERTS` instead.) + +### v0.6 — usage telemetry (`usage top`) + +Makes "which verbs are actually used, by everyone" a query instead of a guess — +so adding the *next* verb is evidence-driven, not shaped by one person's habits. + +Every dispatched verb emits one fire-and-forget Loki line: `{job, user, verb}` +labels + `exit=N ver=X` — **only the verb path and exit code, never args, paths, +flags, or secrets.** It's best-effort (tight timeout, errors swallowed, never +affects the command) and opt-out via `HOMELAB_TELEMETRY=0`. Because the sink is +the shared Loki, aggregate usage is queryable **without reading anyone's home** — +the privacy-preserving answer to "what does the team use." + +| Command | Tier | What it does | +|---|---|---| +| `usage top [--since 30d] [--user U] [--json]` | read | rank verbs by invocation count across all users (or one), via `sum by (verb) (count_over_time({job="homelab-usage"}[…]))` | + +### v0.7 verbs — Home Assistant + +Cover exactly the two things the `ha` **MCP server can't**: resolving the +long-lived API token out of the cluster, and SSH to the HA host for host-level +work (config files, docker, add-ons). Entity state and control (`turn_on`, +`get_state`, services) stay with the MCP — *actions an MCP already encodes are +out of scope* (see top of this doc). The value here is the same as `net`/`dns`: +the non-obvious *which secret, which host, which key, which flags* you'd +otherwise re-derive every session — agents were hand-rolling a +`kubectl | base64 | jq` token pipeline and a bespoke `ssh -o …` invocation on +every run because the existing `home-assistant-sofia.py` needs an env var set +and a cwd-relative path, neither of which holds in an arbitrary session. 
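For reference, the hand-rolled `kubectl | base64 | jq` token pipeline mentioned above amounts to something like the sketch below. It is an illustration only, not the implementation of `ha token`: the function name, the sample Secret, and the assumption that the key is named after the instance (`sofia`) are hypothetical. The `data` map of base64-encoded values is the standard JSON shape of a Kubernetes Secret.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// tokenFromSecretJSON extracts and decodes one key from the JSON form of a
// Kubernetes Secret (as printed by `kubectl get secret -o json`).
// Hypothetical helper for illustration only.
func tokenFromSecretJSON(secretJSON []byte, key string) (string, error) {
	var s struct {
		Data map[string]string `json:"data"`
	}
	if err := json.Unmarshal(secretJSON, &s); err != nil {
		return "", fmt.Errorf("parse secret: %w", err)
	}
	enc, ok := s.Data[key]
	if !ok {
		return "", fmt.Errorf("secret has no key %q", key)
	}
	raw, err := base64.StdEncoding.DecodeString(enc)
	if err != nil {
		return "", fmt.Errorf("decode key %q: %w", key, err)
	}
	return string(raw), nil
}

func main() {
	// Fabricated sample: "ZHVtbXktdG9rZW4=" is base64 for "dummy-token".
	sample := []byte(`{"data":{"sofia":"ZHVtbXktdG9rZW4="}}`)
	tok, err := tokenFromSecretJSON(sample, "sofia")
	fmt.Println(tok, err)
}
```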
+ +| Command | Tier | What it does | +|---|---|---| +| `ha token [--instance sofia\|london]` | read | print the long-lived HA API token, resolved live from the dedicated k8s Secret `openclaw/ha-tokens` (key per instance) via the ambient kubeconfig — no pre-set env var. Use as `curl -H "Authorization: Bearer $(homelab ha token)" …`. The secret is a least-privilege carve-out (`stacks/openclaw/ha_tokens.tf`): the `Home Server Admins` group can read *just* it, so non-admin operators get the HA token without the rest of `skill_secrets` (slack webhook, uptime-kuma password) | +| `ha ssh [--instance sofia\|london] [-i KEY] -- ` | write | run `` on the HA host over ssh with deterministic non-interactive flags (explicit key = the invoking user's `~/.ssh/id_ed25519`, no user ssh-config, no known_hosts prompt). sofia (`vbarzin@192.168.1.8`) is reachable from the devvm LAN; london is documented but generally remote | + +`--instance` defaults to **sofia** (the devvm shares the Sofia LAN). `ha token` +prints the bare token to stdout so it composes in `$(…)`; it's read-tier like +`memory secret`. `ha ssh` resolves the *invoking user's* key, so it's per-user, +not tied to whoever first wrote the workflow (the user's key must be enrolled on +the HA host). + +### v0.8 verbs — browser (headful anti-bot automation) + +Drive the cluster's **headful** Chrome (`chrome-service`, real Chrome under Xvfb) +from the devvm over CDP, for sites that detect and block headless automation. The +headless `@playwright/mcp` browser can *load* such a site and fill its forms, but +the gated action (submit/login) silently fails — the motivating case was the +Stirling Ackroyd Fixflo tenant portal, whose pre-submit check returned +`net::ERR_FILE_NOT_FOUND` and hung. This path connects via `connect_over_cdp`, +injects the same `stealth.js` the in-cluster callers use, and submits first try. 
+ +The command owns only the *mechanics* (port-forward, stealth, lifecycle); the +agent supplies the Playwright script — judgment stays out of the CLI. + +| Command | Tier | What it does | +|---|---|---| +| `browser run [--url U] [--shared-context] [--keep-open] [--port N] [--timeout S]` | write | port-forward `svc/chrome-service:9222`, assert it's a real (non-headless) Chrome via `/json/version`, `connect_over_cdp`, `addInitScript(stealth.js)`, then run the script with `page`/`context`/`browser`/`log` in scope (top-level await ok; return a value to print it). Always tears the forward down. | +| `browser open [--shared-context] [--timeout S]` | write | open `` headful and print title + visible text + a screenshot path — a quick check. | +| `browser --help` | read | when-to-use signature + the error-code cheat-sheet (`ERR_FILE_NOT_FOUND` = automation-layer intercept, not egress; `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED` = real egress; one endpoint 500 while siblings 200 = bot rejection). | + +Default context is a **fresh incognito** one (closed on exit) — safe for the +shared browser and concurrent callers (e.g. tripit's fare scrape); `--shared-context` +reuses the warmed persistent profile when a pre-logged-in session is needed. +`port-forward` tunnels API-server→pod, so it bypasses the `:9222` NetworkPolicy +that gates in-cluster callers — no namespace label needed. The node CDP client is +pinned to **`playwright-core@1.48.2`** to match the chrome-service image minor +(Chromium 130; protocol changes between minors) and is installed once, lazily, +into `~/.cache/homelab/browser-client/` (no per-user setup). Because the client +runs on the devvm, `setInputFiles` streams local files to the remote browser over +CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md` +and `docs/adr/0013`. 
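The error-code cheat-sheet from `browser --help` can be read as a small decision function. The sketch below is one hedged way to encode it: `classifyNetError` and `looksLikeBotRejection` are hypothetical names that do not exist in the CLI, and the returned labels simply paraphrase the cheat-sheet wording.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyNetError maps a Chrome net:: error string to the diagnosis given in
// the cheat-sheet. Hypothetical helper for illustration only.
func classifyNetError(msg string) string {
	switch {
	case strings.Contains(msg, "ERR_FILE_NOT_FOUND"):
		// Request resolved locally by the automation layer, not an egress problem.
		return "automation-layer intercept"
	case strings.Contains(msg, "ERR_CONNECTION_REFUSED"),
		strings.Contains(msg, "ERR_TIMED_OUT"),
		strings.Contains(msg, "ERR_NAME_NOT_RESOLVED"):
		return "real egress failure"
	default:
		return "unclassified"
	}
}

// looksLikeBotRejection reports whether exactly one endpoint returned 500 while
// every other endpoint (at least one) returned 200.
func looksLikeBotRejection(statusByEndpoint map[string]int) bool {
	failures, successes := 0, 0
	for _, code := range statusByEndpoint {
		switch code {
		case 500:
			failures++
		case 200:
			successes++
		default:
			return false
		}
	}
	return failures == 1 && successes >= 1
}

func main() {
	fmt.Println(classifyNetError("net::ERR_FILE_NOT_FOUND at https://portal.example.com/check"))
	fmt.Println(looksLikeBotRejection(map[string]int{"/a": 200, "/submit": 500, "/b": 200}))
}
```

Either signal points at headless detection rather than a network fault, which is the case `browser run` is meant for.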
+ +## Build / install + +Built from source to `/usr/local/bin/homelab` during devvm provisioning +(`scripts/workstation/setup-devvm.sh`, the `t3-dispatch` pattern); version is +stamped from `cli/VERSION` via ldflags. Manual build: + +``` +cd cli && go build -ldflags "-X main.version=$(cat VERSION)" -o /usr/local/bin/homelab . +go test ./... +``` + +## Legacy webhook use-cases (preserved) + +This binary is also the in-cluster `infra-cli` image. Invocations starting with +`-use-case=` fall through to the +original flag-based path unchanged, so the webhook handler is unaffected. + +## Design + +See `infra/docs/adr/0004`–`0013` for the architecture decisions. diff --git a/cli/VERSION b/cli/VERSION new file mode 100644 index 00000000..85f7059b --- /dev/null +++ b/cli/VERSION @@ -0,0 +1 @@ +v0.8.1 diff --git a/cli/browser.go b/cli/browser.go new file mode 100644 index 00000000..39b6b0a0 --- /dev/null +++ b/cli/browser.go @@ -0,0 +1,388 @@ +package main + +import ( + _ "embed" + "encoding/json" + "fmt" + "io" + "net" + "net/http" + "os" + "os/exec" + "os/signal" + "path/filepath" + "strconv" + "strings" + "sync" + "syscall" + "time" +) + +// playwrightVersion pins the node CDP client to the chrome-service image minor +// (mcr.microsoft.com/playwright:v1.48.0-noble → Chromium 130). connect_over_cdp +// speaks the browser's CDP, so the client minor must track the server minor; +// see docs/architecture/chrome-service.md "Image pin". +const playwrightVersion = "1.48.2" + +// defaultBrowserTimeout is how long (seconds) to wait for the port-forwarded CDP +// endpoint to become ready before giving up. +const defaultBrowserTimeout = 60 + +const ( + chromeServiceNamespace = "chrome-service" + chromeServiceName = "chrome-service" + chromeServiceCDPPort = 9222 +) + +// stealthJS is vendored verbatim from stacks/chrome-service/files/stealth.js (the +// source of truth the in-cluster callers use). TestStealthJSEmbeddedMatchesCanonical +// guards against drift. 
+// +//go:embed browser_stealth.js +var stealthJS string + +// runnerJS is the node wrapper that connects to the port-forwarded CDP endpoint, +// installs the stealth init script, and runs the user's Playwright script. +// +//go:embed browser_runner.js +var runnerJS string + +// browserOpts is the parsed form of `homelab browser run|open` arguments. +type browserOpts struct { + mode string // "run" | "open" + script string // path to the user Playwright script (run mode) + url string // initial URL (run: optional; open: required positional) + sharedCtx bool // use the warmed persistent profile instead of a fresh context + keepOpen bool // leave the created context/pages open on exit + port int // explicit local port for the forward (0 = auto) + timeout int // CDP readiness timeout, seconds + help bool +} + +// parseBrowserArgs parses the args after `browser run` / `browser open`. +func parseBrowserArgs(mode string, args []string) (browserOpts, error) { + o := browserOpts{mode: mode, timeout: defaultBrowserTimeout} + var positionals []string + atoi := func(s, flag string) (int, error) { + n, err := strconv.Atoi(s) + if err != nil { + return 0, fmt.Errorf("%s expects an integer, got %q", flag, s) + } + return n, nil + } + for i := 0; i < len(args); i++ { + a := args[i] + switch { + case a == "-h" || a == "--help": + o.help = true + case a == "--shared-context": + o.sharedCtx = true + case a == "--keep-open": + o.keepOpen = true + case a == "--url": + if i+1 < len(args) { + o.url = args[i+1] + i++ + } + case strings.HasPrefix(a, "--url="): + o.url = strings.TrimPrefix(a, "--url=") + case a == "--port": + if i+1 < len(args) { + n, err := atoi(args[i+1], "--port") + if err != nil { + return o, err + } + o.port = n + i++ + } + case strings.HasPrefix(a, "--port="): + n, err := atoi(strings.TrimPrefix(a, "--port="), "--port") + if err != nil { + return o, err + } + o.port = n + case a == "--timeout": + if i+1 < len(args) { + n, err := atoi(args[i+1], "--timeout") + if err 
!= nil { + return o, err + } + o.timeout = n + i++ + } + case strings.HasPrefix(a, "--timeout="): + n, err := atoi(strings.TrimPrefix(a, "--timeout="), "--timeout") + if err != nil { + return o, err + } + o.timeout = n + case strings.HasPrefix(a, "-"): + return o, fmt.Errorf("unknown flag %q (try: homelab browser --help)", a) + default: + positionals = append(positionals, a) + } + } + if o.help { + return o, nil + } + switch mode { + case "run": + if len(positionals) == 0 { + return o, fmt.Errorf("usage: homelab browser run [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]") + } + o.script = positionals[0] + case "open": + if len(positionals) == 0 { + return o, fmt.Errorf("usage: homelab browser open [--shared-context] [--timeout S]") + } + o.url = positionals[0] + } + return o, nil +} + +// cdpHealthy parses a CDP /json/version body and reports whether the endpoint is +// a real (non-headless) Chrome — the entire reason chrome-service exists. +func cdpHealthy(jsonBody []byte) (browser string, healthy bool, err error) { + var v struct { + Browser string `json:"Browser"` + UserAgent string `json:"User-Agent"` + } + if e := json.Unmarshal(jsonBody, &v); e != nil { + return "", false, fmt.Errorf("parse /json/version: %w", e) + } + if v.Browser == "" { + return "", false, fmt.Errorf("/json/version had no Browser field") + } + healthy = strings.HasPrefix(v.Browser, "Chrome/") && + !strings.Contains(v.Browser, "Headless") && + !strings.Contains(v.UserAgent, "Headless") + return v.Browser, healthy, nil +} + +// buildPortForwardArgs is the kubectl invocation that exposes chrome-service's +// CDP locally. port-forward tunnels API-server→pod, so it bypasses the :9222 +// NetworkPolicy that gates in-cluster callers. 
+func buildPortForwardArgs(localPort int) []string { + return []string{"-n", chromeServiceNamespace, "port-forward", + "svc/" + chromeServiceName, fmt.Sprintf("%d:%d", localPort, chromeServiceCDPPort)} +} + +// browserClientPackageJSON is the auto-managed manifest for the pinned node CDP +// client kept under the user cache dir. +func browserClientPackageJSON() string { + return fmt.Sprintf(`{ + "name": "homelab-browser-client", + "private": true, + "description": "Pinned CDP client for 'homelab browser' — auto-managed, do not edit.", + "dependencies": { + "playwright-core": "%s" + } +} +`, playwrightVersion) +} + +// freePort asks the kernel for an unused ephemeral TCP port. +func freePort() (int, error) { + l, err := net.Listen("tcp", "127.0.0.1:0") + if err != nil { + return 0, err + } + defer l.Close() + return l.Addr().(*net.TCPAddr).Port, nil +} + +// browserClientDir is where the pinned node client + managed runner files live. +func browserClientDir() (string, error) { + cache, err := os.UserCacheDir() + if err != nil || cache == "" { + home, herr := os.UserHomeDir() + if herr != nil { + return "", fmt.Errorf("locate cache dir: %v / %v", err, herr) + } + cache = filepath.Join(home, ".cache") + } + return filepath.Join(cache, "homelab", "browser-client"), nil +} + +// installedPlaywrightVersion reads the version of the playwright-core already +// installed in dir, or "" if absent/unreadable. +func installedPlaywrightVersion(dir string) string { + b, err := os.ReadFile(filepath.Join(dir, "node_modules", "playwright-core", "package.json")) + if err != nil { + return "" + } + var v struct { + Version string `json:"version"` + } + if json.Unmarshal(b, &v) != nil { + return "" + } + return v.Version +} + +// ensureBrowserClient writes the managed runner/stealth/package files into dir +// and lazily installs the pinned playwright-core (only when missing/mismatched), +// so no per-user setup is needed and the client tracks the binary version. 
+func ensureBrowserClient(dir string) error { + if err := os.MkdirAll(dir, 0o755); err != nil { + return err + } + files := map[string]string{ + "package.json": browserClientPackageJSON(), + "browser_runner.js": runnerJS, + "stealth.js": stealthJS, + } + for name, content := range files { + if err := os.WriteFile(filepath.Join(dir, name), []byte(content), 0o644); err != nil { + return err + } + } + if installedPlaywrightVersion(dir) == playwrightVersion { + return nil + } + fmt.Fprintf(os.Stderr, "homelab browser: installing pinned playwright-core@%s (one-time, ~a few seconds)…\n", playwrightVersion) + cmd := exec.Command("npm", "install", "--no-audit", "--no-fund", "--silent") + cmd.Dir = dir + cmd.Stdout = os.Stderr + cmd.Stderr = os.Stderr + if err := cmd.Run(); err != nil { + return fmt.Errorf("npm install playwright-core@%s in %s: %w (is node/npm installed?)", playwrightVersion, dir, err) + } + if got := installedPlaywrightVersion(dir); got != playwrightVersion { + return fmt.Errorf("playwright-core install mismatch in %s: want %s, got %q", dir, playwrightVersion, got) + } + return nil +} + +// waitForCDP polls the local CDP endpoint until it answers as a healthy +// (non-headless) Chrome, or the timeout elapses. 
+func waitForCDP(cdpURL string, timeout time.Duration) (string, error) { + deadline := time.Now().Add(timeout) + client := &http.Client{Timeout: 3 * time.Second} + var lastErr error + for time.Now().Before(deadline) { + resp, err := client.Get(cdpURL + "/json/version") + if err != nil { + lastErr = err + time.Sleep(300 * time.Millisecond) + continue + } + body, _ := io.ReadAll(resp.Body) + resp.Body.Close() + browser, healthy, herr := cdpHealthy(body) + if herr != nil { + lastErr = herr + time.Sleep(300 * time.Millisecond) + continue + } + if !healthy { + return browser, fmt.Errorf("CDP reports %q — expected a non-headless Chrome (wrong target?)", browser) + } + return browser, nil + } + if lastErr == nil { + lastErr = fmt.Errorf("timed out after %s", timeout) + } + return "", lastErr +} + +// runBrowser is the orchestration: pick a port, ensure the pinned client, start +// (and ALWAYS tear down) a CDP port-forward, wait for readiness, then run node. +func runBrowser(o browserOpts) error { + port := o.port + if port == 0 { + p, err := freePort() + if err != nil { + return fmt.Errorf("pick local port: %w", err) + } + port = p + } + + dir, err := browserClientDir() + if err != nil { + return err + } + if err := ensureBrowserClient(dir); err != nil { + return err + } + + // Start the forward in its own process group so the whole tree dies on cleanup. + pf := exec.Command("kubectl", buildPortForwardArgs(port)...) + pf.SysProcAttr = &syscall.SysProcAttr{Setpgid: true} + var pfLog strings.Builder + pf.Stdout = &pfLog + pf.Stderr = &pfLog + if err := pf.Start(); err != nil { + return fmt.Errorf("start kubectl port-forward (kubeconfig set?): %w", err) + } + + var once sync.Once + teardown := func() { + once.Do(func() { + if pf.Process != nil { + _ = syscall.Kill(-pf.Process.Pid, syscall.SIGKILL) + } + _ = pf.Wait() + }) + } + defer teardown() + + // Tear down on Ctrl-C / SIGTERM too, then exit non-zero. 
+ sigCh := make(chan os.Signal, 1) + signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM) + defer signal.Stop(sigCh) + go func() { + if _, ok := <-sigCh; ok { + teardown() + os.Exit(130) + } + }() + + cdpURL := fmt.Sprintf("http://127.0.0.1:%d", port) + browser, err := waitForCDP(cdpURL, time.Duration(o.timeout)*time.Second) + if err != nil { + return fmt.Errorf("chrome-service CDP not ready on %s: %w\n--- port-forward log ---\n%s", cdpURL, err, pfLog.String()) + } + fmt.Fprintf(os.Stderr, "homelab browser: connected to %s via %s\n", browser, cdpURL) + + return runBrowserNode(dir, cdpURL, o) +} + +// runBrowserNode invokes the managed node runner with inputs passed via env. +func runBrowserNode(dir, cdpURL string, o browserOpts) error { + env := append(os.Environ(), + "HOMELAB_CDP_URL="+cdpURL, + "HOMELAB_BROWSER_MODE="+o.mode, + "HOMELAB_STEALTH_PATH="+filepath.Join(dir, "stealth.js"), + "NODE_PATH="+filepath.Join(dir, "node_modules"), + ) + if o.url != "" { + env = append(env, "HOMELAB_BROWSER_URL="+o.url) + } + if o.script != "" { + abs, err := filepath.Abs(o.script) + if err != nil { + return err + } + if _, err := os.Stat(abs); err != nil { + return fmt.Errorf("script %s: %w", o.script, err) + } + env = append(env, "HOMELAB_BROWSER_SCRIPT="+abs) + } + if o.sharedCtx { + env = append(env, "HOMELAB_BROWSER_SHARED=1") + } + if o.keepOpen { + env = append(env, "HOMELAB_BROWSER_KEEP_OPEN=1") + } + if o.mode == "open" { + shot := filepath.Join(os.TempDir(), fmt.Sprintf("homelab-browser-%d.png", os.Getpid())) + env = append(env, "HOMELAB_BROWSER_SCREENSHOT="+shot) + } + cmd := exec.Command("node", filepath.Join(dir, "browser_runner.js")) + cmd.Env = env + cmd.Stdout = os.Stdout + cmd.Stderr = os.Stderr + cmd.Stdin = os.Stdin + return cmd.Run() +} diff --git a/cli/browser_runner.js b/cli/browser_runner.js new file mode 100644 index 00000000..24a2db6b --- /dev/null +++ b/cli/browser_runner.js @@ -0,0 +1,106 @@ +// homelab browser — node CDP runner (auto-managed; 
regenerated each run from the +// homelab binary — DO NOT EDIT here). Connects to the port-forwarded +// chrome-service CDP endpoint, installs the stealth init script, then runs the +// user's Playwright script (run mode) or opens a URL (open mode). All inputs +// arrive via HOMELAB_* env vars set by the Go CLI. +'use strict'; +const fs = require('fs'); +const { chromium } = require('playwright-core'); + +async function main() { + const cdpURL = process.env.HOMELAB_CDP_URL; + if (!cdpURL) throw new Error('HOMELAB_CDP_URL not set'); + const mode = process.env.HOMELAB_BROWSER_MODE || 'run'; + const stealthPath = process.env.HOMELAB_STEALTH_PATH || ''; + const initURL = process.env.HOMELAB_BROWSER_URL || ''; + const scriptPath = process.env.HOMELAB_BROWSER_SCRIPT || ''; + const shared = process.env.HOMELAB_BROWSER_SHARED === '1'; + const keepOpen = process.env.HOMELAB_BROWSER_KEEP_OPEN === '1'; + const screenshotPath = process.env.HOMELAB_BROWSER_SCREENSHOT || ''; + + const browser = await chromium.connectOverCDP(cdpURL); + + // Fresh isolated context by default (safe for the shared browser + concurrent + // callers); --shared-context reuses the warmed persistent profile. 
+ let context; + let createdContext = false; + if (shared) { + const existing = browser.contexts(); + if (existing.length) { + context = existing[0]; + } else { + context = await browser.newContext(); + createdContext = true; + } + } else { + context = await browser.newContext(); + createdContext = true; + } + + if (stealthPath) { + const stealth = fs.readFileSync(stealthPath, 'utf8'); + if (stealth.trim()) await context.addInitScript(stealth); + } + + const page = await context.newPage(); + const log = (...a) => console.error('[browser]', ...a); + + let exitCode = 0; + try { + if (initURL) { + await page.goto(initURL, { waitUntil: 'domcontentloaded' }); + } + if (mode === 'open') { + console.log('url: ' + page.url()); + console.log('title: ' + (await page.title())); + const text = (await page.evaluate(() => (document.body ? document.body.innerText : ''))).trim(); + console.log('--- visible text (truncated to 4000 chars) ---'); + console.log(text.slice(0, 4000)); + if (screenshotPath) { + await page.screenshot({ path: screenshotPath, fullPage: true }); + console.log('screenshot: ' + screenshotPath); + } + } else { + if (!scriptPath) throw new Error('run mode requires HOMELAB_BROWSER_SCRIPT'); + const src = fs.readFileSync(scriptPath, 'utf8'); + // Run the user's source with page/context/browser/log in lexical scope. + // AsyncFunction body permits top-level await. + const AsyncFunction = Object.getPrototypeOf(async () => {}).constructor; + const fn = new AsyncFunction('page', 'context', 'browser', 'log', src); + const result = await fn(page, context, browser, log); + if (result !== undefined) { + let out; + try { + out = typeof result === 'string' ? result : JSON.stringify(result, null, 2); + } catch (_) { + out = String(result); + } + console.log(out); + } + } + } catch (e) { + console.error('homelab browser: script error:', e && e.stack ? 
e.stack : e); + exitCode = 1; + } finally { + if (!keepOpen) { + try { + // Close only what we created; never tear down the shared persistent context. + if (createdContext) { + await context.close(); + } else { + await page.close(); + } + } catch (_) { /* ignore */ } + } + // Disconnect from the CDP endpoint; this does NOT kill the remote browser. + try { + await browser.close(); + } catch (_) { /* ignore */ } + } + process.exit(exitCode); +} + +main().catch((e) => { + console.error('homelab browser: fatal:', e && e.stack ? e.stack : e); + process.exit(1); +}); diff --git a/cli/browser_stealth.js b/cli/browser_stealth.js new file mode 100644 index 00000000..dfae98a8 --- /dev/null +++ b/cli/browser_stealth.js @@ -0,0 +1,54 @@ +// Minimal stealth init script for Playwright-driven Chromium. +// Vendored from puppeteer-extra-plugin-stealth/evasions/* (MIT) — covers: +// webdriver, chrome.runtime, navigator.plugins, navigator.languages, +// Permissions.query, WebGL getParameter (vendor + renderer spoof). +// Run via context.add_init_script() so it executes before any page script. +(() => { + // navigator.webdriver — most common detection, removed entirely. + Object.defineProperty(Navigator.prototype, 'webdriver', { get: () => undefined }); + + // window.chrome.runtime — many sites check that real Chrome exposes this. + if (!window.chrome) window.chrome = {}; + window.chrome.runtime = window.chrome.runtime || {}; + + // navigator.plugins — headless reports zero; spoof a plausible PDF viewer. + Object.defineProperty(navigator, 'plugins', { + get: () => [{ name: 'Chrome PDF Plugin' }, { name: 'Chrome PDF Viewer' }, { name: 'Native Client' }], + }); + + // navigator.languages — headless returns empty array. + Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] }); + + // Permissions.query — headless returns 'denied' for notifications instead of 'default'. 
+ const origQuery = window.navigator.permissions && window.navigator.permissions.query; + if (origQuery) { + window.navigator.permissions.query = (parameters) => + parameters && parameters.name === 'notifications' + ? Promise.resolve({ state: Notification.permission }) + : origQuery(parameters); + } + + // WebGL getParameter — spoof vendor + renderer strings to a real GPU. + const spoofGl = (proto) => { + if (!proto) return; + const orig = proto.getParameter; + proto.getParameter = function (parameter) { + if (parameter === 37445) return 'Intel Inc.'; // UNMASKED_VENDOR_WEBGL + if (parameter === 37446) return 'Intel Iris OpenGL Engine'; // UNMASKED_RENDERER_WEBGL + return orig.apply(this, arguments); + }; + }; + spoofGl(window.WebGLRenderingContext && window.WebGLRenderingContext.prototype); + spoofGl(window.WebGL2RenderingContext && window.WebGL2RenderingContext.prototype); + + // disable-devtool.js (theajack/disable-devtool) auto-inits via a script + // tag with `disable-devtool-auto`. Its Performance detector trips under + // Playwright (CDP adds console.log latency vs console.table) and the + // redirect URL is hard-coded — for hmembeds that's google.com. + // Hide the auto-init marker so the library's IIFE exits early. + const origQS = Document.prototype.querySelector; + Document.prototype.querySelector = function (sel) { + if (typeof sel === 'string' && sel.indexOf('disable-devtool-auto') !== -1) return null; + return origQS.apply(this, arguments); + }; +})(); diff --git a/cli/cmd_browser.go b/cli/cmd_browser.go new file mode 100644 index 00000000..4263e4d0 --- /dev/null +++ b/cli/cmd_browser.go @@ -0,0 +1,117 @@ +package main + +import "fmt" + +// browser verbs drive the cluster's HEADFUL Chrome (ns chrome-service) over CDP +// from outside the cluster, for sites that detect/block headless automation. +// The headless @playwright/mcp browser can load such sites but their gated +// actions (submit/login) silently fail; this path submits first try. 
Mechanics +// only — the agent supplies the Playwright script. See docs/adr/0013. + +func browserCommands() []Command { + return []Command{ + {Path: []string{"browser"}, Tier: TierRead, + Summary: "headful cluster-Chrome automation for anti-bot sites (run `browser --help`)", Run: browserTopHelp}, + {Path: []string{"browser", "run"}, Tier: TierWrite, + Summary: "run a Playwright script against headful cluster Chrome: browser run [--url U] [--shared-context]", Run: browserRun}, + {Path: []string{"browser", "open"}, Tier: TierWrite, + Summary: "open a URL in headful cluster Chrome; print title + text + screenshot: browser open ", Run: browserOpen}, + } +} + +func browserTopHelp([]string) error { + fmt.Print(browserHelp()) + return nil +} + +func browserRun(args []string) error { + o, err := parseBrowserArgs("run", args) + if err != nil { + return err + } + if o.help { + fmt.Print(browserHelp()) + return nil + } + return runBrowser(o) +} + +func browserOpen(args []string) error { + o, err := parseBrowserArgs("open", args) + if err != nil { + return err + } + if o.help { + fmt.Print(browserHelp()) + return nil + } + return runBrowser(o) +} + +// browserHelp carries the discoverability payload: WHEN to reach for this, and +// the diagnostic cheat-sheet that lets the agent self-correct instead of +// retrying a deterministic form blind (the failure mode that motivated this). +func browserHelp() string { + return `homelab browser — drive the cluster's HEADFUL Chrome (anti-bot) over CDP + +The shared chrome-service (ns chrome-service) runs a REAL, headed Chrome under +Xvfb. This connects to it via a port-forward + Playwright connect_over_cdp, +injects the same stealth.js the in-cluster callers use, and runs your script. 
+ +USAGE + homelab browser run [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S] + homelab browser open [--shared-context] [--timeout S] + +WHEN TO USE THIS — escalation only; DEFAULT to the headless/MCP browser + Default to the Playwright MCP / headless browser for ALL routine browsing and + automation — it's interactive (snapshot per step), fast to start, isolated. + Reach for THIS command ONLY when headless is demonstrably blocked: a site + LOADS fine but a gated action FAILS or HANGS — a submit/login/checkout spins + forever, or ONE request errors while its siblings 200. That is the signature + of headless / anti-bot detection (navigator.webdriver, UA "HeadlessChrome", + disable-devtool traps). It presents as a real Chrome and usually succeeds + first try — but it's the shared cluster browser (slower startup, one batch + run, no per-step feedback), so it's the escalation path, never the default. + +ERROR-CODE CHEAT-SHEET (diagnose BEFORE retrying) + ERR_FILE_NOT_FOUND (-6) request intercepted/resolved locally by the + automation layer — NOT a network/egress problem. + (This is what silently broke the headless submit.) + ERR_CONNECTION_REFUSED / real egress failure (DNS/route/firewall). These also + ERR_TIMED_OUT / break the initial page load — if the page loaded, + ERR_NAME_NOT_RESOLVED egress is fine and the cause is elsewhere. + one endpoint 500s while server-side bot rejection of the automation, not + its siblings 200 your payload. + +HABITS + - Inspect the network panel BEFORE retrying a deterministic form; a blind + retry just repeats the same silent failure. + - Don't park a half-filled multi-step form across a user pause — the session + can expire; re-run the whole flow from this command in one shot. + - Uploads stream over CDP via setInputFiles from THIS host — no chmod/staging + of $HOME needed; just point setInputFiles at a local path. 
+ +CONTEXT + Default: a FRESH incognito context, closed on exit — safe for the shared + browser and concurrent callers (e.g. tripit). Your script does its own login. + --shared-context: reuse the warmed PERSISTENT profile (cookies from a manual + noVNC login at chrome.viktorbarzin.me) when you need a pre-logged-in session. + +SCRIPT CONTRACT (run mode) + Your file's body runs with page, context, browser and log() already in scope + (top-level await allowed). Return a value to print it. Example flow.js: + + await page.goto('https://portal.example.com/login'); + await page.fill('#user', 'me'); await page.fill('#pass', process.env.PW); + await page.click('button[type=submit]'); + await page.waitForURL('**/dashboard'); + return 'logged in: ' + page.url(); + + Run it: homelab browser run flow.js + +NOTES + - The Playwright client is pinned to playwright-core@` + playwrightVersion + ` to match the + chrome-service image (Chrome 130); installed once into ~/.cache/homelab/. + - The port-forward is always torn down, on success and on error. 
+` +} diff --git a/cli/cmd_browser_test.go b/cli/cmd_browser_test.go new file mode 100644 index 00000000..668897d3 --- /dev/null +++ b/cli/cmd_browser_test.go @@ -0,0 +1,172 @@ +package main + +import ( + "os" + "reflect" + "strings" + "testing" +) + +func TestParseBrowserArgsRun(t *testing.T) { + got, err := parseBrowserArgs("run", []string{ + "flow.js", "--url", "https://example.com", "--shared-context", + "--port", "19999", "--timeout", "45", "--keep-open", + }) + if err != nil { + t.Fatalf("parseBrowserArgs run: unexpected err: %v", err) + } + want := browserOpts{ + mode: "run", script: "flow.js", url: "https://example.com", + sharedCtx: true, keepOpen: true, port: 19999, timeout: 45, + } + if !reflect.DeepEqual(got, want) { + t.Fatalf("parseBrowserArgs run =\n %+v\nwant\n %+v", got, want) + } +} + +func TestParseBrowserArgsRunDefaults(t *testing.T) { + got, err := parseBrowserArgs("run", []string{"flow.js"}) + if err != nil { + t.Fatalf("unexpected err: %v", err) + } + if got.script != "flow.js" || got.sharedCtx || got.keepOpen || got.port != 0 { + t.Fatalf("defaults wrong: %+v", got) + } + if got.timeout != defaultBrowserTimeout { + t.Fatalf("timeout default = %d, want %d", got.timeout, defaultBrowserTimeout) + } +} + +func TestParseBrowserArgsRunRequiresScript(t *testing.T) { + if _, err := parseBrowserArgs("run", []string{"--url", "https://x"}); err == nil { + t.Fatalf("run without a script path should error") + } +} + +func TestParseBrowserArgsOpenRequiresURL(t *testing.T) { + got, err := parseBrowserArgs("open", []string{"https://example.com"}) + if err != nil { + t.Fatalf("unexpected err: %v", err) + } + if got.url != "https://example.com" || got.mode != "open" { + t.Fatalf("open parse wrong: %+v", got) + } + if _, err := parseBrowserArgs("open", []string{}); err == nil { + t.Fatalf("open without a URL should error") + } +} + +func TestParseBrowserArgsHelp(t *testing.T) { + for _, a := range [][]string{{"--help"}, {"-h"}, {"flow.js", "--help"}} { + got, 
err := parseBrowserArgs("run", a) + if err != nil { + t.Fatalf("help parse %v: %v", a, err) + } + if !got.help { + t.Fatalf("args %v should set help", a) + } + } +} + +func TestParseBrowserArgsEqualsForm(t *testing.T) { + got, err := parseBrowserArgs("run", []string{"flow.js", "--url=https://x", "--port=8123", "--timeout=10"}) + if err != nil { + t.Fatalf("unexpected err: %v", err) + } + if got.url != "https://x" || got.port != 8123 || got.timeout != 10 { + t.Fatalf("--flag=value form not parsed: %+v", got) + } +} + +func TestCDPHealthy(t *testing.T) { + real := []byte(`{"Browser":"Chrome/130.0.6723.31","User-Agent":"Mozilla/5.0 (X11; Linux x86_64) Chrome/130.0.0.0 Safari/537.36","webSocketDebuggerUrl":"ws://127.0.0.1/devtools/browser/x"}`) + browser, ok, err := cdpHealthy(real) + if err != nil || !ok { + t.Fatalf("real Chrome should be healthy: ok=%v err=%v", ok, err) + } + if !strings.HasPrefix(browser, "Chrome/") { + t.Fatalf("browser = %q, want Chrome/ prefix", browser) + } + + headless := []byte(`{"Browser":"HeadlessChrome/130.0.6723.31","User-Agent":"Mozilla/5.0 HeadlessChrome/130.0.0.0"}`) + if _, ok, _ := cdpHealthy(headless); ok { + t.Fatalf("HeadlessChrome must be reported unhealthy (the whole point of chrome-service)") + } + + if _, _, err := cdpHealthy([]byte("not json")); err == nil { + t.Fatalf("malformed /json/version body should error") + } +} + +func TestBuildPortForwardArgs(t *testing.T) { + got := buildPortForwardArgs(18080) + want := []string{"-n", "chrome-service", "port-forward", "svc/chrome-service", "18080:9222"} + if !reflect.DeepEqual(got, want) { + t.Fatalf("buildPortForwardArgs =\n %v\nwant\n %v", got, want) + } +} + +func TestBrowserClientPackageJSONPinsVersion(t *testing.T) { + pj := browserClientPackageJSON() + if !strings.Contains(pj, `"playwright-core": "`+playwrightVersion+`"`) { + t.Fatalf("package.json must pin playwright-core to %s; got:\n%s", playwrightVersion, pj) + } +} + +func TestPlaywrightVersionPinnedToServerMinor(t 
*testing.T) { + // chrome-service runs mcr.microsoft.com/playwright:v1.48.0-noble; the CDP + // client minor MUST match (protocol changes between minors). + if !strings.HasPrefix(playwrightVersion, "1.48.") { + t.Fatalf("playwrightVersion = %q, must be 1.48.x to match the chrome-service image", playwrightVersion) + } +} + +func TestBrowserHelpHasDiagnosticCheatSheet(t *testing.T) { + h := browserHelp() + for _, want := range []string{ + "homelab browser run", + "ERR_FILE_NOT_FOUND", + "ERR_CONNECTION_REFUSED", + "network panel", + "headless", + "--shared-context", + } { + if !strings.Contains(h, want) { + t.Errorf("browser --help is missing %q (the discoverability/self-correction payload)", want) + } + } +} + +func TestBrowserHelpIsTiered(t *testing.T) { + // --help must frame this as the ESCALATION path (default to headless first), + // matching ~/code/CLAUDE.md and chrome-service.md — non-conflicting agent + // instructions. Guard against a regression to "co-equal choice" wording. + h := browserHelp() + for _, want := range []string{"Default to the", "escalation"} { + if !strings.Contains(h, want) { + t.Errorf("browser --help must carry the tiered/default-headless framing; missing %q", want) + } + } +} + +func TestStealthJSEmbeddedMatchesCanonical(t *testing.T) { + // The embedded copy must never drift from the source of truth that the + // in-cluster callers use, else the CLI's stealth and the cluster's diverge. 
+ canonical, err := os.ReadFile("../stacks/chrome-service/files/stealth.js") + if err != nil { + t.Fatalf("read canonical stealth.js: %v", err) + } + if stealthJS != string(canonical) { + t.Fatalf("cli/browser_stealth.js has drifted from stacks/chrome-service/files/stealth.js — re-copy it") + } +} + +func TestFreePortReturnsUsablePort(t *testing.T) { + p, err := freePort() + if err != nil { + t.Fatalf("freePort: %v", err) + } + if p <= 1024 || p > 65535 { + t.Fatalf("freePort returned %d, want an ephemeral port", p) + } +} diff --git a/cli/cmd_ci.go b/cli/cmd_ci.go new file mode 100644 index 00000000..66d4902d --- /dev/null +++ b/cli/cmd_ci.go @@ -0,0 +1,99 @@ +package main + +import ( + "fmt" + "os" + "strings" + "time" +) + +func ciCommands() []Command { + return []Command{ + {Path: []string{"ci", "status"}, Tier: TierRead, + Summary: "pipeline status for HEAD/a commit: ci status [commit]", Run: ciStatus}, + {Path: []string{"ci", "watch"}, Tier: TierRead, + Summary: "poll the pipeline for HEAD (or a commit) to terminal; non-zero on failure", Run: ciWatch}, + } +} + +func short(s string) string { + if len(s) > 8 { + return s[:8] + } + return s +} + +func firstLine(s string) string { return strings.SplitN(s, "\n", 2)[0] } + +// currentHEAD returns the full HEAD sha of the cwd repo (empty if not a repo). 
+func currentHEAD() string { + cwd, _ := os.Getwd() + root, err := gitRepoRoot(cwd) + if err != nil { + return "" + } + sha, _ := gitOutput(root, "rev-parse", "HEAD") + return sha +} + +func ciStatus(args []string) error { + commit, _ := firstPositional(args) + c, err := newWPClient() + if err != nil { + return err + } + id, err := c.repoID() + if err != nil { + return err + } + p, err := c.findPipeline(id, commit) + if err != nil { + return err + } + fmt.Printf("#%d %s event=%s %s %s\n", p.Number, p.Status, p.Event, short(p.Commit), firstLine(p.Message)) + return nil +} + +func ciWatch(args []string) error { + commit, _ := firstPositional(args) + if commit == "" { + commit = currentHEAD() + } + if commit == "" { + return fmt.Errorf("no commit given and not in a git repo") + } + c, err := newWPClient() + if err != nil { + return err + } + id, err := c.repoID() + if err != nil { + return err + } + timeout := 20 * time.Minute + deadline := time.Now().Add(timeout) + last := "" + for time.Now().Before(deadline) { + p, err := c.findPipeline(id, commit) + if err != nil { + if last != "waiting" { + fmt.Fprintf(os.Stderr, "homelab: waiting for pipeline (%s)...\n", short(commit)) + last = "waiting" + } + } else { + if p.Status != last { + fmt.Fprintf(os.Stderr, "homelab: #%d %s\n", p.Number, p.Status) + last = p.Status + } + if isTerminalStatus(p.Status) { + fmt.Printf("#%d %s %s\n", p.Number, p.Status, short(commit)) + if isFailureStatus(p.Status) { + return fmt.Errorf("pipeline #%d %s (woodpecker repo, see UI/DB for the failing step)", p.Number, p.Status) + } + return nil + } + } + time.Sleep(15 * time.Second) + } + return fmt.Errorf("timed out after %s waiting for CI on %s", timeout, short(commit)) +} diff --git a/cli/cmd_claim.go b/cli/cmd_claim.go new file mode 100644 index 00000000..e11a37db --- /dev/null +++ b/cli/cmd_claim.go @@ -0,0 +1,56 @@ +package main + +import ( + "fmt" + "strings" +) + +func claimCommands() []Command { + return []Command{ + {Path: 
[]string{"claim"}, Tier: TierWrite, + Summary: "claim a shared infra resource on the presence board", + Run: runClaim}, + {Path: []string{"release"}, Tier: TierWrite, + Summary: "release a presence claim", + Run: runRelease}, + } +} + +// runClaim parses `: --purpose "..."` in either order (the presence +// script takes the label first, so we can't rely on Go's flag package which +// stops at the first positional). +func runClaim(args []string) error { + var label, purpose string + for i := 0; i < len(args); i++ { + a := args[i] + switch { + case a == "--purpose" || a == "-purpose": + if i+1 < len(args) { + purpose = args[i+1] + i++ + } + case strings.HasPrefix(a, "--purpose="): + purpose = strings.TrimPrefix(a, "--purpose=") + case !strings.HasPrefix(a, "-") && label == "": + label = a + } + } + if label == "" { + return fmt.Errorf(`usage: homelab claim : --purpose "what + why"`) + } + return presenceClaim(label, purpose) +} + +func runRelease(args []string) error { + var label string + for _, a := range args { + if !strings.HasPrefix(a, "-") { + label = a + break + } + } + if label == "" { + return fmt.Errorf("usage: homelab release :") + } + return presenceRelease(label) +} diff --git a/cli/cmd_deploy.go b/cli/cmd_deploy.go new file mode 100644 index 00000000..d5afc4a8 --- /dev/null +++ b/cli/cmd_deploy.go @@ -0,0 +1,51 @@ +package main + +import ( + "fmt" + "os" + "strings" + "time" +) + +func deployCommands() []Command { + return []Command{ + {Path: []string{"deploy", "wait"}, Tier: TierRead, + Summary: "wait for / to roll out the current (or --sha) image: deploy wait / [--sha SHA]", Run: deployWait}, + } +} + +// deployWait closes the "did the NEW code land" gap: rollout status alone returns +// success on the OLD ReplicaSet, so we first wait for the deployment image to +// reference the expected sha, THEN block on rollout status. 
+func deployWait(args []string) error { + target, _ := firstPositional(args) + if target == "" || !strings.Contains(target, "/") { + return fmt.Errorf("usage: homelab deploy wait / [--sha SHA] [--timeout 10m]") + } + parts := strings.SplitN(target, "/", 2) + ns, deploy := parts[0], parts[1] + + sha := flagValue(args, "--sha") + if sha == "" { + sha = short(currentHEAD()) + } + deadline := time.Now().Add(10 * time.Minute) + + if sha != "" { + fmt.Fprintf(os.Stderr, "homelab: waiting for %s/%s image to match %s...\n", ns, deploy, sha) + matched := false + for time.Now().Before(deadline) { + img, _ := kubectlCapture(ns, "get", "deploy", deploy, "-o", "jsonpath={.spec.template.spec.containers[*].image}") + if strings.Contains(img, sha) { + matched = true + break + } + time.Sleep(10 * time.Second) + } + if !matched { + return fmt.Errorf("timed out: %s/%s image never matched %q", ns, deploy, sha) + } + } + fmt.Fprintf(os.Stderr, "homelab: rollout status %s/%s...\n", ns, deploy) + return kubectlStream(ns, "rollout", "status", "deploy/"+deploy, "--timeout=180s") +} diff --git a/cli/cmd_ha.go b/cli/cmd_ha.go new file mode 100644 index 00000000..2309bdfc --- /dev/null +++ b/cli/cmd_ha.go @@ -0,0 +1,172 @@ +package main + +import ( + "encoding/base64" + "fmt" + "os" + "path/filepath" + "strings" +) + +// Home Assistant verbs cover the two things the `ha` MCP server can't: resolving +// the long-lived API token out of the cluster, and SSH to the HA host for +// host-level work (config files, docker, add-ons). Entity state/control stays +// with the MCP — see docs/adr/0012. +// +// The token lives in the dedicated k8s Secret openclaw/ha-tokens (one key per +// instance), split out of openclaw-secrets so non-admin operators (emo / "Home +// Server Admins") can read JUST the HA token, not the full skill_secrets blob. 
+// `ha token` resolves it on demand via the ambient kubeconfig, so it never +// depends on a pre-set env var (the gap that made agents re-derive the +// kubectl|base64|jq pipeline every session). + +type haInstance struct { + name string // sofia | london + sshUser string // SSH login on the HA host + sshHost string // host reachable from the devvm (Sofia LAN) + secretKey string // key inside the openclaw/ha-tokens Secret holding this token +} + +const ( + haDefaultInstance = "sofia" + haSecretNamespace = "openclaw" + haSecretName = "ha-tokens" // dedicated, least-privilege; see stacks/openclaw/ha_tokens.tf +) + +// haInstances maps instance name → connection/secret facts. sofia is the default +// because the devvm is on the Sofia LAN; london is documented but its host +// (192.168.8.x) is only reachable remotely, so `ha ssh --instance london` +// generally won't connect from here (token resolution still works). +var haInstances = map[string]haInstance{ + "sofia": {name: "sofia", sshUser: "vbarzin", sshHost: "192.168.1.8", secretKey: "sofia"}, + "london": {name: "london", sshUser: "hassio", sshHost: "192.168.8.103", secretKey: "london"}, +} + +func haCommands() []Command { + return []Command{ + {Path: []string{"ha", "token"}, Tier: TierRead, + Summary: "reveal the HA long-lived API token from the cluster: ha token [--instance sofia|london]", Run: haToken}, + {Path: []string{"ha", "ssh"}, Tier: TierWrite, + Summary: "run a command on the HA host over ssh: ha ssh [--instance sofia|london] [-i KEY] -- ", Run: haSSH}, + } +} + +// resolveHAInstance looks up an instance by name; "" yields the default (sofia). 
+func resolveHAInstance(name string) (haInstance, error) { + if name == "" { + name = haDefaultInstance + } + inst, ok := haInstances[name] + if !ok { + return haInstance{}, fmt.Errorf("unknown HA instance %q (want sofia or london)", name) + } + return inst, nil +} + +// decodeSecretValue base64-decodes a k8s Secret `.data.` value as returned +// by kubectl jsonpath (trailing whitespace tolerated). +func decodeSecretValue(b64 string) (string, error) { + raw, err := base64.StdEncoding.DecodeString(strings.TrimSpace(b64)) + if err != nil { + return "", fmt.Errorf("base64-decode secret value: %w", err) + } + return string(raw), nil +} + +func haToken(args []string) error { + name, _ := firstPositional(args) // accept `ha token sofia` as well as `--instance sofia` + for i := 0; i < len(args); i++ { + if args[i] == "--instance" && i+1 < len(args) { + name = args[i+1] + } else if strings.HasPrefix(args[i], "--instance=") { + name = strings.TrimPrefix(args[i], "--instance=") + } + } + inst, err := resolveHAInstance(name) + if err != nil { + return err + } + b64, err := kubectlCapture(haSecretNamespace, "get", "secret", haSecretName, + "-o", "jsonpath={.data."+inst.secretKey+"}") + if err != nil { + return fmt.Errorf("read secret %s/%s (kubeconfig set? RBAC?): %w", haSecretNamespace, haSecretName, err) + } + if b64 == "" { + return fmt.Errorf("secret %s/%s has no %q key", haSecretNamespace, haSecretName, inst.secretKey) + } + tok, err := decodeSecretValue(b64) + if err != nil { + return err + } + fmt.Println(tok) + return nil +} + +// defaultHAKeyPath is the invoking user's ed25519 key, so the verb is per-user +// rather than tied to whoever first wrote the workflow. +func defaultHAKeyPath() string { + if home, err := os.UserHomeDir(); err == nil && home != "" { + return filepath.Join(home, ".ssh", "id_ed25519") + } + return filepath.Join("~", ".ssh", "id_ed25519") +} + +// parseHASSH reads `[--instance X] [-i|--key PATH] [-- ] `. 
Tokens after +// `--` are taken verbatim; bare tokens before it are also the remote command. +func parseHASSH(args []string) (inst haInstance, keyPath string, remote []string, err error) { + name := haDefaultInstance + keyPath = defaultHAKeyPath() + for i := 0; i < len(args); i++ { + a := args[i] + switch { + case a == "--": + remote = append(remote, args[i+1:]...) + i = len(args) + case a == "--instance": + if i+1 < len(args) { + name = args[i+1] + i++ + } + case strings.HasPrefix(a, "--instance="): + name = strings.TrimPrefix(a, "--instance=") + case a == "--key" || a == "-i": + if i+1 < len(args) { + keyPath = args[i+1] + i++ + } + case strings.HasPrefix(a, "--key="): + keyPath = strings.TrimPrefix(a, "--key=") + default: + remote = append(remote, a) + } + } + inst, err = resolveHAInstance(name) + return inst, keyPath, remote, err +} + +// buildHASSHArgs assembles deterministic, non-interactive ssh args: an explicit +// key, no user ssh config, and no known_hosts prompt/record — so it runs +// unattended in an agent session without hanging on a host-key prompt. +func buildHASSHArgs(inst haInstance, keyPath string, remote []string) []string { + args := []string{ + "-F", "/dev/null", + "-o", "IdentityFile=" + keyPath, + "-o", "StrictHostKeyChecking=no", + "-o", "UserKnownHostsFile=/dev/null", + "-o", "ConnectTimeout=10", + "-o", "BatchMode=yes", + inst.sshUser + "@" + inst.sshHost, + } + return append(args, remote...) +} + +func haSSH(args []string) error { + inst, keyPath, remote, err := parseHASSH(args) + if err != nil { + return err + } + if len(remote) == 0 { + return fmt.Errorf(`usage: homelab ha ssh [--instance sofia|london] [-i KEY] -- `) + } + return runStreaming("ssh", buildHASSHArgs(inst, keyPath, remote)...) 
+} diff --git a/cli/cmd_ha_test.go b/cli/cmd_ha_test.go new file mode 100644 index 00000000..9dc10e11 --- /dev/null +++ b/cli/cmd_ha_test.go @@ -0,0 +1,92 @@ +package main + +import ( + "encoding/base64" + "reflect" + "strings" + "testing" +) + +func TestResolveHAInstance(t *testing.T) { + // empty defaults to sofia (the devvm sits on the Sofia LAN) + if got, err := resolveHAInstance(""); err != nil || got.name != "sofia" { + t.Fatalf(`resolveHAInstance("") = %+v, %v; want sofia`, got, err) + } + if got, err := resolveHAInstance("sofia"); err != nil || got.secretKey != "sofia" { + t.Fatalf("sofia secretKey = %q, %v", got.secretKey, err) + } + if got, err := resolveHAInstance("london"); err != nil || got.secretKey != "london" || got.sshUser != "hassio" { + t.Fatalf("london = %+v, %v", got, err) + } + if _, err := resolveHAInstance("paris"); err == nil { + t.Fatalf("resolveHAInstance(paris) should error on unknown instance") + } +} + +func TestDecodeSecretValue(t *testing.T) { + // k8s stores Secret values base64-encoded; `kubectl -o jsonpath={.data.}` + // returns that base64, which decodeSecretValue turns back into the raw token. 
+ enc := base64.StdEncoding.EncodeToString([]byte("tok-sofia")) + if got, err := decodeSecretValue(enc); err != nil || got != "tok-sofia" { + t.Fatalf("decodeSecretValue = %q, %v; want tok-sofia", got, err) + } + // trailing whitespace/newline from jsonpath output must be tolerated + if got, err := decodeSecretValue(enc + "\n"); err != nil || got != "tok-sofia" { + t.Fatalf("decodeSecretValue (trailing ws) = %q, %v; want tok-sofia", got, err) + } + if _, err := decodeSecretValue("not-base64!!"); err == nil { + t.Fatalf("decodeSecretValue should error on undecodable base64") + } +} + +func TestBuildHASSHArgs(t *testing.T) { + inst, _ := resolveHAInstance("sofia") + got := buildHASSHArgs(inst, "/home/u/.ssh/id_ed25519", []string{"cat", "/config/configuration.yaml"}) + want := []string{ + "-F", "/dev/null", + "-o", "IdentityFile=/home/u/.ssh/id_ed25519", + "-o", "StrictHostKeyChecking=no", + "-o", "UserKnownHostsFile=/dev/null", + "-o", "ConnectTimeout=10", + "-o", "BatchMode=yes", + "vbarzin@192.168.1.8", + "cat", "/config/configuration.yaml", + } + if !reflect.DeepEqual(got, want) { + t.Fatalf("buildHASSHArgs =\n %v\nwant\n %v", got, want) + } +} + +func TestParseHASSH(t *testing.T) { + // instance flag + everything after `--` is the verbatim remote command + inst, key, remote, err := parseHASSH([]string{"--instance", "sofia", "--", "docker", "ps", "-a"}) + if err != nil { + t.Fatalf("parseHASSH err: %v", err) + } + if inst.name != "sofia" { + t.Errorf("instance = %q, want sofia", inst.name) + } + if !strings.HasSuffix(key, "/.ssh/id_ed25519") { + t.Errorf("default key = %q, want it to end in /.ssh/id_ed25519", key) + } + if !reflect.DeepEqual(remote, []string{"docker", "ps", "-a"}) { + t.Errorf("remote = %v, want [docker ps -a]", remote) + } + + // bare args (no `--`) are also taken as the remote command; -i overrides the key + _, key2, remote2, err := parseHASSH([]string{"-i", "/tmp/k", "uptime"}) + if err != nil { + t.Fatalf("parseHASSH err: %v", err) + } + if 
key2 != "/tmp/k" { + t.Errorf("key = %q, want /tmp/k", key2) + } + if !reflect.DeepEqual(remote2, []string{"uptime"}) { + t.Errorf("remote = %v, want [uptime]", remote2) + } + + // unknown instance surfaces as an error + if _, _, _, err := parseHASSH([]string{"--instance", "paris", "--", "ls"}); err == nil { + t.Errorf("parseHASSH should error on unknown instance") + } +} diff --git a/cli/cmd_k8s.go b/cli/cmd_k8s.go new file mode 100644 index 00000000..80f8f62d --- /dev/null +++ b/cli/cmd_k8s.go @@ -0,0 +1,288 @@ +package main + +import ( + "fmt" + "os" + "strings" +) + +func k8sCommands() []Command { + return []Command{ + {Path: []string{"k8s", "status"}, Tier: TierRead, + Summary: "pods (wide) + recent non-Normal events for a namespace (or -A)", Run: k8sStatus}, + {Path: []string{"k8s", "get"}, Tier: TierRead, + Summary: "kubectl get in a namespace: k8s get [args]", Run: k8sGet}, + {Path: []string{"k8s", "logs"}, Tier: TierRead, + Summary: "logs for (deploy/; --tail/-c/--previous/--since/-l)", Run: k8sLogs}, + {Path: []string{"k8s", "describe"}, Tier: TierRead, + Summary: "describe 's deployment (or an explicit resource)", Run: k8sDescribe}, + {Path: []string{"k8s", "debug"}, Tier: TierRead, + Summary: "one-shot triage for : pods+deploy+describe+logs+events", Run: k8sDebug}, + {Path: []string{"k8s", "pf"}, Tier: TierRead, + Summary: "port-forward: k8s pf [svc/pod target]", Run: k8sPortForward}, + {Path: []string{"k8s", "db"}, Tier: TierWrite, + Summary: `query a dbaas DB: k8s db [--mysql] [--db N] -- ""`, Run: k8sDB}, + {Path: []string{"k8s", "exec"}, Tier: TierWrite, + Summary: "exec in 's pod: k8s exec [--tty] -- ", Run: k8sExec}, + {Path: []string{"k8s", "rm-pod"}, Tier: TierWrite, + Summary: "delete a stuck pod/job ONLY: k8s rm-pod -n [--job] [--force]", Run: k8sRmPod}, + {Path: []string{"k8s", "rollout-status"}, Tier: TierRead, + Summary: "rollout status of deploy/", Run: k8sRolloutStatus}, + {Path: []string{"k8s", "restart"}, Tier: TierWrite, + Summary: 
"rollout restart deploy/ then wait for status", Run: k8sRestart}, + {Path: []string{"k8s", "probe"}, Tier: TierRead, + Summary: "in-cluster reachability: ephemeral curl pod to ..svc", Run: k8sProbe}, + } +} + +func k8sStatus(args []string) error { + t := parseK8sTarget(args) + ns := t.namespace() // "" when no app/ns given → cluster-wide + get := []string{"get", "pods", "-o", "wide"} + ev := []string{"get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp"} + if ns == "" { + get = append(get, "-A") + ev = append(ev, "-A") + } + if err := kubectlStream(ns, get...); err != nil { + return err + } + fmt.Fprintln(os.Stderr, "\n--- recent events (type!=Normal) ---") + _ = kubectlStream(ns, ev...) // best-effort + return nil +} + +func k8sGet(args []string) error { + t := parseK8sTarget(args) + if t.app == "" || len(t.rest) == 0 { + return fmt.Errorf("usage: homelab k8s get [args]") + } + return kubectlStream(t.app, append([]string{"get"}, t.rest...)...) +} + +func k8sLogs(args []string) error { + t := parseK8sTarget(args) + if t.app == "" { + return fmt.Errorf("usage: homelab k8s logs [--tail N] [-c ctr] [--previous] [--since 1h] [-l sel]") + } + a := []string{"logs"} + if t.selector != "" { + a = append(a, "-l", t.selector) + } else { + a = append(a, t.objectRef()) + } + if t.container != "" { + a = append(a, "-c", t.container) + } + if !containsPrefix(t.rest, "--tail") { + a = append(a, "--tail=200") + } + a = append(a, t.rest...) + return kubectlStream(t.namespace(), a...) +} + +func k8sDescribe(args []string) error { + t := parseK8sTarget(args) + if t.app == "" { + return fmt.Errorf("usage: homelab k8s describe [resource]") + } + if len(t.rest) > 0 { + return kubectlStream(t.namespace(), append([]string{"describe"}, t.rest...)...) 
+ } + return kubectlStream(t.namespace(), "describe", t.objectRef()) +} + +func k8sDebug(args []string) error { + t := parseK8sTarget(args) + if t.app == "" { + return fmt.Errorf("usage: homelab k8s debug ") + } + ns := t.namespace() + sec := func(title string) { fmt.Fprintf(os.Stderr, "\n=== %s ===\n", title) } + sec("pods") + _ = kubectlStream(ns, "get", "pods", "-o", "wide") + sec("workloads") + _ = kubectlStream(ns, "get", "deploy,sts,ds", "-o", "wide") + sec("describe "+t.objectRef()) + _ = kubectlStream(ns, "describe", t.objectRef()) + sec("recent logs (--tail=50)") + _ = kubectlStream(ns, "logs", t.objectRef(), "--tail=50") + sec("events (type!=Normal)") + _ = kubectlStream(ns, "get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp") + return nil +} + +func k8sPortForward(args []string) error { + t := parseK8sTarget(args) + if t.app == "" || len(t.rest) == 0 { + return fmt.Errorf("usage: homelab k8s pf [svc/pod target]") + } + ports := t.rest[0] + target := "svc/" + t.app + if len(t.rest) > 1 { + target = t.rest[1] + } + return kubectlStream(t.namespace(), "port-forward", target, ports) +} + +func k8sDB(args []string) error { + var app, dbName, sql string + mysql := false + for i := 0; i < len(args); i++ { + a := args[i] + if a == "--" { + sql = strings.Join(args[i+1:], " ") + break + } + switch { + case a == "--mysql": + mysql = true + case a == "--db": + if i+1 < len(args) { + dbName = args[i+1] + i++ + } + case strings.HasPrefix(a, "--db="): + dbName = strings.TrimPrefix(a, "--db=") + case !strings.HasPrefix(a, "-") && app == "": + app = a + } + } + if app == "" { + return fmt.Errorf(`usage: homelab k8s db [--mysql] [--db NAME] -- ""`) + } + p := planDBExec(app, dbName, sql, mysql) + pod := p.pod + if pod == "" && p.selector != "" { + resolved, err := kubectlCapture(p.ns, "get", "pod", "-l", p.selector, "-o", "jsonpath={.items[0].metadata.name}") + if err != nil || resolved == "" { + return fmt.Errorf("could not resolve db pod in 
%s (selector %q): %v", p.ns, p.selector, err) + } + pod = resolved + } + exec := []string{"exec"} + if sql == "" { + exec = append(exec, "-it") // interactive client when no SQL given + } + exec = append(exec, pod) + if p.container != "" { + exec = append(exec, "-c", p.container) + } + exec = append(exec, "--") + exec = append(exec, p.argv...) + return kubectlStream(p.ns, exec...) +} + +func k8sExec(args []string) error { + t := parseK8sTarget(args) + if t.app == "" { + return fmt.Errorf("usage: homelab k8s exec [--pod p] [-c ctr] [--tty] -- ") + } + if len(t.rest) == 0 { + return fmt.Errorf("provide a command after --, e.g. homelab k8s exec %s -- env", t.app) + } + a := []string{"exec"} + if t.tty { + a = append(a, "-it") + } + a = append(a, t.objectRef()) + if t.container != "" { + a = append(a, "-c", t.container) + } + a = append(a, "--") + a = append(a, t.rest...) + return kubectlStream(t.namespace(), a...) +} + +func k8sRmPod(args []string) error { + var pod, ns, grace string + force, job := false, false + for i := 0; i < len(args); i++ { + a := args[i] + switch { + case a == "-n" || a == "--namespace": + if i+1 < len(args) { + ns = args[i+1] + i++ + } + case a == "--force": + force = true + case a == "--job": + job = true + case a == "--grace": + if i+1 < len(args) { + grace = args[i+1] + i++ + } + case !strings.HasPrefix(a, "-") && pod == "": + pod = a + } + } + if pod == "" || ns == "" { + return fmt.Errorf("usage: homelab k8s rm-pod -n [--job] [--force] [--grace N] (pods/jobs only)") + } + kind := "pod" + if job { + kind = "job" + } + a := []string{"delete", kind, pod} + if grace != "" { + a = append(a, "--grace-period="+grace) + } + if force { + a = append(a, "--force") + } + return kubectlStream(ns, a...) 
+} + +func k8sRolloutStatus(args []string) error { + t := parseK8sTarget(args) + if t.app == "" { + return fmt.Errorf("usage: homelab k8s rollout-status ") + } + return kubectlStream(t.namespace(), "rollout", "status", "deploy/"+t.app) +} + +func k8sRestart(args []string) error { + t := parseK8sTarget(args) + if t.app == "" { + return fmt.Errorf("usage: homelab k8s restart ") + } + ns := t.namespace() + if err := kubectlStream(ns, "rollout", "restart", "deploy/"+t.app); err != nil { + return err + } + return kubectlStream(ns, "rollout", "status", "deploy/"+t.app) +} + +func k8sProbe(args []string) error { + t := parseK8sTarget(args) + if t.app == "" { + return fmt.Errorf("usage: homelab k8s probe [path] [--port N]") + } + ns := t.namespace() + url := "http://" + t.app + "." + ns + ".svc.cluster.local" + if port := flagValue(args, "--port"); port != "" { + url += ":" + port + } + if len(t.rest) > 0 { + p := t.rest[0] + if !strings.HasPrefix(p, "/") { + p = "/" + p + } + url += p + } + return kubectlStream(ns, "run", "homelab-probe", "--rm", "-i", "--restart=Never", + "--image=curlimages/curl:latest", "--", + "curl", "-sS", "--max-time", "10", "-w", "\n[%{http_code}] %{time_total}s\n", url) +} + +// containsPrefix reports whether any arg starts with prefix. 
+func containsPrefix(args []string, prefix string) bool { + for _, a := range args { + if strings.HasPrefix(a, prefix) { + return true + } + } + return false +} diff --git a/cli/cmd_memory.go b/cli/cmd_memory.go new file mode 100644 index 00000000..94f3a482 --- /dev/null +++ b/cli/cmd_memory.go @@ -0,0 +1,302 @@ +package main + +import ( + "encoding/json" + "fmt" + "net/url" + "strings" +) + +func memoryCommands() []Command { + return []Command{ + {Path: []string{"memory", "recall"}, Tier: TierRead, + Summary: `semantic search of memory: memory recall "" [--query …] [--category] [--sort] [--limit]`, Run: memoryRecall}, + {Path: []string{"memory", "list"}, Tier: TierRead, + Summary: "list recent memories [--category C] [--tag T] [--limit N]", Run: memoryList}, + {Path: []string{"memory", "categories"}, Tier: TierRead, + Summary: "list memory categories", Run: memorySimpleGet("/api/categories")}, + {Path: []string{"memory", "tags"}, Tier: TierRead, + Summary: "list memory tags", Run: memorySimpleGet("/api/tags")}, + {Path: []string{"memory", "stats"}, Tier: TierRead, + Summary: "memory store stats", Run: memorySimpleGet("/api/stats")}, + {Path: []string{"memory", "secret"}, Tier: TierRead, + Summary: "reveal a sensitive memory's content: memory secret ", Run: memorySecret}, + {Path: []string{"memory", "store"}, Tier: TierWrite, + Summary: `store a memory: memory store "" [--category --tags --keywords --importance --sensitive]`, Run: memoryStore}, + {Path: []string{"memory", "update"}, Tier: TierWrite, + Summary: "update a memory: memory update [--content --tags --importance --keywords]", Run: memoryUpdate}, + {Path: []string{"memory", "delete"}, Tier: TierWrite, + Summary: "delete a memory: memory delete ", Run: memoryDelete}, + } +} + +// printMemories renders a {memories:[…]} response as compact lines, or raw JSON. 
+func printMemories(raw []byte, jsonOut bool) error { + if jsonOut { + fmt.Println(string(raw)) + return nil + } + var r struct { + Memories []struct { + ID int `json:"id"` + Content string `json:"content"` + Category string `json:"category"` + Tags string `json:"tags"` + Importance float64 `json:"importance"` + } `json:"memories"` + } + if err := json.Unmarshal(raw, &r); err != nil { + fmt.Println(string(raw)) + return nil + } + if len(r.Memories) == 0 { + fmt.Println("(no memories)") + return nil + } + for _, m := range r.Memories { + c := strings.ReplaceAll(m.Content, "\n", " ") + if len(c) > 240 { + c = c[:240] + "…" + } + fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c) + if m.Tags != "" { + fmt.Printf(" tags: %s\n", m.Tags) + } + } + return nil +} + +func memoryRecall(args []string) error { + req := memRecallReq{} + jsonOut := false + var pos []string + for i := 0; i < len(args); i++ { + a := args[i] + switch { + case a == "--query": + if i+1 < len(args) { + req.ExpandedQuery = args[i+1] + i++ + } + case a == "--category": + if i+1 < len(args) { + req.Category = args[i+1] + i++ + } + case a == "--sort": + if i+1 < len(args) { + req.SortBy = args[i+1] + i++ + } + case a == "--limit": + if i+1 < len(args) { + fmt.Sscanf(args[i+1], "%d", &req.Limit) + i++ + } + case a == "--json": + jsonOut = true + case !strings.HasPrefix(a, "-"): + pos = append(pos, a) + } + } + req.Context = strings.Join(pos, " ") + if req.Context == "" { + return fmt.Errorf(`usage: homelab memory recall "" [--query …] [--category C] [--sort importance|relevance|recency] [--limit N]`) + } + c, err := newMemoryClient() + if err != nil { + return err + } + raw, err := c.do("POST", "/api/memories/recall", req) + if err != nil { + return err + } + return printMemories(raw, jsonOut) +} + +func memoryList(args []string) error { + q := url.Values{} + jsonOut := false + for i := 0; i < len(args); i++ { + a := args[i] + switch { + case a == "--category": + if i+1 < len(args) { + 
q.Set("category", args[i+1]) + i++ + } + case a == "--tag": + if i+1 < len(args) { + q.Set("tag", args[i+1]) + i++ + } + case a == "--limit": + if i+1 < len(args) { + q.Set("limit", args[i+1]) + i++ + } + case a == "--json": + jsonOut = true + } + } + c, err := newMemoryClient() + if err != nil { + return err + } + path := "/api/memories" + if len(q) > 0 { + path += "?" + q.Encode() + } + raw, err := c.do("GET", path, nil) + if err != nil { + return err + } + return printMemories(raw, jsonOut) +} + +func memorySimpleGet(path string) func([]string) error { + return func(args []string) error { + c, err := newMemoryClient() + if err != nil { + return err + } + raw, err := c.do("GET", path, nil) + if err != nil { + return err + } + fmt.Println(string(raw)) + return nil + } +} + +func memorySecret(args []string) error { + id, _ := firstPositional(args) + if id == "" { + return fmt.Errorf("usage: homelab memory secret ") + } + c, err := newMemoryClient() + if err != nil { + return err + } + raw, err := c.do("POST", "/api/memories/"+id+"/secret", nil) + if err != nil { + return err + } + fmt.Println(string(raw)) + return nil +} + +func memoryStore(args []string) error { + req := memStoreReq{Category: "facts", Importance: 0.5} + var pos []string + for i := 0; i < len(args); i++ { + a := args[i] + switch { + case a == "--category": + if i+1 < len(args) { + req.Category = args[i+1] + i++ + } + case a == "--tags": + if i+1 < len(args) { + req.Tags = args[i+1] + i++ + } + case a == "--keywords": + if i+1 < len(args) { + req.ExpandedKeywords = args[i+1] + i++ + } + case a == "--importance": + if i+1 < len(args) { + fmt.Sscanf(args[i+1], "%f", &req.Importance) + i++ + } + case a == "--sensitive": + req.ForceSensitive = true + case !strings.HasPrefix(a, "-"): + pos = append(pos, a) + } + } + req.Content = strings.Join(pos, " ") + if req.Content == "" { + return fmt.Errorf(`usage: homelab memory store "" [--category C] [--tags ...] [--keywords ...] 
[--importance 0.5] [--sensitive]`) + } + c, err := newMemoryClient() + if err != nil { + return err + } + raw, err := c.do("POST", "/api/memories", req) + if err != nil { + return err + } + fmt.Println(string(raw)) + return nil +} + +func memoryUpdate(args []string) error { + var id string + req := memUpdateReq{} + for i := 0; i < len(args); i++ { + a := args[i] + switch { + case a == "--content": + if i+1 < len(args) { + v := args[i+1] + req.Content = &v + i++ + } + case a == "--tags": + if i+1 < len(args) { + v := args[i+1] + req.Tags = &v + i++ + } + case a == "--keywords": + if i+1 < len(args) { + v := args[i+1] + req.ExpandedKeywords = &v + i++ + } + case a == "--importance": + if i+1 < len(args) { + var f float64 + fmt.Sscanf(args[i+1], "%f", &f) + req.Importance = &f + i++ + } + case !strings.HasPrefix(a, "-") && id == "": + id = a + } + } + if id == "" { + return fmt.Errorf("usage: homelab memory update [--content ...] [--tags ...] [--importance N] [--keywords ...]") + } + c, err := newMemoryClient() + if err != nil { + return err + } + raw, err := c.do("PUT", "/api/memories/"+id, req) + if err != nil { + return err + } + fmt.Println(string(raw)) + return nil +} + +func memoryDelete(args []string) error { + id, _ := firstPositional(args) + if id == "" { + return fmt.Errorf("usage: homelab memory delete ") + } + c, err := newMemoryClient() + if err != nil { + return err + } + raw, err := c.do("DELETE", "/api/memories/"+id, nil) + if err != nil { + return err + } + fmt.Println(string(raw)) + return nil +} diff --git a/cli/cmd_net.go b/cli/cmd_net.go new file mode 100644 index 00000000..6401755c --- /dev/null +++ b/cli/cmd_net.go @@ -0,0 +1,83 @@ +package main + +import ( + "fmt" + "strings" + "time" +) + +func netCommands() []Command { + return []Command{ + {Path: []string{"net", "check"}, Tier: TierRead, + Summary: "reachability of [/path]: external (public DNS→CF) vs internal (Traefik LB)", Run: netCheck}, + {Path: []string{"dns", "lookup"}, Tier: TierRead, 
+ Summary: "resolve via Technitium (10.0.20.201) and public (1.1.1.1), diffed", Run: dnsLookup}, + } +} + +func fmtProbe(code int, d time.Duration, err error) string { + if err != nil { + return "ERR " + err.Error() + } + return fmt.Sprintf("HTTP %d %dms", code, d.Milliseconds()) +} + +func netCheck(args []string) error { + host, rest := firstPositional(args) + if host == "" { + return fmt.Errorf("usage: homelab net check [path]") + } + path := "/" + if len(rest) > 0 && !strings.HasPrefix(rest[0], "-") { + path = rest[0] + if !strings.HasPrefix(path, "/") { + path = "/" + path + } + } + u := "https://" + host + path + fmt.Printf("%s\n", u) + + // external leg: resolve via public DNS, dial the public IP (tests the real CF path) + pubOut, _ := dig(hostOnly(host), "1.1.1.1", "") + if pubIP := firstLine(pubOut); pubIP != "" { + c, d, e := probeURL(clientDialingIP(pubIP, 10*time.Second), u) + fmt.Printf(" external (public %-15s) %s\n", pubIP, fmtProbe(c, d, e)) + } else { + fmt.Println(" external (public) no public A record") + } + // internal leg: dial the Traefik LB directly + c, d, e := probeURL(clientDialingIP(internalLBIP, 10*time.Second), u) + fmt.Printf(" internal (LB %-15s) %s\n", internalLBIP, fmtProbe(c, d, e)) + return nil +} + +func dnsLookup(args []string) error { + name, rest := firstPositional(args) + if name == "" { + return fmt.Errorf("usage: homelab dns lookup [A|AAAA|TXT|MX|PTR]") + } + rr := "" + if len(rest) > 0 { + rr = rest[0] + } + tech, _ := dig(name, "10.0.20.201", rr) + pub, _ := dig(name, "1.1.1.1", rr) + fmt.Printf("technitium (10.0.20.201): %s\n", oneLineList(tech)) + fmt.Printf("public (1.1.1.1) : %s\n", oneLineList(pub)) + if strings.TrimSpace(tech) != strings.TrimSpace(pub) { + fmt.Println("⚠ mismatch — split-horizon (expected for internal-only apps) or a propagation gap") + } + return nil +} + +func hostOnly(h string) string { // strip any path accidentally included + return strings.SplitN(h, "/", 2)[0] +} + +func oneLineList(s string) 
string { + s = strings.TrimSpace(s) + if s == "" { + return "(none)" + } + return strings.ReplaceAll(s, "\n", ", ") +} diff --git a/cli/cmd_obs.go b/cli/cmd_obs.go new file mode 100644 index 00000000..33f16e6c --- /dev/null +++ b/cli/cmd_obs.go @@ -0,0 +1,197 @@ +package main + +import ( + "encoding/json" + "fmt" + "net/url" + "sort" + "strconv" + "strings" + "time" +) + +const ( + promHost = "prometheus-query.viktorbarzin.lan" + lokiHost = "loki.viktorbarzin.lan" +) + +func obsCommands() []Command { + return []Command{ + {Path: []string{"metrics", "query"}, Tier: TierRead, + Summary: `Prometheus instant query: metrics query "" [--json]`, Run: metricsQuery}, + {Path: []string{"metrics", "alerts"}, Tier: TierRead, + Summary: "list currently firing Prometheus alerts", Run: metricsAlerts}, + {Path: []string{"logs", "query"}, Tier: TierRead, + Summary: `Loki query (last --since, default 1h): logs query "" [--since 1h] [--limit N] [--json]`, Run: logsQuery}, + } +} + +// queryArg joins non-flag args into the query (PromQL/LogQL should normally be +// passed as a single quoted argument; this also tolerates unquoted multi-token). 
+func queryArg(args []string, valueFlags map[string]bool) string { + var parts []string + for i := 0; i < len(args); i++ { + a := args[i] + if valueFlags[a] { + i++ + continue + } + if strings.HasPrefix(a, "-") { + continue + } + parts = append(parts, a) + } + return strings.Join(parts, " ") +} + +func labelStr(m map[string]string) string { + name := m["__name__"] + var kv []string + for k, v := range m { + if k != "__name__" { + kv = append(kv, k+"="+v) + } + } + sort.Strings(kv) + return name + "{" + strings.Join(kv, ",") + "}" +} + +func metricsQuery(args []string) error { + q := queryArg(args, nil) + if q == "" { + return fmt.Errorf(`usage: homelab metrics query "" [--json]`) + } + v := url.Values{} + v.Set("query", q) + body, err := lbGetBody(promHost, "/api/v1/query", v) + if err != nil { + return err + } + if containsArg(args, "--json") { + fmt.Println(string(body)) + return nil + } + var r struct { + Data struct { + Result []struct { + Metric map[string]string `json:"metric"` + Value []interface{} `json:"value"` + } `json:"result"` + } `json:"data"` + } + if err := json.Unmarshal(body, &r); err != nil { + fmt.Println(string(body)) + return nil + } + if len(r.Data.Result) == 0 { + fmt.Println("(no series)") + return nil + } + for _, s := range r.Data.Result { + val := "" + if len(s.Value) == 2 { + val = fmt.Sprint(s.Value[1]) + } + fmt.Printf("%-14s %s\n", val, labelStr(s.Metric)) + } + return nil +} + +func metricsAlerts(args []string) error { + // prometheus-query is a query-only frontend (no /api/v1/alerts); the firing + // set is exposed as the synthetic ALERTS series, queryable the normal way. 
+ v := url.Values{} + v.Set("query", `ALERTS{alertstate="firing"}`) + body, err := lbGetBody(promHost, "/api/v1/query", v) + if err != nil { + return err + } + if containsArg(args, "--json") { + fmt.Println(string(body)) + return nil + } + var r struct { + Data struct { + Result []struct { + Metric map[string]string `json:"metric"` + } `json:"result"` + } `json:"data"` + } + if err := json.Unmarshal(body, &r); err != nil { + fmt.Println(string(body)) + return nil + } + if len(r.Data.Result) == 0 { + fmt.Println("(no firing alerts)") + return nil + } + for _, a := range r.Data.Result { + m := a.Metric + scope := "" + for _, k := range []string{"namespace", "deployment", "instance", "job", "node"} { + if v := m[k]; v != "" { + scope = k + "=" + v + break + } + } + fmt.Printf("%-9s %-34s %s\n", m["severity"], m["alertname"], scope) + } + return nil +} + +func logsQuery(args []string) error { + q := queryArg(args, map[string]bool{"--since": true, "--limit": true}) + if q == "" { + return fmt.Errorf(`usage: homelab logs query "" [--since 1h] [--limit N] [--json]`) + } + since := flagValue(args, "--since") + if since == "" { + since = "1h" + } + dur, err := time.ParseDuration(since) + if err != nil { + return fmt.Errorf("bad --since %q: %w", since, err) + } + limit := flagValue(args, "--limit") + if limit == "" { + limit = "100" + } + end := time.Now() + v := url.Values{} + v.Set("query", q) + v.Set("limit", limit) + v.Set("start", strconv.FormatInt(end.Add(-dur).UnixNano(), 10)) + v.Set("end", strconv.FormatInt(end.UnixNano(), 10)) + body, err := lbGetBody(lokiHost, "/loki/api/v1/query_range", v) + if err != nil { + return err + } + if containsArg(args, "--json") { + fmt.Println(string(body)) + return nil + } + var r struct { + Data struct { + Result []struct { + Values [][]string `json:"values"` + } `json:"result"` + } `json:"data"` + } + if err := json.Unmarshal(body, &r); err != nil { + fmt.Println(string(body)) + return nil + } + n := 0 + for _, s := range 
r.Data.Result { + for _, val := range s.Values { + if len(val) == 2 { + fmt.Println(val[1]) + n++ + } + } + } + if n == 0 { + fmt.Println("(no log lines)") + } + return nil +} diff --git a/cli/cmd_tf.go b/cli/cmd_tf.go new file mode 100644 index 00000000..95e0260b --- /dev/null +++ b/cli/cmd_tf.go @@ -0,0 +1,122 @@ +package main + +import ( + "fmt" + "os" + "os/signal" + "path/filepath" + "strings" + "sync" + "syscall" +) + +func tfCommands() []Command { + return []Command{ + {Path: []string{"tf", "plan"}, Tier: TierRead, + Summary: "terragrunt plan a stack (via scripts/tg)", Run: tfPassthrough("plan")}, + {Path: []string{"tf", "validate"}, Tier: TierRead, + Summary: "terragrunt validate a stack", Run: tfPassthrough("validate")}, + {Path: []string{"tf", "fmt"}, Tier: TierRead, + Summary: "terraform fmt a stack's files", Run: tfFmt}, + {Path: []string{"tf", "force-unlock"}, Tier: TierWrite, + Summary: "release a stuck terraform state lock (needs )", Run: tfForceUnlock}, + {Path: []string{"tf", "apply"}, Tier: TierWrite, + Summary: "terragrunt apply a stack — presence-coupled, out-of-band", Run: tfApply}, + } +} + +// firstPositional returns the first non-flag arg and the remaining args with it removed. +func firstPositional(args []string) (string, []string) { + for i, a := range args { + if !strings.HasPrefix(a, "-") { + rest := append(append([]string{}, args[:i]...), args[i+1:]...) + return a, rest + } + } + return "", args +} + +// resolveTfStack finds the infra root (from cwd) and the stack directory named +// by the first positional arg, returning the remaining args. 
+func resolveTfStack(args []string) (infraRoot, stackName, stackDir string, rest []string, err error) { + stackName, rest = firstPositional(args) + if stackName == "" { + err = fmt.Errorf("missing argument") + return + } + cwd, e := os.Getwd() + if e != nil { + err = e + return + } + infraRoot, err = findInfraRoot(cwd) + if err != nil { + return + } + stackDir, err = resolveStack(infraRoot, stackName) + return +} + +func tgPath(infraRoot string) string { return filepath.Join(infraRoot, "scripts", "tg") } + +// tfPassthrough runs `scripts/tg [extra]` in the stack directory. +func tfPassthrough(verb string) func([]string) error { + return func(args []string) error { + infraRoot, _, stackDir, rest, err := resolveTfStack(args) + if err != nil { + return err + } + return runStreamingIn(stackDir, tgPath(infraRoot), append([]string{verb}, rest...)...) + } +} + +func tfFmt(args []string) error { + _, _, stackDir, _, err := resolveTfStack(args) + if err != nil { + return err + } + return runStreamingIn(stackDir, "terraform", "fmt", "-recursive", ".") +} + +func tfForceUnlock(args []string) error { + infraRoot, _, stackDir, rest, err := resolveTfStack(args) + if err != nil { + return err + } + if len(rest) < 1 { + return fmt.Errorf("usage: homelab tf force-unlock ") + } + return runStreamingIn(stackDir, tgPath(infraRoot), "force-unlock", "-force", rest[0]) +} + +// tfApply applies a stack out-of-band: claim the stack on the presence board, +// ALWAYS release on exit (normal, error, or signal — fixing the claim leak), +// and warn that CI applies canonically on push. 
+func tfApply(args []string) error { + infraRoot, stackName, stackDir, _, err := resolveTfStack(args) + if err != nil { + return err + } + label := "stack:" + stackName + fmt.Fprintf(os.Stderr, + "homelab: out-of-band apply of %q — CI applies canonically on push to master.\n", stackName) + + if err := presenceClaim(label, "homelab tf apply "+stackName); err != nil { + return fmt.Errorf("presence claim failed (run `vault login -method=oidc`?): %w", err) + } + // Release exactly once, whether we exit normally, on error, or on signal — + // sync.Once makes the defer and the signal goroutine safe to both call it. + var once sync.Once + release := func() { once.Do(func() { _ = presenceRelease(label) }) } + defer release() + + sig := make(chan os.Signal, 1) + signal.Notify(sig, os.Interrupt, syscall.SIGTERM) + go func() { + <-sig + release() + os.Exit(130) + }() + + return runStreamingIn(stackDir, tgPath(infraRoot), "apply", "--non-interactive") +} diff --git a/cli/cmd_tf_test.go b/cli/cmd_tf_test.go new file mode 100644 index 00000000..74f5b9bd --- /dev/null +++ b/cli/cmd_tf_test.go @@ -0,0 +1,27 @@ +package main + +import ( + "reflect" + "testing" +) + +func TestFirstPositional(t *testing.T) { + cases := []struct { + args []string + wantName string + wantRest []string + }{ + {[]string{"vault"}, "vault", []string{}}, + {[]string{"--json", "vault"}, "vault", []string{"--json"}}, + {[]string{"vault", "abc-123"}, "vault", []string{"abc-123"}}, + {[]string{"--foo", "monitoring", "extra"}, "monitoring", []string{"--foo", "extra"}}, + {[]string{"--only-flags"}, "", []string{"--only-flags"}}, + } + for _, c := range cases { + gotName, gotRest := firstPositional(c.args) + if gotName != c.wantName || !reflect.DeepEqual(gotRest, c.wantRest) { + t.Errorf("firstPositional(%v) = (%q, %v), want (%q, %v)", + c.args, gotName, gotRest, c.wantName, c.wantRest) + } + } +} diff --git a/cli/cmd_usage.go b/cli/cmd_usage.go new file mode 100644 index 00000000..e9b7fa8e --- /dev/null +++ 
b/cli/cmd_usage.go @@ -0,0 +1,77 @@ +package main + +import ( + "encoding/json" + "fmt" + "net/url" + "sort" + "strconv" +) + +func usageCommands() []Command { + return []Command{ + {Path: []string{"usage", "top"}, Tier: TierRead, + Summary: "rank homelab verb usage across users (from Loki): usage top [--since 30d] [--user U] [--json]", Run: usageTop}, + } +} + +// usageQuery builds the LogQL metric query that counts invocations per verb. +func usageQuery(since, user string) string { + sel := `job="` + usageJob + `"` + if user != "" { + sel += `, user="` + user + `"` + } + return fmt.Sprintf(`sum by (verb) (count_over_time({%s}[%s]))`, sel, since) +} + +func usageTop(args []string) error { + since := flagValue(args, "--since") + if since == "" { + since = "30d" + } + v := url.Values{} + v.Set("query", usageQuery(since, flagValue(args, "--user"))) + body, err := lbGetBody(lokiHost, "/loki/api/v1/query", v) + if err != nil { + return err + } + if containsArg(args, "--json") { + fmt.Println(string(body)) + return nil + } + var r struct { + Data struct { + Result []struct { + Metric map[string]string `json:"metric"` + Value []interface{} `json:"value"` + } `json:"result"` + } `json:"data"` + } + if err := json.Unmarshal(body, &r); err != nil { + fmt.Println(string(body)) + return nil + } + type row struct { + verb string + n int + } + var rows []row + for _, s := range r.Data.Result { + n := 0 + if len(s.Value) == 2 { + if f, e := strconv.ParseFloat(fmt.Sprint(s.Value[1]), 64); e == nil { + n = int(f) + } + } + rows = append(rows, row{s.Metric["verb"], n}) + } + if len(rows) == 0 { + fmt.Println("(no usage recorded yet)") + return nil + } + sort.Slice(rows, func(i, j int) bool { return rows[i].n > rows[j].n }) + for _, r := range rows { + fmt.Printf("%6d %s\n", r.n, r.verb) + } + return nil +} diff --git a/cli/cmd_vault.go b/cli/cmd_vault.go new file mode 100644 index 00000000..bf270886 --- /dev/null +++ b/cli/cmd_vault.go @@ -0,0 +1,663 @@ +package main + +import ( + 
"bufio" + "encoding/base64" + "encoding/json" + "fmt" + "os" + "os/exec" + "strings" + "syscall" +) + +// vault verbs give each unix user no-HITL access to THEIR OWN Vaultwarden vault. +// Identity is the kernel UID; per-user creds live in that user's isolated Vault +// path (secret/workstation/claude-users/) read via their scoped token, and +// decryption is done by the official `bw` CLI. See +// docs/superpowers/specs/2026-06-24-homelab-vault-design.md. +func vaultCommands() []Command { + return []Command{ + {Path: []string{"vault", "setup"}, Tier: TierWrite, + Summary: "one-time: store your Vaultwarden master password + API key in your Vault path", Run: vaultSetup}, + {Path: []string{"vault", "status"}, Tier: TierRead, + Summary: "show whether your vault is configured/reachable (no secrets)", Run: vaultStatus}, + {Path: []string{"vault", "list"}, Tier: TierRead, + Summary: "list your item names: vault list [--search Q]", Run: vaultList}, + {Path: []string{"vault", "get"}, Tier: TierRead, + Summary: "fetch one item: vault get [--field password|username|uri|notes|totp] [--json]", Run: vaultGet}, + {Path: []string{"vault", "search"}, Tier: TierRead, + Summary: "search your item names: vault search ", Run: vaultSearch}, + {Path: []string{"vault", "code"}, Tier: TierRead, + Summary: "current TOTP code for an item: vault code ", Run: vaultCode}, + {Path: []string{"vault", "lock"}, Tier: TierWrite, + Summary: "lock/log out the local bw session", Run: vaultLock}, + {Path: []string{"vault"}, Tier: TierRead, + Summary: "Vaultwarden access for your own vault (run `homelab vault` for help)", + Run: func([]string) error { fmt.Print(vaultHelp()); return nil }}, + } +} + +// vaultHelp is shown for bare `homelab vault`. 
+func vaultHelp() string { + return `homelab vault — read YOUR OWN Vaultwarden logins (no-HITL after one-time setup) + + homelab vault setup one-time: store your master password + API key in your Vault path + homelab vault status configured / unlocked / reachable (no secrets) + homelab vault list [--search Q] list your item names (no secrets) + homelab vault get [--field password|username|uri|notes|totp] [--json] + TTY → clipboard (auto-clears); piped → stdout + homelab vault code current TOTP code + homelab vault lock lock / log out the local bw session + +Creds live only in your own Vault path; the admin never sees them. Identity is +your unix UID. Security model: docs/superpowers/specs/2026-06-24-homelab-vault-design.md +(note: anything running as your user can decrypt your vault — the accepted no-HITL trade). +` +} + +const vwUserPathPrefix = "secret/workstation/claude-users/" + +// vwCreds is one user's Vaultwarden auth material, read from their Vault path. +type vwCreds struct { + Email string + MasterPassword string + ClientID string + ClientSecret string +} + +// cmdRunner shells out to an external command with an explicit environment and +// returns trimmed stdout. Secrets are passed via envv, NEVER argv. Tests inject +// a fake; realRunner is the production implementation. +type cmdRunner func(name string, argv, envv []string) (string, error) + +func realRunner(name string, argv, envv []string) (string, error) { + cmd := exec.Command(name, argv...) + if envv != nil { + cmd.Env = envv + } + out, err := cmd.Output() + // Trim only the trailing newline the tool appends — NOT all whitespace, so a + // fetched secret with significant leading/trailing spaces is preserved. + return strings.TrimRight(string(out), "\r\n"), err +} + +// realRunnerStdin runs a command feeding `stdin` to it, for secret values that +// must NOT appear in argv (visible via ps / /proc//cmdline to same-UID +// processes). Used by setup to write the master password / client_secret. 
+func realRunnerStdin(name string, argv, envv []string, stdin string) (string, error) { + cmd := exec.Command(name, argv...) + if envv != nil { + cmd.Env = envv + } + cmd.Stdin = strings.NewReader(stdin) + out, err := cmd.Output() + return strings.TrimRight(string(out), "\r\n"), err +} + +func vwCredsPath(user string) string { return vwUserPathPrefix + user } + +func bwAppDataDir(uid string) string { return "/run/user/" + uid + "/homelab-bw" } + +// readVaultField returns one field from a KV-v2 path, "" if absent/error. +func readVaultField(run cmdRunner, field, path string) string { + out, err := run("vault", []string{"kv", "get", "-field=" + field, path}, nil) + if err != nil { + return "" + } + return out +} + +// loadCreds reads the four vaultwarden_* keys from the user's isolated path. +// A missing master password means the user hasn't onboarded. +func loadCreds(run cmdRunner, user string) (vwCreds, error) { + p := vwCredsPath(user) + c := vwCreds{ + Email: readVaultField(run, "vaultwarden_email", p), + MasterPassword: readVaultField(run, "vaultwarden_master_password", p), + ClientID: readVaultField(run, "vaultwarden_client_id", p), + ClientSecret: readVaultField(run, "vaultwarden_client_secret", p), + } + if c.MasterPassword == "" { + return vwCreds{}, fmt.Errorf("vault not configured for this user — run `homelab vault setup`") + } + return c, nil +} + +// vaultCurrentUser/vaultCurrentUID are seams for tests (avoid conflict with repo.go's currentUser func). +var vaultCurrentUser = func() string { return os.Getenv("USER") } +var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) } + +// bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately +// do NOT inherit the full parent env (keeps stray secrets out of the child). 
+func bwBaseEnv(appdata string) []string { + path := os.Getenv("PATH") + if path == "" { + path = "/usr/local/bin:/usr/bin:/bin" + } + return []string{ + "PATH=" + path, + "HOME=" + os.Getenv("HOME"), + "BITWARDENCLI_APPDATA_DIR=" + appdata, + "BW_NOINTERACTION=true", + } +} + +// bwSecretEnv adds the secret-bearing vars. session may be "" (pre-unlock). +func bwSecretEnv(appdata string, c vwCreds, session string) []string { + env := bwBaseEnv(appdata) + env = append(env, + "BW_CLIENTID="+c.ClientID, + "BW_CLIENTSECRET="+c.ClientSecret, + "BW_PASSWORD="+c.MasterPassword, + ) + if session != "" { + env = append(env, "BW_SESSION="+session) + } + return env +} + +func bwLoginArgs() []string { return []string{"login", "--apikey"} } +func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} } +func bwGetArgs(field, name string) []string { return []string{"get", field, name} } +func bwStatusArgs() []string { return []string{"status"} } + +// bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is +// required. Unparseable/empty output → true (safer to attempt login). +func bwNeedsLogin(statusJSON string) bool { + var s struct { + Status string `json:"status"` + } + if err := json.Unmarshal([]byte(statusJSON), &s); err != nil { + return true + } + return s.Status == "unauthenticated" || s.Status == "" +} + +func bwListArgs(search string) []string { + a := []string{"list", "items"} + if search != "" { + a = append(a, "--search", search) + } + return a +} + +// bwUnlock runs `bw unlock` and returns the raw session key. +func bwUnlock(run cmdRunner, env []string) (string, error) { + out, err := run("bw", bwUnlockArgs(), env) + if err != nil { + return "", fmt.Errorf("bw unlock failed (wrong master password? run `homelab vault setup`): %w", err) + } + return out, nil +} + +// bwGet fetches one field of one item; session must be present in env. 
+func bwGet(run cmdRunner, env []string, field, name string) (string, error) { + return run("bw", bwGetArgs(field, name), env) +} + +func returnMode(isTTY bool) string { + if isTTY { + return "clipboard" + } + return "stdout" +} + +// stdoutIsTTY reports whether stdout is a character device (a terminal). +func stdoutIsTTY() bool { + fi, err := os.Stdout.Stat() + if err != nil { + return false + } + return fi.Mode()&os.ModeCharDevice != 0 +} + +// stderrIsTTY reports whether stderr is a terminal (the OSC52 escape is written +// to stderr, so the clipboard path is only viable when stderr is a terminal). +func stderrIsTTY() bool { + fi, err := os.Stderr.Stat() + if err != nil { + return false + } + return fi.Mode()&os.ModeCharDevice != 0 +} + +// osc52 returns the OSC 52 escape that makes the local terminal copy payload to +// the system clipboard (works over SSH; no X11). osc52clear copies empty. +func osc52(payload string) string { + return "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte(payload)) + "\a" +} +func osc52clear() string { return "\x1b]52;c;\a" } + +// terminalAllowed gates OSC 52: only terminals known to honor clipboard writes, +// else we'd dump the secret's base64 into scrollback on unsupported terminals. +func terminalAllowed(term, termProgram string) bool { + t := strings.ToLower(term) + p := strings.ToLower(termProgram) + for _, ok := range []string{"kitty", "alacritty", "foot", "wezterm", "ghostty", "tmux", "screen"} { + if strings.Contains(t, ok) || strings.Contains(p, ok) { + return true + } + } + // xterm proper supports it only when the program is a known-good emulator. + return false +} + +// opRecord is one CLI operation. ItemName is accepted for the caller's +// convenience but is INTENTIONALLY never rendered into the log line — auditing +// which of your own logins you opened is itself sensitive, and per-item reads +// are invisible server-side anyway (spec §9a). 
+type opRecord struct { + User string + Verb string + PID int + PPID int + ParentComm string + ItemName string // never logged +} + +func opLogLine(r opRecord) string { + return fmt.Sprintf("user=%s verb=%s pid=%d ppid=%d parent=%s", + r.User, r.Verb, r.PID, r.PPID, r.ParentComm) +} + +// parentComm reads /proc//comm (best-effort; "" on failure). +func parentComm(ppid int) string { + b, err := os.ReadFile(fmt.Sprintf("/proc/%d/comm", ppid)) + if err != nil { + return "" + } + return strings.TrimSpace(string(b)) +} + +// writeOpLog appends one privacy-aware line to the user's op-log (best-effort; +// never blocks or fails the command). Goes to syslog so it ships to Loki. +func writeOpLog(r opRecord) { + exec.Command("logger", "-t", "homelab-vault", opLogLine(r)).Run() // best-effort +} + +func vaultLockPath(uid string) string { return "/run/user/" + uid + "/homelab-vault.lock" } + +// hardenProcess disables core dumps so a bw/homelab crash can't spill the master +// password to a core file. Best-effort. +func hardenProcess() { + _ = syscall.Setrlimit(syscall.RLIMIT_CORE, &syscall.Rlimit{Cur: 0, Max: 0}) +} + +// withUserLock serializes bw mutations for this user (concurrent Claude sessions +// as the same user otherwise race bw's appdata). Returns an unlock func. +func withUserLock(uid string) (func(), error) { + f, err := os.OpenFile(vaultLockPath(uid), os.O_CREATE|os.O_RDWR, 0600) + if err != nil { + return nil, err + } + if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil { + f.Close() + return nil, err + } + return func() { syscall.Flock(int(f.Fd()), syscall.LOCK_UN); f.Close() }, nil +} + +// session is one usable bw context: the env (with BW_SESSION) ready for `bw get`. +type session struct { + env []string +} + +// openSession resolves creds, ensures login, unlocks, and returns a ready env. +// Caller must hold the user lock. appdata is created on tmpfs (0700). 
+func openSession(run cmdRunner, user, uid string) (session, error) { + creds, err := loadCreds(run, user) + if err != nil { + return session{}, err + } + appdata := bwAppDataDir(uid) + if err := os.MkdirAll(appdata, 0700); err != nil { + return session{}, fmt.Errorf("create bw appdata %s: %w", appdata, err) + } + loginEnv := bwSecretEnv(appdata, creds, "") + // Ensure server is set and we're logged in (idempotent; ignore "already"). + _, _ = run("bw", []string{"config", "server", "https://vaultwarden.viktorbarzin.me"}, loginEnv) + st, _ := run("bw", bwStatusArgs(), loginEnv) + if bwNeedsLogin(st) { + if _, err := run("bw", bwLoginArgs(), loginEnv); err != nil { + return session{}, fmt.Errorf("bw login --apikey failed (API key valid? run `homelab vault setup`): %w", err) + } + } + sess, err := bwUnlock(run, loginEnv) + if err != nil { + return session{}, err + } + return session{env: bwSecretEnv(appdata, creds, sess)}, nil +} + +type getOpts struct { + name string + field string + json bool +} + +var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true} + +func parseGetArgs(args []string) (getOpts, error) { + o := getOpts{field: "password"} + for i := 0; i < len(args); i++ { + a := args[i] + switch { + case a == "--json": + o.json = true + case a == "--field" && i+1 < len(args): + o.field = args[i+1] + i++ + case strings.HasPrefix(a, "--field="): + o.field = strings.TrimPrefix(a, "--field=") + case !strings.HasPrefix(a, "-") && o.name == "": + o.name = a + } + } + if o.name == "" { + return o, fmt.Errorf("usage: homelab vault get [--field password|username|uri|notes|totp] [--json]") + } + if !validGetFields[o.field] { + return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field) + } + return o, nil +} + +// getValue opens a session and fetches one field. Pure of I/O side effects +// besides the runner, so it is unit-tested with a fake runner. 
+func getValue(run cmdRunner, user, uid string, o getOpts) (string, error) {
+	s, err := openSession(run, user, uid)
+	if err != nil {
+		return "", err
+	}
+	return bwGet(run, s.env, o.field, o.name)
+}
+
+// clipboardDecision picks how to return a secret value. "stdout" prints it (a
+// pipe/agent: the intended machine path); "clipboard" copies via OSC52;
+// "refuse" emits nothing sensitive (would otherwise risk dumping the secret's
+// base64 into scrollback, or silently fail because the OSC52 escape goes to a
+// non-terminal stderr).
+func clipboardDecision(stdoutTTY, stderrTTY bool, term, termProgram string) string {
+	if !stdoutTTY {
+		return "stdout"
+	}
+	if terminalAllowed(term, termProgram) && stderrTTY {
+		return "clipboard"
+	}
+	return "refuse"
+}
+
+// jsonToStdoutOK reports whether `--json` may print the secret to stdout: only
+// when stdout is NOT a terminal (i.e. piped to a machine consumer).
+func jsonToStdoutOK(stdoutTTY bool) bool { return !stdoutTTY }
+
+// emitSecret returns a value TTY-aware (see clipboardDecision). Never prints the
+// secret to a terminal's stdout/scrollback.
+func emitSecret(value string) {
+	switch clipboardDecision(stdoutIsTTY(), stderrIsTTY(), os.Getenv("TERM"), os.Getenv("TERM_PROGRAM")) {
+	case "stdout":
+		fmt.Println(value)
+	case "clipboard":
+		fmt.Fprint(os.Stderr, osc52(value))
+		fmt.Fprintln(os.Stderr, "copied to clipboard; clearing in 30s")
+		clearClipboardAfter(30)
+	default: // refuse
+		fmt.Fprintln(os.Stderr, "refusing to print secret: this terminal can't do OSC52 clipboard safely; pipe the command (e.g. | cat) or use a supported terminal")
+	}
+}
+
+// clearClipboardAfter spawns a detached background clear so the secret doesn't
+// linger in the clipboard. Best-effort.
+func clearClipboardAfter(seconds int) {
+	// os/exec connects a nil Stdout/Stderr to the null device, so the clear
+	// escape would never reach the terminal. Hand the child our stderr (where
+	// the copy escape was written) and redirect printf to it.
+	cmd := exec.Command("sh", "-c", fmt.Sprintf("sleep %d; printf '%s' >&2", seconds, osc52clear()))
+	cmd.Stderr = os.Stderr
+	_ = cmd.Start()
+}
+
+// listNames extracts "name (id)" from `bw list items` JSON; never values.
+func listNames(jsonOut string) []string { + var items []struct { + ID string `json:"id"` + Name string `json:"name"` + } + if err := json.Unmarshal([]byte(jsonOut), &items); err != nil { + return nil + } + out := make([]string, 0, len(items)) + for _, it := range items { + out = append(out, fmt.Sprintf("%s (%s)", it.Name, it.ID)) + } + return out +} + +func runList(run cmdRunner, user, uid, search string) ([]string, error) { + s, err := openSession(run, user, uid) + if err != nil { + return nil, err + } + out, err := run("bw", bwListArgs(search), s.env) + if err != nil { + return nil, err + } + return listNames(out), nil +} + +func vaultList(args []string) error { + hardenProcess() + search := "" + for i := 0; i < len(args); i++ { + if args[i] == "--search" && i+1 < len(args) { + search = args[i+1] + i++ + } else if strings.HasPrefix(args[i], "--search=") { + search = strings.TrimPrefix(args[i], "--search=") + } + } + uid := vaultCurrentUID() + unlock, err := withUserLock(uid) + if err != nil { + return err + } + defer unlock() + names, err := runList(realRunner, vaultCurrentUser(), uid, search) + if err != nil { + return err + } + for _, n := range names { + fmt.Println(n) + } + return nil +} + +func vaultSearch(args []string) error { + if len(args) == 0 { + return fmt.Errorf("usage: homelab vault search ") + } + return vaultList([]string{"--search", strings.Join(args, " ")}) +} + +func vaultCode(args []string) error { + hardenProcess() + if len(args) == 0 { + return fmt.Errorf("usage: homelab vault code ") + } + name := args[0] + uid := vaultCurrentUID() + unlock, err := withUserLock(uid) + if err != nil { + return err + } + defer unlock() + user := vaultCurrentUser() + val, err := getValue(realRunner, user, uid, getOpts{name: name, field: "totp"}) + if err != nil { + return err + } + // TOTP is the most sensitive op: log AND emit an ntfy-bound marker (spec §9a-d). 
+ writeOpLog(opRecord{User: user, Verb: "code", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name}) + exec.Command("logger", "-t", "homelab-vault-totp", "user="+user+" totp-fetch parent="+parentComm(os.Getppid())).Run() + emitSecret(val) + return nil +} + +// statusSummary reports config/reachability without revealing secrets. +func statusSummary(run cmdRunner, user, uid string) string { + if _, err := loadCreds(run, user); err != nil { + return "vault: not configured — run `homelab vault setup`" + } + s, err := openSession(run, user, uid) + if err != nil { + return "vault: configured, but unlock/login FAILED (creds stale? run `homelab vault setup`): " + err.Error() + } + if _, err := run("bw", []string{"sync"}, s.env); err != nil { + return "vault: configured + unlocked, but sync/reachability failed: " + err.Error() + } + return "vault: configured, unlocked, reachable ✓" +} + +func vaultStatus(args []string) error { + hardenProcess() + uid := vaultCurrentUID() + unlock, err := withUserLock(uid) + if err != nil { + return err + } + defer unlock() + fmt.Println(statusSummary(realRunner, vaultCurrentUser(), uid)) + return nil +} + +func vaultLock(args []string) error { + uid := vaultCurrentUID() + unlock, err := withUserLock(uid) // logout mutates bw state — serialize with get/list + if err != nil { + return err + } + defer unlock() + appdata := bwAppDataDir(uid) + _, _ = realRunner("bw", []string{"lock"}, bwBaseEnv(appdata)) + _, logoutErr := realRunner("bw", []string{"logout"}, bwBaseEnv(appdata)) + if logoutErr == nil { + fmt.Println("locked") + } + return nil // lock/logout best-effort; never error the caller +} + +// vaultPatchPublicArgs writes the non-secret identifiers via argv. Neither the +// email nor the API client_id is a usable credential on its own. 
+func vaultPatchPublicArgs(user, email, clientID string) []string { + return []string{"kv", "patch", vwCredsPath(user), + "vaultwarden_email=" + email, + "vaultwarden_client_id=" + clientID, + } +} + +// vaultPatchSecretArgs writes ONE secret value via the `key=-` stdin form, so +// the value never appears in argv (ps / /proc//cmdline). The value is fed +// on stdin by realRunnerStdin. +func vaultPatchSecretArgs(user, key string) []string { + return []string{"kv", "patch", vwCredsPath(user), key + "=-"} +} + +// writeCreds stores all four fields in the user's Vault path. The two real +// secrets (master password, API client_secret) go via stdin — never argv. +func writeCreds(user string, c vwCreds) error { + if _, err := realRunner("vault", vaultPatchPublicArgs(user, c.Email, c.ClientID), nil); err != nil { + return err + } + if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil { + return err + } + if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil { + return err + } + return nil +} + +// promptNoEcho reads one line without terminal echo (for the master password +// and the API client_secret). +func promptNoEcho(prompt string) (string, error) { + fmt.Fprint(os.Stderr, prompt) + // stty acts on its own stdin, and os/exec wires a nil Stdin to the null + // device, so hand it our terminal explicitly or echo is never disabled. + stty := func(mode string) { c := exec.Command("stty", mode); c.Stdin = os.Stdin; c.Run() } + stty("-echo") + defer func() { stty("echo"); fmt.Fprintln(os.Stderr) }() + r := bufio.NewReader(os.Stdin) + line, err := r.ReadString('\n') + // Trim only the line terminator — a master password / API secret may + // legitimately contain leading/trailing spaces. + return strings.TrimRight(line, "\r\n"), err +} + +func promptLine(prompt string) (string, error) { + fmt.Fprint(os.Stderr, prompt) + line, err := bufio.NewReader(os.Stdin).ReadString('\n') + return strings.TrimSpace(line), err +} + +func vaultSetup(args []string) error { + hardenProcess() + fmt.Fprintln(os.Stderr, "One-time setup. 
Stored ONLY in your own Vault path; the admin never sees it.") + fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.") + email, err := promptLine("Vaultwarden email: ") + if err != nil { + return err + } + clientID, err := promptLine("API key client_id (user.xxxx): ") + if err != nil { + return err + } + clientSecret, err := promptNoEcho("API key client_secret: ") + if err != nil { + return err + } + master, err := promptNoEcho("Master password: ") + if err != nil { + return err + } + if master == "" || clientID == "" || clientSecret == "" { + return fmt.Errorf("all fields are required") + } + c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret} + if err := writeCreds(vaultCurrentUser(), c); err != nil { + return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err) + } + fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…") + uid := vaultCurrentUID() + unlock, err := withUserLock(uid) + if err != nil { + return err + } + defer unlock() + if _, err := openSession(realRunner, vaultCurrentUser(), uid); err != nil { + return fmt.Errorf("stored, but verification failed — double-check master password / API key: %w", err) + } + fmt.Fprintln(os.Stderr, "✓ Verified. Fetches are now AFK.") + return nil +} + +func vaultGet(args []string) error { + hardenProcess() + o, err := parseGetArgs(args) + if err != nil { + return err + } + uid := vaultCurrentUID() + unlock, err := withUserLock(uid) + if err != nil { + return err + } + defer unlock() + user := vaultCurrentUser() + val, err := getValue(realRunner, user, uid, o) + if err != nil { + return err + } + writeOpLog(opRecord{User: user, Verb: "get", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: o.name}) + if o.json { + if !jsonToStdoutOK(stdoutIsTTY()) { + return fmt.Errorf("refusing to print a secret as JSON to a terminal; pipe it (e.g. 
| cat) or drop --json") + } + fmt.Printf("{%q:%q}\n", o.field, val) + return nil + } + emitSecret(val) + return nil +} + diff --git a/cli/cmd_vault_test.go b/cli/cmd_vault_test.go new file mode 100644 index 00000000..36aab1f4 --- /dev/null +++ b/cli/cmd_vault_test.go @@ -0,0 +1,368 @@ +package main + +import ( + "encoding/base64" + "fmt" + "os" + "reflect" + "strings" + "testing" +) + +func TestVaultCommandsRegistered(t *testing.T) { + want := map[string]Tier{ + "vault setup": TierWrite, + "vault status": TierRead, + "vault list": TierRead, + "vault get": TierRead, + "vault search": TierRead, + "vault code": TierRead, + "vault lock": TierWrite, + } + got := map[string]Tier{} + for _, c := range vaultCommands() { + got[c.name()] = c.Tier + } + for name, tier := range want { + if got[name] != tier { + t.Errorf("command %q: tier=%q, want %q (registered=%v)", name, got[name], tier, got[name] != "") + } + } +} + +func TestVaultGroupInRegistry(t *testing.T) { + if !isCommandGroup(buildRegistry(), "vault") { + t.Fatal("`vault` group not wired into buildRegistry()") + } +} + +func TestVaultCredsPath(t *testing.T) { + if got := vwCredsPath("emo"); got != "secret/workstation/claude-users/emo" { + t.Fatalf("vwCredsPath = %q", got) + } +} + +func TestBwAppDataDir(t *testing.T) { + if got := bwAppDataDir("1001"); got != "/run/user/1001/homelab-bw" { + t.Fatalf("bwAppDataDir = %q", got) + } +} + +// fakeRunner records calls and returns canned stdout/err keyed by argv[0]+first arg. 
+type fakeRunner struct { + calls [][]string + out map[string]string // key: name+" "+strings.Join(argv," ") prefix-matched + err map[string]error + lastEnv []string +} + +func (f *fakeRunner) run(name string, argv, envv []string) (string, error) { + f.calls = append(f.calls, append([]string{name}, argv...)) + f.lastEnv = envv + key := name + " " + strings.Join(argv, " ") + for k, v := range f.out { + if strings.HasPrefix(key, k) { + return v, f.err[k] + } + } + return "", f.err[key] +} + +func TestLoadCredsReadsFourFields(t *testing.T) { + f := &fakeRunner{out: map[string]string{ + "vault kv get -field=vaultwarden_email secret/workstation/claude-users/emo": "emo@x.me", + "vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "hunter2", + "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.abc", + "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "sek", + }} + c, err := loadCreds(f.run, "emo") + if err != nil { + t.Fatalf("loadCreds: %v", err) + } + want := vwCreds{Email: "emo@x.me", MasterPassword: "hunter2", ClientID: "user.abc", ClientSecret: "sek"} + if !reflect.DeepEqual(c, want) { + t.Fatalf("loadCreds = %+v want %+v", c, want) + } +} + +func TestLoadCredsUnconfigured(t *testing.T) { + f := &fakeRunner{out: map[string]string{}} // every field empty + if _, err := loadCreds(f.run, "emo"); err == nil || !strings.Contains(err.Error(), "not configured") { + t.Fatalf("want 'not configured' error, got %v", err) + } +} + +func TestBwEnvCarriesSecretsNotArgv(t *testing.T) { + c := vwCreds{ClientID: "user.abc", ClientSecret: "sek", MasterPassword: "hunter2"} + env := bwSecretEnv("/run/user/1001/homelab-bw", c, "SESSIONKEY") + joined := strings.Join(env, "\n") + for _, want := range []string{ + "BW_CLIENTID=user.abc", "BW_CLIENTSECRET=sek", "BW_PASSWORD=hunter2", + "BW_SESSION=SESSIONKEY", "BITWARDENCLI_APPDATA_DIR=/run/user/1001/homelab-bw", + } { + if 
!strings.Contains(joined, want) { + t.Errorf("bwSecretEnv missing %q", want) + } + } + if strings.Contains(joined, "PATH=") == false { + t.Error("bwSecretEnv must keep a PATH so node/bw resolve") + } +} + +func TestBwGetArgsHasNoSessionInArgv(t *testing.T) { + argv := bwGetArgs("password", "github") + for _, a := range argv { + if strings.Contains(a, "SESSION") || a == "--session" { + t.Fatalf("session must travel via env, not argv: %v", argv) + } + } + if !reflect.DeepEqual(argv, []string{"get", "password", "github"}) { + t.Fatalf("bwGetArgs = %v", argv) + } +} + +func TestBwListArgs(t *testing.T) { + if got := bwListArgs(""); !reflect.DeepEqual(got, []string{"list", "items"}) { + t.Fatalf("bwListArgs('') = %v", got) + } + if got := bwListArgs("git"); !reflect.DeepEqual(got, []string{"list", "items", "--search", "git"}) { + t.Fatalf("bwListArgs('git') = %v", got) + } +} + +func TestBwUnlockReturnsSession(t *testing.T) { + f := &fakeRunner{out: map[string]string{"bw unlock": "THE-SESSION-KEY"}} + env := bwSecretEnv("/run/user/1001/homelab-bw", vwCreds{MasterPassword: "pw"}, "") + sess, err := bwUnlock(f.run, env) + if err != nil || sess != "THE-SESSION-KEY" { + t.Fatalf("bwUnlock = %q, %v", sess, err) + } + // argv must use --passwordenv + --raw, never the password literal + last := f.calls[len(f.calls)-1] + if strings.Join(last, " ") != "bw unlock --passwordenv BW_PASSWORD --raw" { + t.Fatalf("unlock argv = %v", last) + } +} + +func TestReturnMode(t *testing.T) { + if returnMode(true) != "clipboard" || returnMode(false) != "stdout" { + t.Fatal("returnMode wrong") + } +} + +func TestOSC52Encode(t *testing.T) { + got := osc52("secret") + want := "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte("secret")) + "\a" + if got != want { + t.Fatalf("osc52 = %q want %q", got, want) + } + if osc52clear() != "\x1b]52;c;\a" { + t.Fatalf("osc52clear wrong: %q", osc52clear()) + } +} + +func TestTerminalAllowed(t *testing.T) { + allow := []struct{ term, prog string }{ + 
{"xterm-kitty", ""}, {"alacritty", ""}, {"foot", ""}, {"tmux-256color", ""}, + {"screen-256color", ""}, {"xterm-256color", "WezTerm"}, {"xterm-256color", "ghostty"}, + } + for _, c := range allow { + if !terminalAllowed(c.term, c.prog) { + t.Errorf("terminalAllowed(%q,%q) = false, want true", c.term, c.prog) + } + } + deny := []struct{ term, prog string }{{"dumb", ""}, {"", ""}, {"vt100", ""}} + for _, c := range deny { + if terminalAllowed(c.term, c.prog) { + t.Errorf("terminalAllowed(%q,%q) = true, want false", c.term, c.prog) + } + } +} + +func TestOpLogLineHasNoSecretOrItem(t *testing.T) { + line := opLogLine(opRecord{User: "emo", Verb: "get", PID: 10, PPID: 9, ParentComm: "claude", ItemName: "Chase Bank"}) + for _, must := range []string{"user=emo", "verb=get", "ppid=9", "parent=claude"} { + if !strings.Contains(line, must) { + t.Errorf("op-log missing %q: %s", must, line) + } + } + for _, mustNot := range []string{"Chase", "password", "secret"} { + if strings.Contains(line, mustNot) { + t.Fatalf("op-log LEAKS %q (privacy violation): %s", mustNot, line) + } + } +} + +func TestLockPath(t *testing.T) { + if got := vaultLockPath("1001"); got != "/run/user/1001/homelab-vault.lock" { + t.Fatalf("vaultLockPath = %q", got) + } +} + +func TestParseGetArgs(t *testing.T) { + o, err := parseGetArgs([]string{"github", "--field", "username", "--json"}) + if err != nil || o.name != "github" || o.field != "username" || !o.json { + t.Fatalf("parseGetArgs = %+v err=%v", o, err) + } + d, _ := parseGetArgs([]string{"github"}) + if d.field != "password" || d.json { + t.Fatalf("defaults wrong: %+v", d) + } + if _, err := parseGetArgs([]string{}); err == nil { + t.Fatal("get with no name must error") + } + if _, err := parseGetArgs([]string{"x", "--field", "evil"}); err == nil { + t.Fatal("invalid --field must error") + } +} + +func TestListNamesParsing(t *testing.T) { + // bw list items returns JSON; listNames extracts name + id only. 
+ js := `[{"id":"1","name":"GitHub","login":{"username":"u"}},{"id":"2","name":"AWS"}]` + names := listNames(js) + if len(names) != 2 || names[0] != "GitHub (1)" || names[1] != "AWS (2)" { + t.Fatalf("listNames = %v", names) + } +} + +func TestStatusSummaryUnconfigured(t *testing.T) { + f := &fakeRunner{out: map[string]string{}} // no creds + s := statusSummary(f.run, "emo", "1001") + if !strings.Contains(s, "not configured") { + t.Fatalf("status = %q", s) + } +} + +func TestVaultPatchPublicArgs(t *testing.T) { + got := vaultPatchPublicArgs("emo", "e@x.me", "user.ci") + want := []string{"kv", "patch", "secret/workstation/claude-users/emo", + "vaultwarden_email=e@x.me", "vaultwarden_client_id=user.ci"} + if !reflect.DeepEqual(got, want) { + t.Fatalf("vaultPatchPublicArgs = %v", got) + } + for _, a := range got { + if strings.Contains(a, "master_password") || strings.Contains(a, "client_secret") { + t.Fatalf("secret key leaked into public argv: %v", got) + } + } +} + +func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) { + for _, key := range []string{"vaultwarden_master_password", "vaultwarden_client_secret"} { + got := vaultPatchSecretArgs("emo", key) + want := []string{"kv", "patch", "secret/workstation/claude-users/emo", key + "=-"} + if !reflect.DeepEqual(got, want) { + t.Fatalf("vaultPatchSecretArgs(%q) = %v", key, got) + } + if got[len(got)-1] != key+"=-" { + t.Fatalf("secret value must be read from stdin (`%s=-`), got %v", key, got) + } + } +} + +// TestNoSecretInArgvAcrossFlow is the load-bearing security test: across the +// whole get flow (vault reads, bw config/status/login/unlock/get) NO secret +// value may appear in any command's argv — secrets travel via env/stdin only. 
+func TestNoSecretInArgvAcrossFlow(t *testing.T) { + uid := fmt.Sprintf("%d", os.Getuid()) + f := &fakeRunner{out: map[string]string{ + "vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "SUPERSECRETPW", + "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x", + "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "CLIENTSEKRET", + "bw status": `{"status":"locked"}`, + "bw unlock": "SESSIONXYZ", + "bw get password github": "p@ss", + }} + if _, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"}); err != nil { + t.Fatalf("getValue: %v", err) + } + for _, call := range f.calls { + for _, arg := range call { + for _, s := range []string{"SUPERSECRETPW", "CLIENTSEKRET", "SESSIONXYZ"} { + if strings.Contains(arg, s) { + t.Errorf("secret %q leaked into argv: %v", s, call) + } + } + } + } + if !strings.Contains(strings.Join(f.lastEnv, "\n"), "BW_SESSION=SESSIONXYZ") { + t.Error("expected BW_SESSION in the bw get env (test would be vacuous otherwise)") + } +} + +func TestClipboardDecision(t *testing.T) { + cases := []struct { + stdoutTTY, stderrTTY bool + term, prog, want string + }{ + {false, true, "xterm-kitty", "", "stdout"}, + {true, true, "xterm-kitty", "", "clipboard"}, + {true, true, "dumb", "", "refuse"}, + {true, false, "xterm-kitty", "", "refuse"}, + } + for _, c := range cases { + if got := clipboardDecision(c.stdoutTTY, c.stderrTTY, c.term, c.prog); got != c.want { + t.Errorf("clipboardDecision(%v,%v,%q) = %q, want %q", c.stdoutTTY, c.stderrTTY, c.term, got, c.want) + } + } +} + +func TestJSONToStdoutOK(t *testing.T) { + if jsonToStdoutOK(true) { + t.Error("must refuse JSON secret on a terminal") + } + if !jsonToStdoutOK(false) { + t.Error("must allow JSON when piped") + } +} + +func TestBwNeedsLogin(t *testing.T) { + if !bwNeedsLogin(`{"status":"unauthenticated"}`) { + t.Error("unauthenticated → needs login") + } + if 
bwNeedsLogin(`{"status":"locked"}`) { + t.Error("locked → no login (just unlock)") + } + if bwNeedsLogin(`{"status":"unlocked"}`) { + t.Error("unlocked → no login") + } + if !bwNeedsLogin(`not json`) { + t.Error("unparseable → attempt login") + } +} + +func TestVaultHelpMentionsSecurity(t *testing.T) { + h := vaultHelp() + for _, want := range []string{"homelab vault get", "no-HITL", "your own", "setup"} { + if !strings.Contains(h, want) { + t.Errorf("vault help missing %q", want) + } + } +} + +func TestVaultBareGroupRegistered(t *testing.T) { + for _, c := range vaultCommands() { + if len(c.Path) == 1 && c.Path[0] == "vault" { + return + } + } + t.Fatal("bare `vault` help command not registered") +} + +// getValue is the testable core: given a runner + opts, returns the secret value. +func TestGetValueFlow(t *testing.T) { + f := &fakeRunner{out: map[string]string{ + "vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw", + "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x", + "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs", + "bw status": `{"status":"locked"}`, + "bw unlock": "SESS", + "bw get password github": "p@ss", + }} + // Use real UID so os.MkdirAll(/run/user//homelab-bw) succeeds. 
+ uid := fmt.Sprintf("%d", os.Getuid()) + val, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"}) + if err != nil || val != "p@ss" { + t.Fatalf("getValue = %q, %v", val, err) + } +} diff --git a/cli/cmd_work.go b/cli/cmd_work.go new file mode 100644 index 00000000..3bf44e13 --- /dev/null +++ b/cli/cmd_work.go @@ -0,0 +1,212 @@ +package main + +import ( + "fmt" + "os" + "path/filepath" + "strings" +) + +func workCommands() []Command { + return []Command{ + {Path: []string{"work", "start"}, Tier: TierWrite, + Summary: "create a worktree + branch for a task (enter it with EnterWorktree)", Run: workStart}, + {Path: []string{"work", "land"}, Tier: TierWrite, + Summary: "merge master in, verify, push HEAD:master (run from the worktree)", Run: workLand}, + {Path: []string{"work", "clean"}, Tier: TierWrite, + Summary: "remove a task's worktree + branch (run from the main checkout)", Run: workClean}, + } +} + +// flagValue extracts `--name value` or `--name=value` from args. +func flagValue(args []string, name string) string { + for i, a := range args { + if a == name && i+1 < len(args) { + return args[i+1] + } + if strings.HasPrefix(a, name+"=") { + return strings.TrimPrefix(a, name+"=") + } + } + return "" +} + +func remotesOrEmpty(repoRoot string) []string { + r, _ := gitRemotes(repoRoot) + return r +} + +// workStart creates .worktrees/ on branch / off /master. 
+func workStart(args []string) error { + topic, _ := firstPositional(args) + if topic == "" { + return fmt.Errorf("usage: homelab work start ") + } + cwd, _ := os.Getwd() + repoRoot, err := gitRepoRoot(cwd) + if err != nil { + return fmt.Errorf("not in a git repository: %w", err) + } + remote := preferRemote(remotesOrEmpty(repoRoot)) + if remote == "" { + return fmt.Errorf("no git remote configured in %s", repoRoot) + } + flags := cryptFlagsFor(repoRoot) + branch := currentUser() + "/" + topic + wtRel := filepath.Join(".worktrees", topic) + + ensureWorktreesIgnored(repoRoot) + if err := gitStream(repoRoot, flags, "fetch", remote); err != nil { + return fmt.Errorf("fetch %s failed: %w", remote, err) + } + if err := gitStream(repoRoot, flags, "worktree", "add", wtRel, "-b", branch, remote+"/master"); err != nil { + return fmt.Errorf("worktree add failed: %w", err) + } + wtPath := filepath.Join(repoRoot, wtRel) + fmt.Printf("homelab: created worktree %s (branch %s off %s/master)\n", wtPath, branch, remote) + fmt.Printf("homelab: enter it with the native tool: EnterWorktree(path=%q)\n", wtPath) + return nil +} + +// workLand integrates the current branch into master: fetch, merge master in, +// verify, push HEAD:master (retrying on non-fast-forward), with a feature-branch +// fallback when the direct push is rejected (e.g. branch protection). 
+func workLand(args []string) error { + verifyCmd := flagValue(args, "--verify-cmd") + cwd, _ := os.Getwd() + repoRoot, err := gitRepoRoot(cwd) + if err != nil { + return fmt.Errorf("not in a git repository: %w", err) + } + branch, err := gitOutput(repoRoot, "rev-parse", "--abbrev-ref", "HEAD") + if err != nil { + return err + } + if branch == "master" || branch == "main" { + return fmt.Errorf("refusing to land: already on %s", branch) + } + remote := preferRemote(remotesOrEmpty(repoRoot)) + if remote == "" { + return fmt.Errorf("no git remote configured in %s", repoRoot) + } + flags := cryptFlagsFor(repoRoot) + + if err := gitStream(repoRoot, flags, "fetch", remote); err != nil { + return fmt.Errorf("fetch failed: %w", err) + } + if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil { + return fmt.Errorf("merging %s/master failed — resolve conflicts then re-run `homelab work land`: %w", remote, err) + } + if err := runVerify(repoRoot, verifyCmd, containsArg(args, "--no-verify")); err != nil { + return fmt.Errorf("not landing: %w", err) + } + if err := pushWithRetry(repoRoot, flags, remote, 3); err != nil { + return landFallback(repoRoot, flags, remote, branch, err) + } + fmt.Printf("homelab: landed %s -> %s/master.\n", branch, remote) + if containsArg(args, "--no-ci-watch") { + fmt.Println("homelab: --no-ci-watch set; not waiting for CI.") + return nil + } + landed, _ := gitOutput(repoRoot, "rev-parse", "HEAD") + fmt.Fprintln(os.Stderr, "homelab: watching CI for the landed commit...") + if err := ciWatch([]string{landed}); err != nil { + return fmt.Errorf("landed, but CI did not go green: %w", err) + } + return nil +} + +// runVerify runs the explicit --verify-cmd, else auto-detects (go test). If +// neither is available it REFUSES (returns an error) unless allowSkip is set — +// landing to master unverified must be a deliberate choice (--no-verify). 
+func runVerify(repoRoot, verifyCmd string, allowSkip bool) error { + if verifyCmd != "" { + fmt.Fprintf(os.Stderr, "homelab: verify: %s\n", verifyCmd) + return runStreamingIn(repoRoot, "sh", "-c", verifyCmd) + } + if isFile(filepath.Join(repoRoot, "go.mod")) { + fmt.Fprintln(os.Stderr, "homelab: verify: go test ./...") + return runStreamingIn(repoRoot, "go", "test", "./...") + } + if allowSkip { + fmt.Fprintln(os.Stderr, "homelab: WARNING: --no-verify set — landing without verification") + return nil + } + return fmt.Errorf("no verification configured for this repo — pass --verify-cmd \"...\" or --no-verify to land without verifying") +} + +// pushWithRetry pushes HEAD:master, recovering from non-fast-forward rejections +// by fetching + merging master and retrying. +func pushWithRetry(repoRoot string, flags []string, remote string, attempts int) error { + var lastErr error + for i := 0; i < attempts; i++ { + if err := gitStream(repoRoot, flags, "push", remote, "HEAD:master"); err == nil { + return nil + } else { + lastErr = err + } + if i < attempts-1 { + fmt.Fprintln(os.Stderr, "homelab: push rejected — fetching + merging master, then retrying") + if err := gitStream(repoRoot, flags, "fetch", remote); err != nil { + return err + } + if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil { + return err + } + } + } + return fmt.Errorf("push to %s/master failed after %d attempts: %w", remote, attempts, lastErr) +} + +// landFallback pushes the feature branch when the direct master push is rejected +// (e.g. branch protection), so the work isn't lost and a PR can be opened. 
+func landFallback(repoRoot string, flags []string, remote, branch string, pushErr error) error { + fmt.Fprintf(os.Stderr, "homelab: direct push to master failed (%v)\n", pushErr) + fmt.Fprintf(os.Stderr, "homelab: falling back to pushing the feature branch %q for a PR\n", branch) + if err := gitStream(repoRoot, flags, "push", "-u", remote, branch); err != nil { + return fmt.Errorf("fallback branch push also failed: %w", err) + } + fmt.Printf("homelab: pushed %s to %s. Open a PR to land it (branch protection blocked the direct push).\n", branch, remote) + return nil +} + +// workClean removes a task's worktree and branch. Run from the main checkout. +func workClean(args []string) error { + topic, _ := firstPositional(args) + if topic == "" { + return fmt.Errorf("usage: homelab work clean (run from the main checkout)") + } + cwd, _ := os.Getwd() + repoRoot, err := gitRepoRoot(cwd) + if err != nil { + return fmt.Errorf("not in a git repository: %w", err) + } + flags := cryptFlagsFor(repoRoot) + wtRel := filepath.Join(".worktrees", topic) + branch := currentUser() + "/" + topic + + if err := gitStream(repoRoot, flags, "worktree", "remove", wtRel); err != nil { + return fmt.Errorf("worktree remove failed (uncommitted changes? run from the main checkout, not the worktree): %w", err) + } + if err := gitStream(repoRoot, flags, "branch", "-d", branch); err != nil { + fmt.Fprintf(os.Stderr, "homelab: note: could not delete branch %s (unmerged — use `git branch -D` if intended): %v\n", branch, err) + } + fmt.Printf("homelab: removed worktree %s and branch %s\n", wtRel, branch) + return nil +} + +// ensureWorktreesIgnored appends .worktrees/ to .gitignore if not already ignored. 
+func ensureWorktreesIgnored(repoRoot string) { + if _, err := gitOutput(repoRoot, "check-ignore", ".worktrees"); err == nil { + return + } + gi := filepath.Join(repoRoot, ".gitignore") + f, err := os.OpenFile(gi, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644) + if err != nil { + return + } + defer f.Close() + if _, err := f.WriteString("\n.worktrees/\n"); err == nil { + fmt.Fprintln(os.Stderr, "homelab: added .worktrees/ to .gitignore") + } +} diff --git a/cli/cmd_work_test.go b/cli/cmd_work_test.go new file mode 100644 index 00000000..af573dd6 --- /dev/null +++ b/cli/cmd_work_test.go @@ -0,0 +1,32 @@ +package main + +import "testing" + +func TestRunVerifyRefusesWhenNothingToVerify(t *testing.T) { + dir := t.TempDir() // no go.mod, no verify cmd + if err := runVerify(dir, "", false); err == nil { + t.Fatal("runVerify must refuse (error) when nothing to verify and --no-verify absent") + } + if err := runVerify(dir, "", true); err != nil { + t.Fatalf("runVerify must skip when --no-verify set, got: %v", err) + } +} + +func TestFlagValue(t *testing.T) { + cases := []struct { + args []string + name string + want string + }{ + {[]string{"--verify-cmd", "go test ./..."}, "--verify-cmd", "go test ./..."}, + {[]string{"--verify-cmd=make test"}, "--verify-cmd", "make test"}, + {[]string{"topic", "--verify-cmd", "x"}, "--verify-cmd", "x"}, + {[]string{"topic"}, "--verify-cmd", ""}, + {[]string{"--verify-cmd"}, "--verify-cmd", ""}, // no value + } + for _, c := range cases { + if got := flagValue(c.args, c.name); got != c.want { + t.Errorf("flagValue(%v, %q) = %q, want %q", c.args, c.name, got, c.want) + } + } +} diff --git a/cli/command.go b/cli/command.go new file mode 100644 index 00000000..55449788 --- /dev/null +++ b/cli/command.go @@ -0,0 +1,104 @@ +package main + +import ( + "encoding/json" + "fmt" + "sort" + "strings" +) + +// Tier classifies whether a command observes (read) or mutates (write) state. 
+// v0.1 allows everything; the tier is recorded so a classifier hook can gate +// writes later without restructuring (see docs/adr/0005). +type Tier string + +const ( + TierRead Tier = "read" + TierWrite Tier = "write" +) + +// Command is one homelab verb. Path is the token sequence that selects it, +// e.g. ["claim"] or ["tf", "plan"]. Run receives the args after the path. +type Command struct { + Path []string + Tier Tier + Summary string + Run func(args []string) error +} + +// dispatch routes args to the command whose Path is the longest matching prefix +// of args, passing the remaining args to its Run. +func dispatch(reg []Command, args []string) error { + best := -1 + bestLen := 0 + for i, c := range reg { + if len(c.Path) > len(args) { + continue + } + match := true + for j, p := range c.Path { + if args[j] != p { + match = false + break + } + } + if match && len(c.Path) >= bestLen { + best = i + bestLen = len(c.Path) + } + } + if best < 0 { + return fmt.Errorf("unknown command: %q", strings.Join(args, " ")) + } + matched := reg[best] + runErr := matched.Run(args[bestLen:]) + emitUsage(matched.name(), runErr) // best-effort usage telemetry; never affects the command + return runErr +} + +// name is the space-joined verb path, e.g. "tf plan". +func (c Command) name() string { return strings.Join(c.Path, " ") } + +// sortedByName returns a copy of reg ordered by verb path for stable output. +func sortedByName(reg []Command) []Command { + out := make([]Command, len(reg)) + copy(out, reg) + sort.Slice(out, func(i, j int) bool { return out[i].name() < out[j].name() }) + return out +} + +// manifestText renders one aligned line per command: " ". +// This is the cheap progressive-discovery entrypoint (see docs/adr/0004). 
+func manifestText(reg []Command) string { + cmds := sortedByName(reg) + width := 0 + for _, c := range cmds { + if n := len(c.name()); n > width { + width = n + } + } + var b strings.Builder + for _, c := range cmds { + fmt.Fprintf(&b, "%-*s %-5s %s\n", width, c.name(), c.Tier, c.Summary) + } + return b.String() +} + +// manifestJSON renders the registry as a JSON array of {command, tier, summary} +// so agents can parse the full surface in one call. +func manifestJSON(reg []Command) (string, error) { + type entry struct { + Command string `json:"command"` + Tier string `json:"tier"` + Summary string `json:"summary"` + } + entries := make([]entry, 0, len(reg)) + for _, c := range sortedByName(reg) { + entries = append(entries, entry{Command: c.name(), Tier: string(c.Tier), Summary: c.Summary}) + } + b, err := json.MarshalIndent(entries, "", " ") + if err != nil { + return "", err + } + return string(b), nil +} diff --git a/cli/command_test.go b/cli/command_test.go new file mode 100644 index 00000000..e686622d --- /dev/null +++ b/cli/command_test.go @@ -0,0 +1,73 @@ +package main + +import ( + "encoding/json" + "reflect" + "strings" + "testing" +) + +// Tracer bullet: the dispatcher must route `homelab ` to the +// command whose Path is the longest matching prefix of the input tokens, and +// hand the command the remaining args. 
+func TestDispatchRoutesToLongestPrefixMatch(t *testing.T) { + var gotArgs []string + ran := "" + reg := []Command{ + {Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource", + Run: func(a []string) error { ran = "claim"; gotArgs = a; return nil }}, + {Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack", + Run: func(a []string) error { ran = "tf plan"; gotArgs = a; return nil }}, + } + + if err := dispatch(reg, []string{"tf", "plan", "vault", "--json"}); err != nil { + t.Fatalf("dispatch returned error: %v", err) + } + if ran != "tf plan" { + t.Fatalf("routed to %q, want %q", ran, "tf plan") + } + if want := []string{"vault", "--json"}; !reflect.DeepEqual(gotArgs, want) { + t.Fatalf("command got args %v, want %v", gotArgs, want) + } +} + +func TestDispatchUnknownCommandErrors(t *testing.T) { + reg := []Command{{Path: []string{"claim"}, Run: func(a []string) error { return nil }}} + if err := dispatch(reg, []string{"bogus"}); err == nil { + t.Fatal("expected error for unknown command, got nil") + } +} + +// The manifest is the progressive-discovery entrypoint: one line per command +// showing the full verb path, its tier, and summary, sorted for stable output. 
+func TestManifestTextListsEveryCommandWithTier(t *testing.T) { + reg := []Command{ + {Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack"}, + {Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource"}, + } + out := manifestText(reg) + for _, want := range []string{"claim", "tf plan", "read", "write", "plan a stack", "claim a resource"} { + if !strings.Contains(out, want) { + t.Errorf("manifest text missing %q\n---\n%s", want, out) + } + } + // sorted: claim (c) must appear before tf plan (t) + if strings.Index(out, "claim") > strings.Index(out, "tf plan") { + t.Errorf("manifest not sorted by path:\n%s", out) + } +} + +func TestManifestJSONIsParsableAndTagged(t *testing.T) { + reg := []Command{{Path: []string{"tf", "apply"}, Tier: TierWrite, Summary: "apply a stack"}} + out, err := manifestJSON(reg) + if err != nil { + t.Fatalf("manifestJSON error: %v", err) + } + var got []map[string]string + if err := json.Unmarshal([]byte(out), &got); err != nil { + t.Fatalf("manifest JSON not parsable: %v\n%s", err, out) + } + if len(got) != 1 || got[0]["command"] != "tf apply" || got[0]["tier"] != "write" { + t.Fatalf("unexpected manifest JSON: %v", got) + } +} diff --git a/cli/homelab.go b/cli/homelab.go new file mode 100644 index 00000000..62c0c8aa --- /dev/null +++ b/cli/homelab.go @@ -0,0 +1,98 @@ +package main + +import ( + "fmt" + "strings" +) + +// version is stamped at build time via -ldflags "-X main.version=vX.Y.Z". +var version = "dev" + +// buildRegistry returns every homelab verb. New verb-groups append here. +func buildRegistry() []Command { + var reg []Command + reg = append(reg, claimCommands()...) + reg = append(reg, tfCommands()...) + reg = append(reg, workCommands()...) + reg = append(reg, k8sCommands()...) + reg = append(reg, memoryCommands()...) + reg = append(reg, ciCommands()...) + reg = append(reg, deployCommands()...) + reg = append(reg, netCommands()...) + reg = append(reg, obsCommands()...) 
+ reg = append(reg, usageCommands()...) + reg = append(reg, haCommands()...) + reg = append(reg, browserCommands()...) + reg = append(reg, vaultCommands()...) + return reg +} + +// dispatchTop handles the homelab verb surface. handled=false means the args are +// not a homelab verb, so main() falls back to the legacy -use-case path. +func dispatchTop(args []string) (handled bool, err error) { + if len(args) == 0 { + fmt.Print(usage()) + return true, nil + } + switch args[0] { + case "help", "-h", "--help": + fmt.Print(usage()) + return true, nil + case "version", "--version": + fmt.Println("homelab " + version) + return true, nil + case "manifest": + reg := buildRegistry() + if containsArg(args[1:], "--json") { + out, err := manifestJSON(reg) + if err != nil { + return true, err + } + fmt.Println(out) + return true, nil + } + fmt.Print(manifestText(reg)) + return true, nil + } + if strings.HasPrefix(args[0], "-") { + return false, nil + } + reg := buildRegistry() + if !isCommandGroup(reg, args[0]) { + return false, nil + } + return true, dispatch(reg, args) +} + +func isCommandGroup(reg []Command, group string) bool { + for _, c := range reg { + if len(c.Path) > 0 && c.Path[0] == group { + return true + } + } + return false +} + +func containsArg(args []string, want string) bool { + for _, a := range args { + if a == want { + return true + } + } + return false +} + +func usage() string { + var b strings.Builder + fmt.Fprintf(&b, "homelab %s — unified homelab operations CLI\n\n", version) + b.WriteString("Usage:\n homelab [args]\n\nCommands:\n") + for _, line := range strings.Split(strings.TrimRight(manifestText(buildRegistry()), "\n"), "\n") { + if line != "" { + b.WriteString(" " + line + "\n") + } + } + b.WriteString("\n manifest [--json] list all commands (machine-readable with --json)\n") + b.WriteString(" version print version\n") + b.WriteString("\nLegacy webhook use-cases remain available via -use-case=.\n") + return b.String() +} diff --git a/cli/k8s.go 
b/cli/k8s.go new file mode 100644 index 00000000..3a2d0a5d --- /dev/null +++ b/cli/k8s.go @@ -0,0 +1,138 @@ +package main + +import ( + "fmt" + "os/exec" + "strings" +) + +// kubectl helpers use the ambient kubeconfig (no per-call auth flags). + +func kubectlBase(ns string, args ...string) []string { + var full []string + if ns != "" { + full = append(full, "-n", ns) + } + return append(full, args...) +} + +func kubectlStream(ns string, args ...string) error { + return runStreamingIn("", "kubectl", kubectlBase(ns, args...)...) +} + +// kubectlCapture runs kubectl and returns trimmed stdout (for resolving pods). +func kubectlCapture(ns string, args ...string) (string, error) { + out, err := exec.Command("kubectl", kubectlBase(ns, args...)...).Output() + return strings.TrimSpace(string(out)), err +} + +// k8sTarget is the parsed `` + selectors shared by the k8s verbs. +type k8sTarget struct { + app string + ns string + pod string + container string + selector string + tty bool + rest []string // passthrough flags and, after `--`, the exec command +} + +// parseK8sTarget reads ` [-n ns] [--pod p] [-c ctr] [-l sel] [flags] [-- cmd]`. +// The first bare token is the app; unknown flags pass through in rest. +func parseK8sTarget(args []string) k8sTarget { + t := k8sTarget{} + i := 0 + take := func() string { + if i+1 < len(args) { + i++ + return args[i] + } + return "" + } + for i = 0; i < len(args); i++ { + a := args[i] + switch { + case a == "--": + t.rest = append(t.rest, args[i+1:]...) 
+ return t + case a == "-n" || a == "--namespace": + t.ns = take() + case strings.HasPrefix(a, "--namespace="): + t.ns = strings.TrimPrefix(a, "--namespace=") + case a == "--pod": + t.pod = take() + case strings.HasPrefix(a, "--pod="): + t.pod = strings.TrimPrefix(a, "--pod=") + case a == "-c" || a == "--container": + t.container = take() + case strings.HasPrefix(a, "--container="): + t.container = strings.TrimPrefix(a, "--container=") + case a == "-l" || a == "--selector": + t.selector = take() + case strings.HasPrefix(a, "--selector="): + t.selector = strings.TrimPrefix(a, "--selector=") + case a == "--tty" || a == "-it" || a == "-ti": + t.tty = true + case !strings.HasPrefix(a, "-") && t.app == "": + t.app = a + default: + t.rest = append(t.rest, a) + } + } + return t +} + +// namespace defaults to the app name (most namespaces hold exactly one app). +func (t k8sTarget) namespace() string { + if t.ns != "" { + return t.ns + } + return t.app +} + +// objectRef is the kubectl object for logs/exec: an explicit pod, else +// deploy/ (kubectl resolves a pod from the Deployment). +func (t k8sTarget) objectRef() string { + if t.pod != "" { + return "pod/" + t.pod + } + return "deploy/" + t.app +} + +// --- database access (the dbaas exec pattern) --- + +type dbPlan struct { + ns string + pod string // explicit pod (e.g. mysql-standalone-0) + selector string // resolve the pod by this label when pod == "" (CNPG primary) + container string // "" = default container + argv []string // command + args to run inside the pod +} + +// planDBExec builds the in-pod command to run sql against app's database. +// PG (default): CNPG primary POD (resolved by label — pg-cluster-rw is a +// Service, not an exec target), psql -U postgres -d . +// MySQL: mysql-standalone-0, password from env (never on the command line). +// dbName defaults to app. sql empty => interactive client. 
+func planDBExec(app, dbName, sql string, mysql bool) dbPlan { + if dbName == "" { + dbName = app + } + if mysql { + inner := fmt.Sprintf(`mysql -u root -p"$MYSQL_ROOT_PASSWORD" %s`, shellQuote(dbName)) + if sql != "" { + inner += " -e " + shellQuote(sql) + } + return dbPlan{ns: "dbaas", pod: "mysql-standalone-0", argv: []string{"bash", "-c", inner}} + } + argv := []string{"psql", "-U", "postgres", "-d", dbName} + if sql != "" { + argv = append(argv, "-tAc", sql) + } + return dbPlan{ns: "dbaas", selector: "cnpg.io/instanceRole=primary", container: "postgres", argv: argv} +} + +// shellQuote single-quotes s for safe embedding in a bash -c string. +func shellQuote(s string) string { + return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'" +} diff --git a/cli/k8s_test.go b/cli/k8s_test.go new file mode 100644 index 00000000..cfa356bc --- /dev/null +++ b/cli/k8s_test.go @@ -0,0 +1,65 @@ +package main + +import ( + "reflect" + "strings" + "testing" +) + +func TestParseK8sTarget(t *testing.T) { + got := parseK8sTarget([]string{"tripit", "-n", "prod", "--pod", "x-123", "-c", "app", "-l", "k=v", "--tail=50", "--", "ls", "-la"}) + want := k8sTarget{app: "tripit", ns: "prod", pod: "x-123", container: "app", selector: "k=v", rest: []string{"--tail=50", "ls", "-la"}} + if !reflect.DeepEqual(got, want) { + t.Fatalf("parseK8sTarget =\n %+v\nwant\n %+v", got, want) + } +} + +func TestK8sTargetNamespaceDefaultsToApp(t *testing.T) { + if ns := parseK8sTarget([]string{"immich"}).namespace(); ns != "immich" { + t.Errorf("namespace() = %q, want immich", ns) + } + if ns := parseK8sTarget([]string{"immich", "-n", "dbaas"}).namespace(); ns != "dbaas" { + t.Errorf("namespace() = %q, want dbaas", ns) + } +} + +func TestK8sTargetObjectRef(t *testing.T) { + if r := parseK8sTarget([]string{"tripit"}).objectRef(); r != "deploy/tripit" { + t.Errorf("objectRef() = %q, want deploy/tripit", r) + } + if r := parseK8sTarget([]string{"tripit", "--pod", "tripit-abc"}).objectRef(); r != "pod/tripit-abc" 
{ + t.Errorf("objectRef() = %q, want pod/tripit-abc", r) + } +} + +func TestPlanDBExecPostgresDefault(t *testing.T) { + p := planDBExec("fire-planner", "", "SELECT 1", false) + // pg-cluster-rw is a Service, so the PG plan resolves the primary POD by + // label rather than naming an (un-exec-able) Service. + if p.ns != "dbaas" || p.pod != "" || p.selector != "cnpg.io/instanceRole=primary" || p.container != "postgres" { + t.Fatalf("unexpected pg target: %+v", p) + } + // db name defaults to the app; SQL passed via -tAc + joined := strings.Join(p.argv, " ") + if !strings.Contains(joined, "-d fire-planner") || !strings.Contains(joined, "-tAc") { + t.Fatalf("pg argv missing db/sql: %v", p.argv) + } +} + +func TestPlanDBExecMysqlEnvPassword(t *testing.T) { + p := planDBExec("wrongmove", "wrongmove", "SHOW TABLES", true) + if p.pod != "mysql-standalone-0" { + t.Fatalf("unexpected mysql pod: %+v", p) + } + inner := strings.Join(p.argv, " ") + // password must come from the env var, never inline + if !strings.Contains(inner, `-p"$MYSQL_ROOT_PASSWORD"`) { + t.Fatalf("mysql must use env password wrapper: %v", p.argv) + } +} + +func TestShellQuoteEscapes(t *testing.T) { + if got := shellQuote("a'b"); got != `'a'\''b'` { + t.Fatalf("shellQuote = %q", got) + } +} diff --git a/cli/main.go b/cli/main.go index 3b9fee1c..a53f7672 100644 --- a/cli/main.go +++ b/cli/main.go @@ -26,8 +26,16 @@ var ( ) func main() { - err := run() - if err != nil { + // homelab verb surface (work/tf/claim/...) is tried first; if the args are + // not a homelab verb, fall through to the legacy webhook -use-case path. 
+ if handled, err := dispatchTop(os.Args[1:]); handled { + if err != nil { + fmt.Fprintln(os.Stderr, "homelab: "+err.Error()) + os.Exit(1) + } + return + } + if err := run(); err != nil { glog.Errorf("run failed: %s", err.Error()) os.Exit(255) } diff --git a/cli/memory.go b/cli/memory.go new file mode 100644 index 00000000..286ee5bb --- /dev/null +++ b/cli/memory.go @@ -0,0 +1,103 @@ +package main + +import ( + "bytes" + "encoding/json" + "fmt" + "io" + "net/http" + "os" + "strings" + "time" +) + +// defaultMemoryURL is used when no env override is present (agents normally have +// CLAUDE_MEMORY_API_URL set by the memory hooks). +const defaultMemoryURL = "https://claude-memory.viktorbarzin.me" + +type memoryClient struct { + base string + key string + http *http.Client +} + +func firstEnv(keys ...string) string { + for _, k := range keys { + if v := os.Getenv(k); v != "" { + return v + } + } + return "" +} + +func resolveMemoryBase() string { + if b := firstEnv("CLAUDE_MEMORY_API_URL", "MEMORY_API_URL"); b != "" { + return strings.TrimRight(b, "/") + } + return defaultMemoryURL +} + +// newMemoryClient talks straight to the claude-memory HTTP API (the same backend +// the MCP wraps), so it works even when the MCP frontend is down. 
+func newMemoryClient() (*memoryClient, error) { + key := firstEnv("CLAUDE_MEMORY_API_KEY", "MEMORY_API_KEY") + if key == "" { + return nil, fmt.Errorf("no memory API key — set CLAUDE_MEMORY_API_KEY (or MEMORY_API_KEY)") + } + return &memoryClient{base: resolveMemoryBase(), key: key, http: &http.Client{Timeout: 30 * time.Second}}, nil +} + +func (c *memoryClient) do(method, path string, body interface{}) ([]byte, error) { + var r io.Reader + if body != nil { + b, err := json.Marshal(body) + if err != nil { + return nil, err + } + r = bytes.NewReader(b) + } + req, err := http.NewRequest(method, c.base+path, r) + if err != nil { + return nil, err + } + req.Header.Set("Authorization", "Bearer "+c.key) + if body != nil { + req.Header.Set("Content-Type", "application/json") + } + resp, err := c.http.Do(req) + if err != nil { + return nil, err + } + defer resp.Body.Close() + out, _ := io.ReadAll(resp.Body) + if resp.StatusCode >= 300 { + return nil, fmt.Errorf("memory API %s %s -> %d: %s", method, path, resp.StatusCode, strings.TrimSpace(string(out))) + } + return out, nil +} + +// Request bodies mirror src/claude_memory/api/models.py. 
+ +type memRecallReq struct { + Context string `json:"context"` + ExpandedQuery string `json:"expanded_query,omitempty"` + Category string `json:"category,omitempty"` + SortBy string `json:"sort_by,omitempty"` + Limit int `json:"limit,omitempty"` +} + +type memStoreReq struct { + Content string `json:"content"` + Category string `json:"category,omitempty"` + Tags string `json:"tags,omitempty"` + ExpandedKeywords string `json:"expanded_keywords,omitempty"` + Importance float64 `json:"importance"` + ForceSensitive bool `json:"force_sensitive,omitempty"` +} + +type memUpdateReq struct { + Content *string `json:"content,omitempty"` + Tags *string `json:"tags,omitempty"` + Importance *float64 `json:"importance,omitempty"` + ExpandedKeywords *string `json:"expanded_keywords,omitempty"` +} diff --git a/cli/memory_test.go b/cli/memory_test.go new file mode 100644 index 00000000..7b14ef20 --- /dev/null +++ b/cli/memory_test.go @@ -0,0 +1,51 @@ +package main + +import ( + "encoding/json" + "os" + "strings" + "testing" +) + +func TestResolveMemoryBase(t *testing.T) { + old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL") + defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }() + + os.Unsetenv("CLAUDE_MEMORY_API_URL") + os.Unsetenv("MEMORY_API_URL") + if got := resolveMemoryBase(); got != defaultMemoryURL { + t.Errorf("resolveMemoryBase() = %q, want default %q", got, defaultMemoryURL) + } + os.Setenv("CLAUDE_MEMORY_API_URL", "https://m.example/") // trailing slash trimmed + if got := resolveMemoryBase(); got != "https://m.example" { + t.Errorf("resolveMemoryBase() = %q, want https://m.example", got) + } +} + +func TestMemStoreReqAlwaysSendsImportance(t *testing.T) { + b, _ := json.Marshal(memStoreReq{Content: "x", Category: "facts", Importance: 0.5}) + s := string(b) + if !strings.Contains(s, `"content":"x"`) || !strings.Contains(s, `"importance":0.5`) { + t.Fatalf("memStoreReq JSON missing fields: %s", s) + } +} 
+ +func TestMemUpdateReqOmitsUnsetFields(t *testing.T) { + tags := "a,b" + b, _ := json.Marshal(memUpdateReq{Tags: &tags}) + s := string(b) + if strings.Contains(s, "content") || strings.Contains(s, "importance") { + t.Fatalf("unset update fields must be omitted: %s", s) + } + if !strings.Contains(s, `"tags":"a,b"`) { + t.Fatalf("set field missing: %s", s) + } +} + +func TestMemRecallReqOmitsEmptyOptionals(t *testing.T) { + b, _ := json.Marshal(memRecallReq{Context: "hi"}) + s := string(b) + if strings.Contains(s, "expanded_query") || strings.Contains(s, "category") || strings.Contains(s, "limit") { + t.Fatalf("empty optionals must be omitted: %s", s) + } +} diff --git a/cli/presence.go b/cli/presence.go new file mode 100644 index 00000000..bcf054d7 --- /dev/null +++ b/cli/presence.go @@ -0,0 +1,58 @@ +package main + +import ( + "fmt" + "os" + "path/filepath" + "strings" +) + +// validPresenceKinds is the fixed label taxonomy accepted by the presence board. +var validPresenceKinds = []string{"node", "host", "stack", "service", "db", "pvc", "infra"} + +// presenceScript locates the presence CLI — homelab WRAPS it, it does not +// reimplement it. Override with HOMELAB_PRESENCE; defaults to ~/code/scripts/presence. +func presenceScript() string { + if p := os.Getenv("HOMELAB_PRESENCE"); p != "" { + return p + } + home, err := os.UserHomeDir() + if err != nil { + return "presence" + } + return filepath.Join(home, "code", "scripts", "presence") +} + +// validateLabel checks a presence label is : with a known kind. +func validateLabel(label string) error { + parts := strings.SplitN(label, ":", 2) + if len(parts) != 2 || parts[0] == "" || parts[1] == "" { + return fmt.Errorf("label must be : (e.g. 
stack:vault), got %q", label) + } + for _, k := range validPresenceKinds { + if parts[0] == k { + return nil + } + } + return fmt.Errorf("invalid label kind %q; valid kinds: %s", parts[0], strings.Join(validPresenceKinds, ", ")) +} + +// presenceClaim claims label on the board with a purpose note. +func presenceClaim(label, purpose string) error { + if err := validateLabel(label); err != nil { + return err + } + args := []string{"claim", label} + if purpose != "" { + args = append(args, "--purpose", purpose) + } + return runStreaming(presenceScript(), args...) +} + +// presenceRelease releases a prior claim on label. +func presenceRelease(label string) error { + if err := validateLabel(label); err != nil { + return err + } + return runStreaming(presenceScript(), "release", label) +} diff --git a/cli/presence_test.go b/cli/presence_test.go new file mode 100644 index 00000000..3d1596e1 --- /dev/null +++ b/cli/presence_test.go @@ -0,0 +1,24 @@ +package main + +import "testing" + +func TestValidateLabelAcceptsTaxonomy(t *testing.T) { + good := []string{ + "stack:vault", "service:health", "node:k8s-node1", "db:pg-cluster", + "infra:gpu-operator", "host:proxmox-1", "pvc:dbaas/data", + } + for _, l := range good { + if err := validateLabel(l); err != nil { + t.Errorf("validateLabel(%q) = %v, want nil", l, err) + } + } +} + +func TestValidateLabelRejectsBadLabels(t *testing.T) { + bad := []string{"vault", "stack:", "bogus:x", ":x", "stack", ""} + for _, l := range bad { + if err := validateLabel(l); err == nil { + t.Errorf("validateLabel(%q) = nil, want error", l) + } + } +} diff --git a/cli/probe.go b/cli/probe.go new file mode 100644 index 00000000..25d148a0 --- /dev/null +++ b/cli/probe.go @@ -0,0 +1,76 @@ +package main + +import ( + "context" + "crypto/tls" + "fmt" + "io" + "net" + "net/http" + "net/url" + "os/exec" + "strings" + "time" +) + +// internalLBIP is the dedicated Traefik LB; every internal ingress routes through it. 
+const internalLBIP = "10.0.20.203" + +// clientDialingIP returns an http.Client that dials ip for ANY host while keeping +// the URL host as SNI (so the cert matches) — the Go form of `curl --resolve +// host:443:ip`. TLS verification is skipped (these are reachability/observability +// probes, not security checks; internal .lan vhosts may serve a non-matching cert). +func clientDialingIP(ip string, timeout time.Duration) *http.Client { + d := &net.Dialer{Timeout: 8 * time.Second} + tr := &http.Transport{ + DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) { + if i := strings.LastIndex(addr, ":"); i >= 0 { + addr = ip + addr[i:] + } + return d.DialContext(ctx, network, addr) + }, + TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, + } + return &http.Client{Timeout: timeout, Transport: tr} +} + +// probeURL issues a GET and returns status code + elapsed time. +func probeURL(c *http.Client, rawurl string) (int, time.Duration, error) { + start := time.Now() + resp, err := c.Get(rawurl) + dur := time.Since(start) + if err != nil { + return 0, dur, err + } + resp.Body.Close() + return resp.StatusCode, dur, nil +} + +// lbGetBody GETs https://? through the internal LB and returns the body. +func lbGetBody(host, path string, q url.Values) ([]byte, error) { + u := "https://" + host + path + if len(q) > 0 { + u += "?" + q.Encode() + } + resp, err := clientDialingIP(internalLBIP, 20*time.Second).Get(u) + if err != nil { + return nil, err + } + defer resp.Body.Close() + body, _ := io.ReadAll(resp.Body) + if resp.StatusCode >= 300 { + return nil, fmt.Errorf("%s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body))) + } + return body, nil +} + +// dig runs `dig +short` against a resolver, optionally for a record type. 
+func dig(name, server, rrtype string) (string, error) { + args := []string{"+short", "+time=3", "+tries=1"} + if rrtype != "" { + args = append(args, rrtype) + } + args = append(args, name, "@"+server) + out, err := exec.Command("dig", args...).Output() + return strings.TrimSpace(string(out)), err +} diff --git a/cli/probe_test.go b/cli/probe_test.go new file mode 100644 index 00000000..bec4d132 --- /dev/null +++ b/cli/probe_test.go @@ -0,0 +1,49 @@ +package main + +import "testing" + +func TestQueryArg(t *testing.T) { + if got := queryArg([]string{"up"}, nil); got != "up" { + t.Errorf(`queryArg(["up"]) = %q, want "up"`, got) + } + if got := queryArg([]string{"up", "--json"}, nil); got != "up" { + t.Errorf(`--json should be dropped, got %q`, got) + } + // single quoted PromQL arrives as one token + if got := queryArg([]string{"count by (node) (up)", "--json"}, nil); got != "count by (node) (up)" { + t.Errorf(`quoted query mangled: %q`, got) + } + // value-flags and their values are skipped, query survives + vf := map[string]bool{"--since": true, "--limit": true} + if got := queryArg([]string{`{app="x"}`, "--since", "1h", "--limit", "50"}, vf); got != `{app="x"}` { + t.Errorf(`value-flag skipping failed: %q`, got) + } +} + +func TestLabelStr(t *testing.T) { + got := labelStr(map[string]string{"__name__": "up", "job": "x", "instance": "y"}) + if got != "up{instance=y,job=x}" { // __name__ extracted, rest sorted + t.Errorf("labelStr = %q", got) + } + if got := labelStr(map[string]string{"alertname": "Foo"}); got != "{alertname=Foo}" { + t.Errorf("labelStr (no __name__) = %q", got) + } +} + +func TestOneLineList(t *testing.T) { + if got := oneLineList(" "); got != "(none)" { + t.Errorf("empty = %q, want (none)", got) + } + if got := oneLineList("a\nb"); got != "a, b" { + t.Errorf("multi = %q, want 'a, b'", got) + } +} + +func TestHostOnly(t *testing.T) { + if got := hostOnly("foo.me/path"); got != "foo.me" { + t.Errorf("hostOnly = %q", got) + } + if got := 
hostOnly("foo.me"); got != "foo.me" { + t.Errorf("hostOnly = %q", got) + } +} diff --git a/cli/repo.go b/cli/repo.go new file mode 100644 index 00000000..3e0dc4f1 --- /dev/null +++ b/cli/repo.go @@ -0,0 +1,101 @@ +package main + +import ( + "os" + "os/exec" + "os/user" + "path/filepath" + "strings" +) + +// preferRemote picks the canonical remote: forgejo if present, else origin, +// else the first listed. (For infra, origin and forgejo both point at Forgejo.) +func preferRemote(remotes []string) string { + has := map[string]bool{} + for _, r := range remotes { + has[r] = true + } + switch { + case has["forgejo"]: + return "forgejo" + case has["origin"]: + return "origin" + case len(remotes) > 0: + return remotes[0] + default: + return "" + } +} + +// hasGitCryptAttr reports whether .gitattributes content enables git-crypt. +func hasGitCryptAttr(gitattributes string) bool { + return strings.Contains(gitattributes, "filter=git-crypt") +} + +// gitCryptFlags are the per-command flags that disable smudge/clean so git +// operations in a git-crypt repo don't try to decrypt (NEVER persisted to config). +func gitCryptFlags() []string { + return []string{ + "-c", "filter.git-crypt.smudge=cat", + "-c", "filter.git-crypt.clean=cat", + "-c", "filter.git-crypt.required=false", + } +} + +// gitOutput runs `git -C dir ` and returns trimmed stdout. +func gitOutput(dir string, args ...string) (string, error) { + cmd := exec.Command("git", append([]string{"-C", dir}, args...)...) + out, err := cmd.Output() + return strings.TrimSpace(string(out)), err +} + +func gitRepoRoot(dir string) (string, error) { + return gitOutput(dir, "rev-parse", "--show-toplevel") +} + +// gitRemotes lists configured remote names for the repo at dir. 
+func gitRemotes(dir string) ([]string, error) { + out, err := gitOutput(dir, "remote") + if err != nil { + return nil, err + } + if out == "" { + return nil, nil + } + return strings.Split(out, "\n"), nil +} + +// isGitCryptRepo reports whether the repo at repoRoot uses git-crypt. +func isGitCryptRepo(repoRoot string) bool { + b, err := os.ReadFile(filepath.Join(repoRoot, ".gitattributes")) + if err != nil { + return false + } + return hasGitCryptAttr(string(b)) +} + +// cryptFlagsFor returns the git-crypt filter flags when repoRoot is encrypted, +// else nil. These are injected per-command and never persisted. +func cryptFlagsFor(repoRoot string) []string { + if isGitCryptRepo(repoRoot) { + return gitCryptFlags() + } + return nil +} + +// gitStream runs `git [cryptFlags] -C repoRoot ` with live output. +func gitStream(repoRoot string, cryptFlags []string, args ...string) error { + full := append(append([]string{}, cryptFlags...), append([]string{"-C", repoRoot}, args...)...) + return runStreamingIn("", "git", full...) +} + +// currentUser returns the OS username for branch naming (/). 
+func currentUser() string { + if u := os.Getenv("USER"); u != "" { + return u + } + if u, err := user.Current(); err == nil && u.Username != "" { + return u.Username + } + return "user" +} diff --git a/cli/repo_test.go b/cli/repo_test.go new file mode 100644 index 00000000..76cf21a7 --- /dev/null +++ b/cli/repo_test.go @@ -0,0 +1,37 @@ +package main + +import "testing" + +func TestPreferRemote(t *testing.T) { + cases := []struct { + in []string + want string + }{ + {[]string{"origin", "forgejo"}, "forgejo"}, + {[]string{"forgejo"}, "forgejo"}, + {[]string{"origin"}, "origin"}, + {[]string{"upstream"}, "upstream"}, + {nil, ""}, + } + for _, c := range cases { + if got := preferRemote(c.in); got != c.want { + t.Errorf("preferRemote(%v) = %q, want %q", c.in, got, c.want) + } + } +} + +func TestHasGitCryptAttr(t *testing.T) { + if !hasGitCryptAttr("*.tfvars filter=git-crypt diff=git-crypt") { + t.Error("expected git-crypt detected") + } + if hasGitCryptAttr("*.md text\n*.png binary") { + t.Error("expected no git-crypt") + } +} + +func TestGitCryptFlagsShape(t *testing.T) { + f := gitCryptFlags() + if len(f) != 6 || f[0] != "-c" || f[1] != "filter.git-crypt.smudge=cat" { + t.Fatalf("unexpected git-crypt flags: %v", f) + } +} diff --git a/cli/run.go b/cli/run.go new file mode 100644 index 00000000..22e7f17a --- /dev/null +++ b/cli/run.go @@ -0,0 +1,23 @@ +package main + +import ( + "os" + "os/exec" +) + +// runStreaming executes name with args, wiring std streams to this process so +// the caller sees live output, and returns the command's error (non-nil on +// non-zero exit — preserved so homelab's own exit code reflects the child's). +func runStreaming(name string, args ...string) error { + return runStreamingIn("", name, args...) +} + +// runStreamingIn is runStreaming with a working directory (empty = inherit). +func runStreamingIn(dir, name string, args ...string) error { + cmd := exec.Command(name, args...) 
+ cmd.Dir = dir + cmd.Stdout = os.Stdout + cmd.Stderr = os.Stderr + cmd.Stdin = os.Stdin + return cmd.Run() +} diff --git a/cli/stack.go b/cli/stack.go new file mode 100644 index 00000000..1cfdd8d0 --- /dev/null +++ b/cli/stack.go @@ -0,0 +1,54 @@ +package main + +import ( + "fmt" + "os" + "path/filepath" + "sort" + "strings" +) + +// findInfraRoot walks up from start to the infra repo root — the directory +// holding both terragrunt.hcl and a stacks/ directory. +func findInfraRoot(start string) (string, error) { + dir := start + for { + if isFile(filepath.Join(dir, "terragrunt.hcl")) && isDir(filepath.Join(dir, "stacks")) { + return dir, nil + } + parent := filepath.Dir(dir) + if parent == dir { + return "", fmt.Errorf("not inside an infra checkout (no terragrunt.hcl + stacks/ found above %s)", start) + } + dir = parent + } +} + +// resolveStack maps a bare stack name to its directory under /stacks. +func resolveStack(infraRoot, name string) (string, error) { + dir := filepath.Join(infraRoot, "stacks", name) + if isDir(dir) { + return dir, nil + } + avail := listStacks(infraRoot) + return "", fmt.Errorf("stack %q not found under stacks/; available: %s", name, strings.Join(avail, ", ")) +} + +// listStacks returns the sorted names of every directory under /stacks. 
+func listStacks(infraRoot string) []string { + entries, err := os.ReadDir(filepath.Join(infraRoot, "stacks")) + if err != nil { + return nil + } + var out []string + for _, e := range entries { + if e.IsDir() { + out = append(out, e.Name()) + } + } + sort.Strings(out) + return out +} + +func isFile(p string) bool { fi, err := os.Stat(p); return err == nil && !fi.IsDir() } +func isDir(p string) bool { fi, err := os.Stat(p); return err == nil && fi.IsDir() } diff --git a/cli/stack_test.go b/cli/stack_test.go new file mode 100644 index 00000000..2967dc18 --- /dev/null +++ b/cli/stack_test.go @@ -0,0 +1,52 @@ +package main + +import ( + "os" + "path/filepath" + "testing" +) + +func newInfraTree(t *testing.T, stacks ...string) string { + t.Helper() + root := t.TempDir() + if err := os.WriteFile(filepath.Join(root, "terragrunt.hcl"), []byte("# root"), 0o644); err != nil { + t.Fatal(err) + } + for _, s := range stacks { + if err := os.MkdirAll(filepath.Join(root, "stacks", s), 0o755); err != nil { + t.Fatal(err) + } + } + return root +} + +func TestFindInfraRootWalksUp(t *testing.T) { + root := newInfraTree(t, "vault") + got, err := findInfraRoot(filepath.Join(root, "stacks", "vault")) + if err != nil { + t.Fatalf("findInfraRoot error: %v", err) + } + if got != root { + t.Fatalf("findInfraRoot = %q, want %q", got, root) + } +} + +func TestFindInfraRootErrorsOutsideInfra(t *testing.T) { + if _, err := findInfraRoot(t.TempDir()); err == nil { + t.Fatal("expected error outside an infra checkout") + } +} + +func TestResolveStack(t *testing.T) { + root := newInfraTree(t, "vault", "monitoring") + dir, err := resolveStack(root, "vault") + if err != nil { + t.Fatalf("resolveStack error: %v", err) + } + if want := filepath.Join(root, "stacks", "vault"); dir != want { + t.Fatalf("resolveStack = %q, want %q", dir, want) + } + if _, err := resolveStack(root, "nonesuch"); err == nil { + t.Fatal("expected error for unknown stack") + } +} diff --git a/cli/telemetry.go 
b/cli/telemetry.go new file mode 100644 index 00000000..b0bb625a --- /dev/null +++ b/cli/telemetry.go @@ -0,0 +1,62 @@ +package main + +import ( + "bytes" + "encoding/json" + "net/http" + "os" + "strconv" + "strings" + "time" +) + +// usageJob is the Loki stream job label for homelab usage telemetry. +const usageJob = "homelab-usage" + +// emitUsage best-effort records one verb invocation to Loki for cross-user +// usage analytics. Labels are low-cardinality (job/user/verb); the line carries +// only exit code + CLI version. NEVER args, paths, flags, or secrets. It must +// never affect the command: all errors are swallowed and a tight timeout bounds +// the cost. Opt out with HOMELAB_TELEMETRY=0. +func emitUsage(verb string, runErr error) { + switch os.Getenv("HOMELAB_TELEMETRY") { + case "0", "off", "false", "no": + return + } + if verb == "" || strings.HasPrefix(verb, "usage") { + return // don't self-record the analytics reader + } + exit := 0 + if runErr != nil { + exit = 1 + } + body, err := json.Marshal(lokiPush{Streams: []lokiStream{{ + Stream: map[string]string{"job": usageJob, "user": currentUser(), "verb": verb}, + Values: [][2]string{{ + strconv.FormatInt(time.Now().UnixNano(), 10), + "exit=" + strconv.Itoa(exit) + " ver=" + version, + }}, + }}}) + if err != nil { + return + } + req, err := http.NewRequest("POST", "https://"+lokiHost+"/loki/api/v1/push", bytes.NewReader(body)) + if err != nil { + return + } + req.Header.Set("Content-Type", "application/json") + resp, err := clientDialingIP(internalLBIP, 800*time.Millisecond).Do(req) + if err != nil { + return + } + resp.Body.Close() +} + +type lokiPush struct { + Streams []lokiStream `json:"streams"` +} + +type lokiStream struct { + Stream map[string]string `json:"stream"` + Values [][2]string `json:"values"` +} diff --git a/cli/update_viktorbarzin_me.go b/cli/update_viktorbarzin_me.go index 1a693a25..c2c1d3f4 100644 --- a/cli/update_viktorbarzin_me.go +++ b/cli/update_viktorbarzin_me.go @@ -103,6 
+103,6 @@ func notifyForIPChange(oldIP, newIP net.IP) error { if err != nil { return errors.Wrapf(err, "Error reading response") } - glog.Infof("Response:", string(responseBody)) + glog.Infof("Response: %s", string(responseBody)) return nil } diff --git a/cli/usage_test.go b/cli/usage_test.go new file mode 100644 index 00000000..052e080c --- /dev/null +++ b/cli/usage_test.go @@ -0,0 +1,18 @@ +package main + +import ( + "strings" + "testing" +) + +func TestUsageQuery(t *testing.T) { + got := usageQuery("30d", "") + want := `sum by (verb) (count_over_time({job="homelab-usage"}[30d]))` + if got != want { + t.Errorf("usageQuery(30d,\"\") = %q, want %q", got, want) + } + withUser := usageQuery("7d", "emo") + if !strings.Contains(withUser, `user="emo"`) || !strings.Contains(withUser, "[7d]") { + t.Errorf("usageQuery with user missing filter/range: %q", withUser) + } +} diff --git a/cli/woodpecker.go b/cli/woodpecker.go new file mode 100644 index 00000000..b3a48c20 --- /dev/null +++ b/cli/woodpecker.go @@ -0,0 +1,191 @@ +package main + +import ( + "context" + "encoding/json" + "fmt" + "io" + "net" + "net/http" + "os" + "os/exec" + "strings" + "time" +) + +// Woodpecker is reached at ci.viktorbarzin.me but routed via the internal Traefik +// LB (mirrors the proven `curl --resolve ci.viktorbarzin.me:443:10.0.20.203`): +// we dial the LB IP while keeping SNI/Host = the hostname so the cert verifies. +const ( + wpHost = "ci.viktorbarzin.me" + wpLBIP = "10.0.20.203" +) + +type wpClient struct { + base string + token string + http *http.Client +} + +// wpToken reads WOODPECKER_TOKEN, else the canonical Vault path. 
+func wpToken() string { + if t := firstEnv("WOODPECKER_TOKEN", "WP_TOKEN"); t != "" { + return t + } + out, err := exec.Command("vault", "kv", "get", "-field=woodpecker_api_token", "secret/ci/global").Output() + if err != nil { + return "" + } + return strings.TrimSpace(string(out)) +} + +func newWPClient() (*wpClient, error) { + tok := wpToken() + if tok == "" { + return nil, fmt.Errorf("no woodpecker token — set WOODPECKER_TOKEN or `vault login` (reads secret/ci/global)") + } + ip := firstEnv("HOMELAB_WP_IP") + if ip == "" { + ip = wpLBIP + } + dialer := &net.Dialer{Timeout: 8 * time.Second} + tr := &http.Transport{ + DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) { + if strings.HasPrefix(addr, wpHost+":") { + addr = ip + addr[strings.LastIndex(addr, ":"):] + } + return dialer.DialContext(ctx, network, addr) + }, + } + return &wpClient{base: "https://" + wpHost, token: tok, http: &http.Client{Timeout: 20 * time.Second, Transport: tr}}, nil +} + +// getJSON GETs path into v, retrying the transient empty/5xx responses the +// Woodpecker API intermittently returns under load. 
+func (c *wpClient) getJSON(path string, v interface{}) error { + var lastErr error + for attempt := 0; attempt < 5; attempt++ { + if attempt > 0 { + time.Sleep(2 * time.Second) + } + req, _ := http.NewRequest("GET", c.base+path, nil) + req.Header.Set("Authorization", "Bearer "+c.token) + resp, err := c.http.Do(req) + if err != nil { + lastErr = err + continue + } + body, _ := io.ReadAll(resp.Body) + resp.Body.Close() + if resp.StatusCode >= 500 || len(strings.TrimSpace(string(body))) == 0 { + lastErr = fmt.Errorf("woodpecker GET %s -> %d (empty/5xx, retrying)", path, resp.StatusCode) + continue + } + if resp.StatusCode >= 300 { + return fmt.Errorf("woodpecker GET %s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body))) + } + return json.Unmarshal(body, v) + } + return lastErr +} + +type wpPipeline struct { + Number int `json:"number"` + Status string `json:"status"` + Event string `json:"event"` + Commit string `json:"commit"` + Message string `json:"message"` +} + +func (c *wpClient) recentPipelines(repoID, n int) ([]wpPipeline, error) { + var ps []wpPipeline + err := c.getJSON(fmt.Sprintf("/api/repos/%d/pipelines?per_page=%d", repoID, n), &ps) + return ps, err +} + +// findPipeline returns the pipeline for commit (prefix match), or the latest when +// commit is empty. 
+func (c *wpClient) findPipeline(repoID int, commit string) (wpPipeline, error) { + ps, err := c.recentPipelines(repoID, 25) + if err != nil { + return wpPipeline{}, err + } + if len(ps) == 0 { + return wpPipeline{}, fmt.Errorf("no pipelines for repo %d", repoID) + } + if commit == "" { + return ps[0], nil + } + for _, p := range ps { + if strings.HasPrefix(p.Commit, commit) { + return p, nil + } + } + return wpPipeline{}, fmt.Errorf("no pipeline for commit %s in the last %d", commit[:min(8, len(commit))], len(ps)) +} + +func (c *wpClient) repoID() (int, error) { + owner, repo, err := repoOwnerName() + if err != nil { + return 0, err + } + var r struct { + ID int `json:"id"` + } + if err := c.getJSON("/api/repos/lookup/"+owner+"/"+repo, &r); err != nil { + return 0, err + } + if r.ID == 0 { + return 0, fmt.Errorf("repo %s/%s not registered in woodpecker", owner, repo) + } + return r.ID, nil +} + +// repoOwnerName derives / from the cwd git remote. +func repoOwnerName() (string, string, error) { + cwd, _ := os.Getwd() + root, err := gitRepoRoot(cwd) + if err != nil { + return "", "", fmt.Errorf("not in a git repository: %w", err) + } + remote := preferRemote(remotesOrEmpty(root)) + url, err := gitOutput(root, "remote", "get-url", remote) + if err != nil { + return "", "", err + } + return parseOwnerRepo(url) +} + +// parseOwnerRepo extracts owner/repo from an https or ssh git remote URL. 
+func parseOwnerRepo(url string) (string, string, error) { + u := strings.TrimSuffix(strings.TrimSpace(url), ".git") + u = strings.TrimSuffix(u, "/") + if i := strings.Index(u, "://"); i >= 0 { + u = u[i+3:] + } + u = strings.ReplaceAll(u, ":", "/") // git@host:owner/repo -> git@host/owner/repo + parts := strings.Split(u, "/") + if len(parts) < 2 || parts[len(parts)-1] == "" || parts[len(parts)-2] == "" { + return "", "", fmt.Errorf("cannot parse owner/repo from remote %q", url) + } + return parts[len(parts)-2], parts[len(parts)-1], nil +} + +func isTerminalStatus(s string) bool { + switch s { + case "success", "failure", "error", "killed", "declined", "blocked": + return true + } + return false +} + +func isFailureStatus(s string) bool { + return s == "failure" || s == "error" || s == "killed" || s == "declined" +} + +func min(a, b int) int { + if a < b { + return a + } + return b +} diff --git a/cli/woodpecker_test.go b/cli/woodpecker_test.go new file mode 100644 index 00000000..72c73c69 --- /dev/null +++ b/cli/woodpecker_test.go @@ -0,0 +1,40 @@ +package main + +import "testing" + +func TestParseOwnerRepo(t *testing.T) { + cases := []struct{ in, owner, repo string }{ + {"https://forgejo.viktorbarzin.me/viktor/infra.git", "viktor", "infra"}, + {"https://forgejo.viktorbarzin.me/viktor/infra", "viktor", "infra"}, + {"git@github.com:ViktorBarzin/infra.git", "ViktorBarzin", "infra"}, + {"https://github.com/ViktorBarzin/tripit/", "ViktorBarzin", "tripit"}, + } + for _, c := range cases { + o, r, err := parseOwnerRepo(c.in) + if err != nil || o != c.owner || r != c.repo { + t.Errorf("parseOwnerRepo(%q) = (%q, %q, %v), want (%q, %q)", c.in, o, r, err, c.owner, c.repo) + } + } + if _, _, err := parseOwnerRepo("nonsense"); err == nil { + t.Error("expected error for unparseable remote") + } +} + +func TestStatusClassification(t *testing.T) { + for _, s := range []string{"success", "failure", "error", "killed"} { + if !isTerminalStatus(s) { + t.Errorf("%q should be 
terminal", s) + } + } + for _, s := range []string{"running", "pending"} { + if isTerminalStatus(s) { + t.Errorf("%q should not be terminal", s) + } + } + if !isFailureStatus("failure") || !isFailureStatus("error") { + t.Error("failure/error should classify as failure") + } + if isFailureStatus("success") { + t.Error("success must not classify as failure") + } +} diff --git a/docs/adr/0004-homelab-unified-cli.md b/docs/adr/0004-homelab-unified-cli.md new file mode 100644 index 00000000..27cce02a --- /dev/null +++ b/docs/adr/0004-homelab-unified-cli.md @@ -0,0 +1,30 @@ +# homelab: a unified infra-ops CLI grown in place from infra/cli + +Agents re-derive the same operational command boilerplate every session — mining +51,116 bash commands across 2,225 past sessions showed dense, repeated patterns +(the infra inner-loop alone is ~29%). We are building `homelab`, one CLI encoding +the deterministic, repeated **actions** (not judgment) agents run — composable in +bash, JSON-capable, and discovered progressively via `homelab manifest`. It is +grown **in place** in `cli/` (the existing `infra-cli`), absorbing new verb-groups +alongside the preserved legacy webhook use-cases. Versioned with a `cli/VERSION` +file (the infra repo deploys continuously and does not cut semver tags). + +## Considered options + +- **Its own top-level repo** (the original plan) — rejected in favour of keeping + it where the Terraform/Terragrunt and `scripts/tg` it drives already live; the + Go source isn't git-crypt-encrypted and a provision-time build is unaffected by + GitOps continuous-deploy. +- **A fresh CLI ignoring infra-cli** — rejected: strands the VPN/DNS/email + webhook use-cases. +- **Raw kubectl/tg/ssh + skills + MCP only** — kept for everything outside the + recurring action surface (methodology skills; third-party/owned MCP such as + phpIPAM, which homelab does NOT duplicate). 
+ +## Consequences + +- The binary is dual-purpose: the agent-facing `homelab` verb surface AND the + in-cluster `infra-cli` webhook image. `main()` front-dispatches homelab verbs + and falls through to the legacy `-use-case` path verbatim. +- Distribution: built from source to `/usr/local/bin/homelab` during devvm + provisioning (`t3-dispatch` precedent), refreshed by `t3-autoupdate`. diff --git a/docs/adr/0005-homelab-v01-scope.md b/docs/adr/0005-homelab-v01-scope.md new file mode 100644 index 00000000..c1da7a95 --- /dev/null +++ b/docs/adr/0005-homelab-v01-scope.md @@ -0,0 +1,23 @@ +# homelab v0.1 scope: the infra inner-loop; everything allowed, tiers recorded + +v0.1 ships only the highest-volume surface — the infra inner-loop: `work` +(worktree lifecycle), `tf` (terragrunt via `scripts/tg` + fmt/validate/ +force-unlock), and `claim`/`release` (presence) — because it is ~29% of all mined +commands and where agents lose the most time and leak the most presence claims. + +v0.1 enforces **no** homelab-level permission gating: everything is allowed, +relying on existing gates (harness permission mode, presence claims, plan +approval). But every verb records a `read|write` tier (visible in `manifest`), so +a PreToolUse classifier hook (auto-allow reads / prompt writes) can be added +later with zero restructuring. + +## Considered options + +- **Reads-first vertical slice** (top read verb per domain) — lower risk, broad + value, but defers the toil that motivated the project. +- **One domain deep (k8s)** — cleanest template, narrow day-one value. + +We chose the highest-volume-but-write-heavy infra loop deliberately, accepting +the extra complexity (worktree lifecycle, git-crypt flag injection, presence +coupling, branch-protection PR fallback) for the biggest immediate toil +reduction. k8s/node/secret/net/ci verb-groups are deferred to later versions. 
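Reviewer note on ADR-0005: the "every verb records a `read|write` tier" idea can be sketched as below. This is a minimal illustration under stated assumptions, not the CLI's actual code. The `Tier` and `Decision` types, the `verbTiers` table and `classify` are hypothetical names. The tier values are taken from ADR-0008 and ADR-0012 in this same change. Treating an unknown verb as "prompt" is an assumption of the sketch.

```go
package main

import "fmt"

// Tier is the read|write classification a verb records (hypothetical type).
type Tier string

const (
	TierRead  Tier = "read"
	TierWrite Tier = "write"
)

// Decision is what a hypothetical PreToolUse classifier hook would return.
type Decision string

const (
	Allow  Decision = "allow"
	Prompt Decision = "prompt"
)

// verbTiers is an illustrative table. The tiers shown are the ones the ADRs
// state for these verbs; the real manifest layout is not known here.
var verbTiers = map[string]Tier{
	"memory recall": TierRead,
	"memory store":  TierWrite,
	"ha token":      TierRead,
	"ha ssh":        TierWrite,
}

// classify auto-allows read-tier verbs and prompts for everything else.
// Unknown verbs fall through to Prompt (an assumption: fail closed).
func classify(verb string) Decision {
	if verbTiers[verb] == TierRead {
		return Allow
	}
	return Prompt
}

func main() {
	for _, v := range []string{"memory recall", "memory store", "ha token", "ha ssh", "unknown verb"} {
		fmt.Printf("%-14s -> %s\n", v, classify(v))
	}
}
```

Because the tier is plain data attached to each verb, a hook like this can be added later without restructuring the verbs themselves, which is the property the ADR is after.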
diff --git a/docs/adr/0006-homelab-work-and-tf.md b/docs/adr/0006-homelab-work-and-tf.md new file mode 100644 index 00000000..fcdddc30 --- /dev/null +++ b/docs/adr/0006-homelab-work-and-tf.md @@ -0,0 +1,29 @@ +# homelab work/tf behaviour: native worktree entry, gated auto-land, presence-coupled apply + +Four behaviours of the infra-loop verbs are surprising enough to record: + +1. **`work` owns worktree create/land/clean, but session *entry* delegates to the + native harness worktree tool.** A CLI is a child process and cannot change the + agent's working directory; `EnterWorktree` can. So `homelab work start ` + creates the worktree + branch off `/master` (git-crypt-aware) and + prints the path — the agent enters it with native `EnterWorktree({path})`. + +2. **`work land` is auto-land, but gated on verification.** It merges master in → + runs verification → pushes `HEAD:master` (fetch+merge+retry on + non-fast-forward) → falls back to pushing the feature branch for a PR when the + direct push is rejected (branch protection). It **refuses to push when it + cannot verify** (no `--verify-cmd` and no auto-detected suite) unless + `--no-verify` is passed — added after an accidental smoke-test land pushed + unverified WIP to master (benign: the infra CI applied 0 stacks because the + diff was `cli/`-only, but an unverified land must be deliberate, not default). + +3. **`tf apply` is first-class despite GitOps, and mandatorily presence-coupled.** + Local applies are out-of-band (CI applies canonically on push) but happen + constantly (~763× in the corpus). `tf apply ` auto-claims `stack:`, + delegates to `scripts/tg apply --non-interactive`, and **always releases on + exit** (normal, error, or signal via `sync.Once` + handler) — fixing the + documented ~200-claim leak — and prints an out-of-band reminder. + +4. **Known v0.1 limitation:** `work land` does not yet block until CI is green; that + arrives with the ci/deploy watch verb-group. 
It prints a reminder to follow + the pipeline manually. diff --git a/docs/adr/0007-homelab-k8s-verbs.md b/docs/adr/0007-homelab-k8s-verbs.md new file mode 100644 index 00000000..422b3431 --- /dev/null +++ b/docs/adr/0007-homelab-k8s-verbs.md @@ -0,0 +1,30 @@ +# homelab k8s verb-group: app→pod resolver, read/write split, config-mutation stays raw + +v0.2 adds the Kubernetes verb-group — the biggest remaining surface by far +(mining the post-v0.1 corpus: 11,291 `kubectl` commands across 243 sessions, more +than every other domain combined). + +It is built on an **app→namespace→pod resolver**: most namespaces hold exactly +one app, so `` defaults to the namespace, and the target defaults to +`deploy/` (kubectl resolves a pod from the Deployment). `-n`/`--pod`/`-c`/ +`-l`/`--tty` override; multi-pod namespaces (`dbaas`, `monitoring`) need +specificity. The CLI uses the ambient kubeconfig — no per-call auth flags. + +Verbs: read — `status`, `get`, `logs`, `describe`, `debug` (one-shot triage), +`pf`, `rollout-status`; write/operational — `db`, `exec`, `restart`, `rm-pod`. + +## Decisions worth recording + +- **Config-mutation verbs are deliberately NOT exposed** (`apply`/`edit`/`patch`/ + `scale`/`create`). They stay raw `kubectl`, by design, per the repo's + Terraform-only policy — the corpus confirms they're low-frequency, and a + friendly verb would normalise a policy violation. +- **`rm-pod` is restricted to pods/jobs only** — deleting Deployments/STS/PVCs is + config mutation and forbidden; the verb cannot target them. +- **`db` encodes the dbaas exec pattern** (the single highest-value k8s + sub-pattern, ~886 dbaas ops): PG via `pg-cluster-rw -c postgres`, + `psql -U postgres -d `; MySQL via `mysql-standalone-0` with a + `bash -c 'mysql -p"$MYSQL_ROOT_PASSWORD" …'` wrapper so the password comes from + the pod env and never appears on the command line. 
+- Read verbs were smoke-tested against the live cluster; write verbs are + unit-tested (resolver, db-plan, shell-quoting) but not fired at live state. diff --git a/docs/adr/0008-homelab-memory-verbs.md b/docs/adr/0008-homelab-memory-verbs.md new file mode 100644 index 00000000..60f13850 --- /dev/null +++ b/docs/adr/0008-homelab-memory-verbs.md @@ -0,0 +1,30 @@ +# homelab memory verb-group: direct HTTP client to claude-memory; MCP deprecation path + +v0.3 adds the memory verb-group so agents can search and navigate memory from the +CLI. `claude-memory` is a FastAPI service (Postgres-backed, `Bearer`-auth, +ingress `auth = "none"` so programmatic clients work) — the **MCP is just one +frontend over it**. `homelab memory` is a thin HTTP client over the same API, +using the env the hooks already set (`CLAUDE_MEMORY_API_URL` + +`CLAUDE_MEMORY_API_KEY`; defaults to the ingress). Because it talks to the HTTP +API directly, it **works even when the MCP frontend is down** — the recurring +MCP-disconnect problem that motivated claude-memory HA (and that took the MCP +offline for the entire session this was built in). + +Verbs: `recall` (server-side semantic ranking), `list`, `categories`, `tags`, +`stats`, `secret` (read); `store`, `update`, `delete` (write). Validated against +the live API including a store→recall→delete round-trip — full data-plane parity +with the MCP. + +## Deprecation path (deliberate follow-up — NOT done in v0.3) + +The MCP is more than tools: the **per-prompt auto-recall hook** and the +**auto-learn hook** run on every prompt for every agent. Deprecating it safely is +a separate, sequenced change: + +1. Rewire the auto-recall hook to `homelab memory recall` and the auto-learn hook + to `homelab memory store`. +2. Update the CLAUDE.md memory policy to point at the CLI. +3. Uninstall the MCP. + +Done CLI-first (verbs proven before touching the every-prompt path) so a +regression can't silently break auto-recall/auto-learn fleet-wide. 
diff --git a/docs/adr/0009-homelab-ci-deploy-verbs.md b/docs/adr/0009-homelab-ci-deploy-verbs.md new file mode 100644 index 00000000..51399997 --- /dev/null +++ b/docs/adr/0009-homelab-ci-deploy-verbs.md @@ -0,0 +1,29 @@ +# homelab ci/deploy verbs: API-based watch, internal-LB dialer, work-land integration + +v0.4 adds `ci`/`deploy` — the biggest *reasoning* sink in agent sessions (watching +a build/deploy to completion), proven during the session that built it (hours +spent hand-rolling Woodpecker API polling, DB-schema reverse-engineering, and +retrigger logic for a single CI incident). + +## Decisions + +- **API, not DB.** The verbs query the Woodpecker REST API (version-stable), + not its Postgres schema (which drifts across upgrades — column renames bit us + mid-incident). Reached via the internal Traefik LB by dialing `10.0.20.203` + while keeping SNI/Host = `ci.viktorbarzin.me` so the cert verifies (the Go + equivalent of the house `curl --resolve` pattern). Token from + `WOODPECKER_TOKEN` or Vault `secret/ci/global`; repo id resolved from the cwd + git remote via `/api/repos/lookup//`. +- **Retries are mandatory.** The Woodpecker API intermittently returns empty/5xx + under load (it flapped through the whole build session); `getJSON` retries + empty and 5xx responses (up to five attempts, 2 s apart) so `ci watch` is reliable exactly when it's needed. +- **`work land` now waits for CI.** After pushing, `work land` calls `ci watch` + on the landed commit and fails if the pipeline does — closing the gap ADR-0005 + deferred. `--no-ci-watch` opts out. +- **`deploy wait` encodes the "rollout status lies" rule:** it first waits for + the deployment image to reference the expected sha, *then* blocks on rollout + status (kubectl-based; reuses the k8s helpers). +- **`ci logs` deferred to v0.4.1.** Woodpecker's per-pipeline detail/log + endpoints were the least reliable this session (often empty); `status`/`watch` + rely on the list endpoint that works. 
A DB-backed `ci logs` is a possible + follow-up if the API path stays flaky. diff --git a/docs/adr/0010-homelab-net-obs-verbs.md b/docs/adr/0010-homelab-net-obs-verbs.md new file mode 100644 index 00000000..29a94a46 --- /dev/null +++ b/docs/adr/0010-homelab-net-obs-verbs.md @@ -0,0 +1,37 @@ +# homelab net/dns/metrics/logs verbs: endpoint resolution as the unit of value + +v0.5 adds `net`/`dns`/`metrics`/`logs`. These were chosen against an explicit +test the user posed mid-build: *does the verb save reasoning, or only typing?* A +wrapper over a command already known fluently (plain `ssh`, `vault kv get`) saves +keystrokes but not thought. These four save thought — the reasoning they encode +is **which endpoint, reached how, with what auth/URL shape** — re-derived every +time otherwise. (That same test deprioritized `node ssh` aliasing and `secret +get`, which are thin wrappers; see the session discussion.) + +## Decisions + +- **Internal ingresses, reached via the LB.** Everything routes through the + Traefik LB by dialing `10.0.20.203` with the URL host preserved as SNI — the + Go form of the house `curl --resolve host:443:10.0.20.203` pattern + (`probe.go: clientDialingIP`). Verified live before building: Prometheus + (`prometheus-query.viktorbarzin.lan`) and Loki (`loki.viktorbarzin.lan`) both + answer JSON over the LB with **no auth gate and no port-forward** — so these + stay clean HTTP clients, not kubectl wrappers. +- **`net check` is two-legged on purpose.** It resolves the host via public DNS + (→ Cloudflare) AND dials the internal LB, reporting both — because the useful + question is *where* a break is (CF edge vs the app vs the LB path), which a + single curl can't answer. The external leg forces public resolution (the devvm + resolver is split-horizon and would otherwise hit the LB for both). 
+- **`metrics alerts` uses the `ALERTS` series, not `/api/v1/alerts`.** + `prometheus-query.*` is a query-only frontend (404 on `/api/v1/alerts`), and + Alertmanager has no LB ingress (the alert-digest reads it in-cluster). Firing + alerts are exposed as the synthetic `ALERTS{alertstate="firing"}` time series, + queryable through the working endpoint — so no new dependency. +- **Deliberately NOT built:** in-cluster-only endpoints (Alertmanager v2, + raw `*.svc` services) that would force port-forward/`kubectl run`. The + reasoning-savings there don't beat the added moving parts; kept out of scope. +- **No `node`/`secret` group.** Same test: their high-volume parts are + command-wrappers (low savings); only compound node ops (serial console, VM + wait, fan-out) would qualify, and those are lower-frequency. Left unbuilt + unless a concrete pain surfaces — the high-value deterministic surface + (tf/work/ci/k8s/memory + these probes) is now covered. diff --git a/docs/adr/0011-homelab-usage-telemetry.md b/docs/adr/0011-homelab-usage-telemetry.md new file mode 100644 index 00000000..c383211b --- /dev/null +++ b/docs/adr/0011-homelab-usage-telemetry.md @@ -0,0 +1,34 @@ +# homelab usage telemetry: evidence-driven verb prioritization, privacy by construction + +v0.6 adds `usage top` plus a fire-and-forget emit on every dispatched verb. It +exists to answer the question that drove the whole CLI — *which verbs are worth +adding next* — with data instead of one maintainer's habits (the earlier mining +covered a single user's ~51k commands, so the surface is shaped to that user). + +## Decisions + +- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows + the verb path; after `Run` returns we emit `{verb, exit}`. Discovery verbs + don't go through `dispatch()` (`manifest`/`version`/`help` are handled in + `dispatchTop`), so they don't self-record; `usage *` is skipped explicitly so + the analytics reader doesn't pollute its own data. 
+- **Payload is deliberately minimal: verb path + exit code only.** Labels + `{job=homelab-usage, user, verb}` (all low-cardinality) + line `exit=N ver=X`. + **No args, paths, flags, hostnames, or secrets** ever leave the process — the + emit sees only the matched verb name, not the arguments. This is what makes + cross-user aggregation safe. +- **Shared Loki sink → cross-user analytics WITHOUT reading homes.** Each user's + CLI writes its own invocations (attributed to its OS user) to the shared Loki + push API via the Traefik LB (verified: HTTP 204, no auth). `usage top` reads + back with a LogQL metric query. This is the privacy-preserving resolution to + "what does everyone (e.g. another user) use" — it never touches anyone's + `~/.claude`, which the org per-user policy bars (see the per-user red-line in + managed-settings; reading another user's home is off-limits even for an owner + in-session — a fresh session under changed MDM policy is the only legitimate + path, and even then this telemetry is the better answer). +- **Best-effort, never affects the command.** All errors swallowed; an 800ms + client timeout bounds the cost; opt-out via `HOMELAB_TELEMETRY=0`. Telemetry + must never slow or break the tool it measures. +- **Loki, not a new datastore.** Zero new infra, and it dogfoods the v0.5 `logs` + path (same host, same LB dial). Presence MySQL was the alternative (queryable + SQL) but would add a write dependency and creds; Loki needs neither. diff --git a/docs/adr/0012-homelab-ha-verbs.md b/docs/adr/0012-homelab-ha-verbs.md new file mode 100644 index 00000000..379f8ee5 --- /dev/null +++ b/docs/adr/0012-homelab-ha-verbs.md @@ -0,0 +1,54 @@ +# homelab Home Assistant verbs: token resolution + host SSH, not entity control + +v0.7 adds `ha token` and `ha ssh`. 
They were chosen by mining a heavy HA +operator's sessions: across ~1,900 shell commands the single most-repeated line +(420×) was a hand-rolled `kubectl … | base64 -d | python -c '…token'` pipeline, +and a bespoke `ssh -o StrictHostKeyChecking=no -o …` invocation was redefined as +a shell function ~30× — both re-derived from scratch every session. The existing +`home-assistant-sofia.py` already covers the *API*, but it goes unused from an +arbitrary cwd (it needs `HOME_ASSISTANT_SOFIA_TOKEN` set and is referenced by a +cwd-relative path), so agents bypassed it. A global verb on `$PATH` closes that +gap for every user in every directory. + +## Decisions + +- **Only the two gaps the `ha` MCP can't fill.** The `ha` MCP server already + does entity state and control (`get_state`, `call_service`, history, logs). + Per the CLI's founding rule — *MCP-encoded actions are out of scope* (ADR-0004) + — we do **not** reimplement `on`/`off`/`list`/`state`. We add only token + *resolution* and host *SSH*, neither of which an API-only MCP can provide. The + value is endpoint/secret/host resolution, exactly like `net`/`dns` (ADR-0010). +- **`ha token` resolves live from the cluster, not from an env var.** It reads + the dedicated k8s Secret `openclaw/ha-tokens` (one key per instance: `sofia` / + `london`) via the ambient kubeconfig. This is robust to env drift — the precise + failure that made agents re-derive the pipeline. Read-tier, prints the bare + token to stdout so it composes in `$(…)`, mirroring `memory secret`. +- **The token is split into its own least-privilege secret** (`stacks/openclaw/ha_tokens.tf`). + It was originally read from `openclaw-secrets` → `skill_secrets` (a JSON blob + also holding `slack_webhook` + `uptime_kuma_password`), which only cluster + admins can read — so the verb hung/failed for the non-admin operator it was + built for (emo = `emil.barzin@gmail.com`, group `Home Server Admins`, whose + OIDC identity is barred from secrets in `openclaw`). 
`ha-tokens` carries only + the HA tokens, with a Role+RoleBinding granting `get` on *just that secret* to + the `Home Server Admins` group (k8s RBAC can't scope to a JSON sub-key, hence + the separate object). openclaw's own deployment keeps reading `openclaw-secrets` + — this is purely additive. +- **`ha ssh` is deterministic and per-user.** Flags are fixed for unattended + use: `-F /dev/null` (ignore user ssh-config), `StrictHostKeyChecking=no` + + `UserKnownHostsFile=/dev/null` (no host-key prompt/record — agents have no + TTY), `BatchMode=yes` + `ConnectTimeout=10` (fail fast, never hang). The key + is the **invoking user's** `~/.ssh/id_ed25519`, so the verb isn't tied to + whoever first wrote the workflow; that user's key must be enrolled on the HA + host. Write-tier (runs an arbitrary remote command). +- **sofia is the default; london is structural.** The devvm sits on the Sofia + LAN, so `vbarzin@192.168.1.8` is reachable and is the default instance. london + (`hassio@192.168.8.103`) is in the instance map so `ha token --instance london` + works (a pure secret read), but `ha ssh --instance london` generally won't + connect from here — london is remote. We model it correctly rather than + pretend it's reachable. +- **Scope held at two verbs.** `ha api` (an authenticated curl passthrough for + the endpoints the MCP/script don't cover — `/api/template`, `/reload`, + `check_config`, `/error_log`) was deferred: once `ha token` exists, raw curl is + already unblocked, and a generic passthrough overlaps the MCP. Re-measure via + `usage top` (ADR-0011); add targeted sugar verbs only if those endpoints are + still hand-rolled often. 
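Reviewer note on ADR-0012: the fixed `ha ssh` flag set can be sketched as an argv builder. The option values and the two instance targets are the ones the ADR lists. The helper name `haSSHArgs`, the `haHosts` map, and passing the key with `-i` are assumptions of this sketch, not the CLI's confirmed implementation.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// haHosts maps instance name to SSH target, as listed in ADR-0012.
var haHosts = map[string]string{
	"sofia":  "vbarzin@192.168.1.8",
	"london": "hassio@192.168.8.103",
}

// haSSHArgs builds the argument list for ssh (hypothetical helper).
// home is the invoking user's home directory; an empty instance means sofia.
func haSSHArgs(home, instance string, remoteCmd ...string) ([]string, error) {
	if instance == "" {
		instance = "sofia" // default instance per the ADR
	}
	target, ok := haHosts[instance]
	if !ok {
		return nil, fmt.Errorf("unknown HA instance %q", instance)
	}
	args := []string{
		"-F", "/dev/null", // ignore the user's ssh config
		"-o", "StrictHostKeyChecking=no", // no host-key prompt
		"-o", "UserKnownHostsFile=/dev/null", // do not record host keys
		"-o", "BatchMode=yes", // never ask for a password
		"-o", "ConnectTimeout=10", // fail fast instead of hanging
		"-i", filepath.Join(home, ".ssh", "id_ed25519"), // assumption: key passed via -i
		target,
	}
	return append(args, remoteCmd...), nil
}

func main() {
	args, err := haSSHArgs("/home/alice", "", "uptime")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("ssh " + strings.Join(args, " "))
}
```

Keeping the flags in one builder is what makes the invocation deterministic: the same argv is produced for every user, with only the key path and target varying.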
diff --git a/docs/adr/0013-homelab-browser-verbs.md b/docs/adr/0013-homelab-browser-verbs.md new file mode 100644 index 00000000..bba4e8e7 --- /dev/null +++ b/docs/adr/0013-homelab-browser-verbs.md @@ -0,0 +1,75 @@ +# homelab browser verbs: headful (anti-bot) web automation via cluster Chrome + +v0.8 adds `browser run`, `browser open`, and `browser --help`. They package a +capability that already existed but was undiscoverable: driving the cluster's +**headful** Chrome (`chrome-service` — real Chrome under Xvfb, CDP on +`svc/chrome-service:9222`) from the devvm, for sites that detect and block +headless automation. + +## Motivating incident (2026-06-22) + +Logging a washing-machine repair on the Stirling Ackroyd **Fixflo** tenant +portal: the headless `@playwright/mcp` browser loaded the site and filled the +entire multi-step form, but the **final submit silently failed** — Fixflo's +pre-submit `POST /IssuePreCreationCheck` returned `net::ERR_FILE_NOT_FOUND`, the +spinner hung, no issue was created. Root cause = headless-Chrome detection. The +fix was to drive the headful `chrome-service` over `connect_over_cdp` — it +submitted first try (Fixflo ref IS22657587). That capability was documented +(`docs/architecture/chrome-service.md`) but **not packaged or discoverable**, so +it took ~40 min, three redundant full form re-runs, and a user hint. The agent +also misread `ERR_FILE_NOT_FOUND` as "network egress" and retried blind instead +of inspecting the network panel. + +## Decisions + +- **Mechanics in `homelab`, not a `~/.claude` skill.** A standalone skill was + rejected: the CLI is run every session (so the verb is *discoverable*), is + versioned, multi-user, and test-covered. A private, untested skill is none of + those. The command owns only the deterministic *mechanics* (port-forward, + stealth injection, lifecycle) — the agent supplies the Playwright script, so + *judgment* stays out of the CLI (the founding rule, ADR-0004/0005). 
+- **The failure was judgment, not setup friction**, so the CLI is paired with a + one-line pointer in always-in-context `~/code/CLAUDE.md` and a diagnostic + payload in `browser --help`: the *when-to-use* signature (a site loads but a + gated action fails/hangs, or one request 500s/aborts while siblings 200 → + suspect headless detection) and an error-code cheat-sheet (`ERR_FILE_NOT_FOUND` + = request resolved/intercepted by the automation layer, **not** egress; + egress failures are `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED` + and would break the page load too). A command the agent doesn't think to run is + useless; the cheat-sheet is the actual fix for the misdiagnosis. +- **Reach the pod via `kubectl port-forward`, then `connect_over_cdp` to + localhost.** port-forward tunnels API-server→pod, so it **bypasses the `:9222` + NetworkPolicy** that gates in-cluster callers — the devvm needs no namespace + label. Readiness is asserted against `/json/version`: the endpoint must report + a real `Chrome/…`, never `HeadlessChrome` (the whole point). The forward is + **always** torn down (process-group kill + signal handler), on success and on + error — an acceptance requirement. +- **Default to a fresh incognito context; `--shared-context` opts into the warmed + profile.** chrome-service is a single shared browser with a persistent profile. + A fresh, always-closed context is safe for concurrent callers (tripit's fare + scrape connects per-quote) and is what production already does. The warmed + persistent profile (cookies from a manual noVNC login) is opt-in for flows that + need a pre-logged-in session. +- **Pin the node CDP client to `playwright-core@1.48.2`** to match the + chrome-service image minor (`mcr.microsoft.com/playwright:v1.48.0-noble`, + Chromium 130). `connect_over_cdp` speaks the browser's CDP, and protocol + changes between Playwright minors — the devvm's ambient Python Playwright was + 1.58, a 10-minor skew. 
The pin makes behaviour deterministic across the fleet + regardless of local drift. `playwright-core` (not `playwright`) because no + browser binary is needed — we connect to the remote one. +- **Self-provision the client lazily, no per-user setup.** The pinned client is + installed once into `~/.cache/homelab/browser-client/` (idempotent, version- + guarded) on first use, alongside the embedded runner + stealth files. node is + already fleet-wide; this avoids coupling the feature to a provisioner change + and keeps it self-contained and self-healing. The client runs on the devvm, so + `setInputFiles` streams local files to the remote browser over CDP — no + `chmod`/staging-dir workaround on the CDP path. +- **Vendor `stealth.js`, guard against drift.** The CLI embeds a byte-for-byte + copy of `stacks/chrome-service/files/stealth.js` (the source of truth the + in-cluster callers use) via `go:embed`; a unit test fails if the copy drifts. + `go:embed` can't reach outside the package dir, hence the vendored copy rather + than a path reference. +- **Scope held at two action verbs + help.** `run` (arbitrary script — the + workhorse) and `open` (navigate + title/text/screenshot — a quick check) cover + the surface. Both are write-tier; the bare `browser`/`--help` is read. Re-measure + via `usage top` (ADR-0011) before adding more. 
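
The error-code cheat-sheet in the second decision can be sketched as a small triage helper. This is an illustration only, not the text that `browser --help` prints: the function name `triage_net_error` and the returned labels are hypothetical, and the `_TIMED_OUT` shorthand above is assumed to cover both of Chrome's timeout codes.

```python
# Hypothetical triage helper mirroring the cheat-sheet in this ADR.
# Names and return labels are illustrative assumptions, not the CLI's code.

# Egress failures: these would normally break the page load as well.
EGRESS_ERRORS = {
    "ERR_CONNECTION_REFUSED",
    "ERR_TIMED_OUT",
    "ERR_CONNECTION_TIMED_OUT",
    "ERR_NAME_NOT_RESOLVED",
}


def triage_net_error(error_text: str, page_loaded: bool) -> str:
    """Classify a Chrome net:: error seen on a gated action (submit/login)."""
    code = error_text.strip()
    if code.startswith("net::"):
        code = code[len("net::"):]
    if code in EGRESS_ERRORS:
        return "egress"
    if code == "ERR_FILE_NOT_FOUND":
        # Resolved or intercepted by the automation layer, not the network.
        # If the page itself loaded, headless detection is the main suspect.
        if page_loaded:
            return "suspect-headless-detection"
        return "intercepted-not-egress"
    return "unknown"


print(triage_net_error("net::ERR_FILE_NOT_FOUND", page_loaded=True))
```

Under these assumptions, the Fixflo incident (page loaded, `net::ERR_FILE_NOT_FOUND` on the pre-submit request) lands in the headless-detection bucket rather than the egress bucket.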
diff --git a/docs/adr/0014-service-identity-and-east-west-observability.md b/docs/adr/0014-service-identity-and-east-west-observability.md new file mode 100644 index 00000000..5eb1c83a --- /dev/null +++ b/docs/adr/0014-service-identity-and-east-west-observability.md @@ -0,0 +1,29 @@ +--- +status: accepted +date: 2026-06-24 +--- + +# Service identity is namespace + label; east-west observability via Calico Goldmane; no service mesh + +As the Service count grows we want an audit-grade record of which Service talks to which — the "service mesh evaluation" `docs/plans/2026-04-20-infra-audit-design.md` flagged as never done ("worth a design doc even if the answer is no, too much complexity for the gain"). We evaluated the full design space against two constraints: the trust model is a single-tenant cluster needing **attribution-grade** forensics (reconstruct events in a cluster we trust), not cryptographic non-repudiation against a hostile pod; and we are acutely **etcd-constrained** (we removed VPA/Goldilocks for exactly this, and carry open beads `code-oflt`/`code-at4f` on etcd starvation). Decision: **service identity = the workload's namespace** (primary; Goldmane stamps it natively and "one Service ≈ one namespace" holds for ~87 of our namespaces), refined by an explicit `service-identity` label only in the few genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). **East-west observability = Calico 3.30 Goldmane + Whisker** (already in our Calico v3.30.7, currently `enabled = false` in `stacks/calico/main.tf`), with Goldmane's emitter shipping flows to **Loki** for a durable trail. **Enforcement reuses the existing Wave 1 observe-then-enforce egress track**, now selecting on namespace/label and fed by Goldmane's allow/deny + policy-trace flows. We explicitly **reject** a service mesh, mTLS, SPIFFE/SPIRE, and dedicated per-Service ServiceAccounts for now. 
+ +## Considered options + +- **Dedicated per-Service ServiceAccount as the identity primitive** — initially chosen, then reversed. 56% of pods (257/458) run as `default`, so it is a ~116-stack rollout; and Goldmane (the chosen flow source) carries pod/namespace/workload **labels but no ServiceAccount field**, so SAs would not even reach the audit trail without a custom pod→SA mapping. The cheaper, etcd-inert path (namespace is free; a handful of static labels) delivers the same attribution. Deferred until identity-aware NetworkPolicy needs a principal finer than namespace/label, or mTLS is adopted. +- **Service mesh (Istio / Linkerd / Cilium-mesh) + mTLS + SPIFFE/SPIRE** — the only thing that makes the trail cryptographically non-repudiable against a hostile pod. The trust model does not justify it, east-west stays single-tenant plaintext, and it is precisely the "too much complexity for the gain" the audit doc predicted. Rejected. +- **Microsoft Retina (CNI-agnostic eBPF)** — more capable (DNS, drops, Hubble UI) and GA, runs on Calico without a CNI change. But identity-rich mode writes **one `RetinaEndpoint` CRD per pod to etcd** (continuous, pod-proportional churn — the exact axis we guard), and it is metrics-first, not log-first (no per-flow Loki records without custom glue). Rejected for this use case; noted as the fallback if DNS/drop-level detail is ever needed. +- **Cilium Hubble** — reads Cilium's eBPF datapath maps; unusable on Calico without migrating the CNI. A CNI migration is not justified. Rejected. +- **Kiali** — builds its graph entirely from an Istio mesh's Prometheus telemetry; no mesh, no graph. Rejected. +- **Custom Grafana Alloy enrichment exporter over raw iptables-`LOG` flow lines** — Alloy has no IP→identity dictionary-lookup primitive (`loki.process` lacks a lookup stage; `k8sattributes` can't do per-line/dual-IP association), so this is a multi-day custom build that also has to beat pod-IP churn. 
Goldmane delivers identity-stamped flows natively and obviates it. Rejected. +- **Kyverno generate+mutate to provision/assign identity** — rejected on etcd grounds: background scans + PolicyReports + UpdateRequests are continuous writes, the VPA-class cost we shed. Identity stays static. + +## Consequences + +- **No etcd cost from the flow plane.** Goldmane streams flows from Felix (the existing `calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer — nothing written to etcd or the K8s API. Steady-state cost is two Deployments (`goldmane`, `whisker`) + RAM/CPU on the goldmane pod. +- **The ring buffer is not a trail.** Durable, queryable history depends on the emitter→Loki path (reuse the 90-day security-stream retention); on a Goldmane restart the in-memory window is lost. +- **Goldmane is tech-preview** in OSS Calico 3.30 — the main risk. Enabling it is a reversible toggle in `stacks/calico/main.tf`, but the toggle interacts with the operator-managed Installation CR (only namespaces are TF-adopted today; verify how Goldmane/Whisker are enabled before applying). +- **Attribution is namespace-grained for free** across ~87 single-Service namespaces. Multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`) need a `service-identity` label to disambiguate; most are platform/infra and already on the Wave 1 enforcement exclude list. +- **The trail is attribution-grade, not cryptographic.** It reliably reconstructs events in a trusted cluster but cannot prove identity against a pod that spoofs its source — an accepted limit of the trust model. This ADR does not change the east-west encryption posture (still plaintext, no mTLS). +- **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency. 
+- **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**. +- **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary. diff --git a/docs/architecture/authentication.md b/docs/architecture/authentication.md index 8de844de..9decc8dc 100644 --- a/docs/architecture/authentication.md +++ b/docs/architecture/authentication.md @@ -108,31 +108,6 @@ All new users must use an invitation link to register. The invitation-enrollment Group membership is auto-assigned from the invitation's `fixed_data` field. This prevents open registration while maintaining SSO convenience. -### TripIt External self-signup (open enrollment, fenced) - -Unlike every other app, **TripIt allows open public self-signup** for people -outside the homelab (ADR-0020 in the tripit repo; runbook -`docs/runbooks/tripit-external-signup.md`). A dedicated public `tripit-enrollment` -flow (email + passkey, no password) creates the account and stamps it into the -parentless **`TripIt External`** group. Containment is two-layered: - -- **Forward-auth apps**: a branch prepended to the `admin-services-restriction` - catch-all policy admits `TripIt External` to `tripit.viktorbarzin.me` only and - denies every other `auth="required"` host. -- **OIDC apps**: that branch does NOT cover OIDC (OIDC bypasses forward-auth). - External users are contained because every sensitive OIDC app already requires a - trusted group they do not hold — audited 2026-06-15: - Immich/Grafana/Linkwarden/Cloudflare Access → `Home Server Admins`, Forgejo → - `Task Submitters`/`Forgejo Users`, Headscale → `Headscale Users`, wrongmove → - `Wrongmove Users`. **Vault** was OPEN (any OIDC identity got a powerless - `default`-policy token) and is bound to **`Allow Login Users`** as part of this - change. 
The Kubernetes OIDC clients are OPEN but idle (apiserver rejects OIDC). - -**Invariants**: keep `TripIt External` parentless (never under `Allow Login -Users`); keep the catch-all branch first; never co-assign `TripIt External` to a -trusted/internal user; the `tripit-enrollment` user_write "Create users group" -setting is the keystone that tags every signup. - ### OIDC Applications Authentik provides OIDC for 10 applications: diff --git a/docs/architecture/automated-upgrades.md b/docs/architecture/automated-upgrades.md index 0b8837cb..c0200d84 100644 --- a/docs/architecture/automated-upgrades.md +++ b/docs/architecture/automated-upgrades.md @@ -319,7 +319,7 @@ each Job's pod and its drain target are always different nodes. - `K8sVersionSkew` — kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout. - `EtcdPreUpgradeSnapshotMissing` — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight failing silently. - `K8sUpgradeStalled` — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor. - - `K8sUpgradeChainJobFailed` — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured). + - `K8sUpgradeChainJobFailed` — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). 
Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured). The `unless k8s_upgrade_blocked == 1` clause (2026-06-21) excludes a deliberate compat-gate refusal (owned by `K8sUpgradeBlocked`) so a block doesn't double-fire as a wedge. - **Pushgateway metrics**: - `k8s_upgrade_in_flight` (set in preflight, cleared in postflight) - `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB) diff --git a/docs/architecture/chrome-service.md b/docs/architecture/chrome-service.md index 7b95d4a0..6f9c1ee4 100644 --- a/docs/architecture/chrome-service.md +++ b/docs/architecture/chrome-service.md @@ -112,17 +112,32 @@ External caller (dev box): @playwright/mcp --isolated --storage-state ~/.cache/...storage-state.json ``` +## Browser binary — real Google Chrome (for proprietary codecs) + +The chrome-service container runs **real Google Chrome**, not the bundled +Chromium, via the infra-owned image `ghcr.io/viktorbarzin/chrome-service-browser` +(`files/chrome/Dockerfile` = `mcr.microsoft.com/playwright:v1.48.0-noble` + +`google-chrome-stable`, built by `.github/workflows/build-chrome-service-browser.yml`). +The launch resolves `CHROMIUM=/opt/google/chrome/chrome`. + +**Why:** the Playwright-bundled Chromium has proprietary codecs **compiled out**, +so H.264/AAC video (Instagram Reels, X, most `.mp4`) fails in the noVNC view with +`MEDIA_ERR_SRC_NOT_SUPPORTED` (the bytes download `200 video/mp4` but there's no +decoder — NOT a GPU issue). Royalty-free codecs (VP9/VP8/AV1 → YouTube) always +worked. Swapping `libffmpeg.so` does NOT help (codecs are compiled out, not just +the lib stripped) and Chrome-for-Testing is also codec-less — only +`google-chrome-stable` carries them. 
+ ## Image pin -Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in -`stacks/chrome-service/main.tf`) and the Python client -(`playwright==1.48.0` in callers' `requirements.txt`) **must match -minor-versions**. Bump in lockstep — Playwright protocol changes between -minors and the client cannot connect to a mismatched server. - -The harvester + snapshot-server sidecar use -`mcr.microsoft.com/playwright/python:v1.48.0-noble` — same playwright -minor, with Python-side bindings pre-installed. +The Playwright base + the Python client (`playwright==1.48.0` in callers' +`requirements.txt`) and the snapshot sidecars +(`mcr.microsoft.com/playwright/python:v1.48.0-noble`) historically had to match +minor-versions. The chrome-service browser is now real Google Chrome (a newer +milestone than the 1.48 Chromium), but the `connect_over_cdp` callers (tripit +fare scrape, `homelab browser`, snapshot-harvester) attach over raw CDP, which is +version-tolerant — verified working against this Chrome. If a future Chrome +milestone breaks a caller, pin Chrome in the Dockerfile or bump the clients. ## Storage @@ -167,7 +182,29 @@ minor, with Python-side bindings pre-installed. `x11vnc` (connected to Xvfb on `localhost:6099`) bridged to `websockify` on port 6080. Service `chrome` maps :80 → :6080 and is exposed via `ingress_factory` at `chrome.viktorbarzin.me`, - Authentik-gated. + Authentik-gated. The bare host serves `vnc.html` (image symlinks + `index.html → vnc.html`); add `?autoconnect=true&resize=scale&path=websockify` + to skip the Connect button. The view is **black when no browser window is + open** (idle) — that is normal, not a failed connection. Chrome is launched + with `--window-size=1280,720 --window-position=0,0` to fill the Xvfb screen + (no window manager runs, so without it Chrome opens at its profile-persisted + size and the rest of the framebuffer shows as a black cut-off). 
+ +### noVNC fd-sweep gotcha (stuck "Connecting") + +If the noVNC client hangs on **"Connecting" forever then times out**, the cause +is almost always x11vnc's fd-table sweep: containerd grants pods +`RLIMIT_NOFILE = 2^31`, and x11vnc `fcntl`-sweeps the **entire** fd table on +every client connection, so the RFB handshake never completes (websockify +accepts the WS and logs `connecting to: localhost:5900`, but x11vnc never sends +the `RFB 003.008` banner). Diagnose: `grep "open files" /proc/$(pgrep -n +x11vnc)/limits` (huge = bad) and time the handshake from a sibling container +(`python3 -c "import socket;s=socket.socket();s.connect(('127.0.0.1',5900));print(s.recv(12))"` — +healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts** +— done both in `files/novnc/entrypoint.sh` (root) and via the container `command` +wrapper in `main.tf` (so it applies deterministically even though the image is +`:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix +as the android-emulator stack. - **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`) serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`, bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088 @@ -180,6 +217,45 @@ minor, with Python-side bindings pre-installed. See `stacks/chrome-service/README.md` for the recipe (label namespace, inject `CHROME_CDP_URL`, vendor `stealth.js`). +## Driving from OUTSIDE the cluster (`homelab browser`) + +Agents on the devvm reach this browser through the **`homelab browser`** CLI +(`cli/`, ADR-0013) — the packaged, discoverable form of the ad-hoc +`connect_over_cdp` recipe. 
It is the **escalation path, not the default**: +agents default to the Playwright MCP / headless browser for all routine +automation, and reach for `homelab browser` ONLY when headless is blocked — a +site loads but a gated action (submit/login) silently fails or hangs, the +signature of headless / anti-bot detection. (Same tiered rule lives in +`~/code/CLAUDE.md` and `homelab browser --help`.) + +```text +devvm: homelab browser run flow.js + │ kubectl port-forward svc/chrome-service :9222 (random local port) + ▼ + http://127.0.0.1: ──► chrome-service pod :9222 (CDP) + │ assert /json/version Browser is "Chrome/…", not "HeadlessChrome" + │ node + playwright-core@1.48.2 → connectOverCDP + │ context.addInitScript(stealth.js) ← same vendored file as in-cluster + │ run the user's Playwright script with page/context/browser in scope + └─ port-forward always torn down (success or error) +``` + +Key facts: + +- **port-forward bypasses the `:9222` NetworkPolicy.** It tunnels + API-server→pod, so the devvm needs no `chrome-service.viktorbarzin.me/client` + label — unlike in-cluster callers. +- **Client pinned to the image minor.** The node client is + `playwright-core@1.48.2` (matches `v1.48.0-noble` / Chromium 130), installed + lazily into `~/.cache/homelab/browser-client/`. Bump it in lockstep when the + server image bumps (same rule as the in-cluster Python clients — see "Image + pin" above). +- **Default context is a fresh incognito one** (closed on exit), safe for the + shared browser; `--shared-context` reuses the warmed persistent profile. +- **`stealth.js` is vendored** into the CLI (`cli/browser_stealth.js`) as a + byte-identical copy of `files/stealth.js`, guarded by a drift test — so the + CLI's stealth never diverges from the in-cluster callers'. 
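
The readiness assertion in the diagram above (`/json/version` must report `Chrome/…`, never `HeadlessChrome`) can be sketched as follows. This is a minimal sketch, not the CLI's Go implementation: the function names are hypothetical, the sample version strings are made up, and `fetch_version` assumes a port-forward is already listening on the given local port.

```python
import json
import urllib.request


def is_headful_chrome(version_payload: str) -> bool:
    """Return True only if the DevTools /json/version body reports real Chrome.

    The "Browser" field looks like "Chrome/<version>" for headful Chrome and
    "HeadlessChrome/<version>" for headless builds.
    """
    browser = json.loads(version_payload).get("Browser", "")
    return browser.startswith("Chrome/") and "HeadlessChrome" not in browser


def fetch_version(local_port: int) -> str:
    """Fetch /json/version through an already-running port-forward (assumed)."""
    url = f"http://127.0.0.1:{local_port}/json/version"
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


# Offline illustration with made-up version strings.
print(is_headful_chrome('{"Browser": "Chrome/130.0.0.0"}'))
print(is_headful_chrome('{"Browser": "HeadlessChrome/130.0.0.0"}'))
```

A caller would run `is_headful_chrome(fetch_version(port))` before connecting over CDP and abort if it returns `False`.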
+ ## Limits + risks - **Anti-bot vs stealth arms race** — when an upstream beats us (DRM diff --git a/docs/architecture/ci-cd.md b/docs/architecture/ci-cd.md index 1c78950f..35e041e6 100644 --- a/docs/architecture/ci-cd.md +++ b/docs/architecture/ci-cd.md @@ -116,7 +116,7 @@ instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`), fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website, k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify, -audiobook-search, council-complaints) now also land on ghcr. +audiobook-search) now also land on ghcr. ### Infra-owned images (issues #29 / #30) diff --git a/docs/architecture/monitoring.md b/docs/architecture/monitoring.md index a5dec0af..3c75a345 100644 --- a/docs/architecture/monitoring.md +++ b/docs/architecture/monitoring.md @@ -146,7 +146,7 @@ Query examples (Grafana → Loki): `{job="rpi-sofia-journal"}`, `{job="rpi-sofia **Dashboard** — `dashboards/rpi-sofia.json` ("RPi Sofia", Hardware folder): status, undervoltage/throttle, SoC temp, load, memory, root-fs free + read-only, network. -**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`rpi_under_voltage_occurred==1`), `RpiSofiaHighTemp`. +**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`increase(rpi_under_voltage_occurred[1h])>0` — edge-triggered on the sticky bit; the live `rpi_under_voltage_now` bit is too transient to catch at 1-min sampling, so it fires on a *new* brown-out and auto-resolves ~1h later instead of latching until reboot), `RpiSofiaHighTemp`. 
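
Why the `increase(...[1h]) > 0` form of `RpiSofiaUndervoltage` auto-resolves can be seen in a small simulation. This is a simplified sketch, not Prometheus itself: `window_increase` only sums rises and treats a drop as a counter reset, it skips the extrapolation that the real `increase()` applies, and the sample series (one sample per minute, sticky bit set at minute 10) is invented for illustration.

```python
def window_increase(samples):
    """Simplified stand-in for PromQL increase(): sum of rises in the window.

    A drop is treated as a counter reset, so the new value counts as the rise.
    Real Prometheus additionally extrapolates to the window edges.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def firing_minutes(series, window=60):
    """Minutes at which `increase(series[window]) > 0` would be true."""
    firing = []
    for t in range(len(series)):
        start = max(0, t - window)
        if window_increase(series[start:t + 1]) > 0:
            firing.append(t)
    return firing


# Sticky bit: 0 for ten minutes, then latched at 1 (no reboot) for 170 more.
sticky = [0] * 10 + [1] * 170
edge = firing_minutes(sticky)
level = [t for t, value in enumerate(sticky) if value == 1]
print(f"edge-triggered fires for {len(edge)} min, level-triggered for {len(level)} min")
```

In this model the edge-triggered rule fires from the brown-out until the rise leaves the one-hour window, while a plain `== 1` rule stays firing for as long as the sticky bit is latched.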
**Recovery** — a systemd hardware watchdog (`RuntimeWatchdogSec=14s`, bcm2835 max ~15s) auto-reboots the Pi on a hard hang instead of leaving it dead for hours. diff --git a/docs/architecture/multi-tenancy.md b/docs/architecture/multi-tenancy.md index 27d856ef..c64a146c 100644 --- a/docs/architecture/multi-tenancy.md +++ b/docs/architecture/multi-tenancy.md @@ -543,10 +543,16 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1 **Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour. 
+**Memory — homelab CLI hooks (rolled out 2026-06-21, deploy-fixed 2026-06-22):** the per-user `claude_memory` MCP was retired for the **homelab-memory hooks** — the reconcile's `install_memory` (re)installs four scripts into `~/.claude/hooks/` each run (`homelab-memory-recall.py` UserPromptSubmit recall, `auto-learn.py` Stop-hook extraction, `pre-compact-backup.sh`/`post-compact-recovery.sh`), wires them into `settings.json` if-absent + additive, and removes the old `claude_memory` MCP. **The provisioner binary itself now self-deploys from the repo** (step 0: `bash -n`-gated `install` + re-exec when `scripts/t3-provision-users.sh` differs from `/usr/local/bin/t3-provision-users`, guarded against re-exec loops / DRY_RUN mutation) — added after this very rollout sat committed-but-undeployed for a day (only the manual `setup-devvm.sh` had ever deployed the binary), so the hourly reconcile kept running the pre-memory version and emo/anca silently lost memory (recall + auto-learn never wired). A latent `set -e` abort in `install_memory` (a bare `[[ -d plugin-dir ]] && …` returning non-zero) was also fixed; it had killed the reconcile after the first user the first time it actually ran. The hooks need a `MEMORY_API_KEY` (or `CLAUDE_MEMORY_API_KEY`) in the user's `settings.json` env — the `homelab` CLI defaults the API URL, so **the key is the only hard requirement**; `install_memory` reuses an existing key and only WARNs if absent (it does NOT mint one — that's an admin Vault step, see Remaining). wizard + emo carry a key from their original MCP setup; **ancamilea is keyless → her memory no-ops until a key is minted.** (`auto-learn.py`'s passive store calls the API directly, so it additionally needs `*_API_URL` in env to avoid its local-SQLite fallback; recall + manual `homelab memory store` go through the URL-defaulting CLI and need only the key.) 
+ +**Agent skills — vendored own-copies for an allowlist (2026-06-23):** beyond the config-inheritance base (above, which symlinks the admin's `~/.claude/skills` into every user), the reconcile's `install_skills` gives users on the `SKILL_USERS` allowlist (currently `emo`) their OWN copies of a curated skill set vendored in-repo at `scripts/workstation/claude-skills/` (16: the admin's 15 `mattpocock/skills` + `find-skills` from `vercel-labs/skills`). It copies each into `~/.agents/skills/` (owned by the user, parent `~/.agents` chowned too — `install -d` leaves intermediates root-owned) and points `~/.claude/skills/` at it with a **relative** symlink (`../../.agents/skills/` — the layout `skills add -g` produces; Claude Code reads `~/.claude/skills/`). **Vendored, NOT `npx skills add`:** upstream drifted off this exact set (`diagnose`→`diagnosing-bugs`, `write-a-skill`→`writing-great-skills` renamed; `caveman` + `zoom-out` unpublished), so npx can't reproduce it — and a per-reconcile GitHub clone + unpinned-CLI dependency has no place in the hourly root job; refresh by re-snapshotting (`claude-skills/README.md`). **if-absent keys on the user's OWN copy** (a real dir under `~/.agents/skills`), so a steady-state reconcile is a no-op AND a stale or cross-user `~/.claude/skills` symlink is healed to the own copy — emo had `grill-me`/`file-issue` symlinked into the admin's home; `grill-me` is now emo's own (`file-issue` is outside the set, left as-is). A real dir squatting a name is never clobbered. Best-effort tail (`return 0`, like `install_memory`). Extend coverage = edit `SKILL_USERS`. 
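
The own-copy plus relative-symlink behaviour described above can be sketched as below. This is a hedged illustration in Python, not the provisioner's shell code: `install_skill`, its return labels, and `demo` are hypothetical names, ownership changes (`chown`) are omitted, and a POSIX filesystem with symlink support is assumed.

```python
import os
import shutil
import tempfile
from pathlib import Path


def install_skill(vendored: Path, home: Path, name: str) -> str:
    """Give `home` its own copy of a skill and point ~/.claude/skills at it."""
    own = home / ".agents" / "skills" / name
    link = home / ".claude" / "skills" / name
    relative_target = os.path.join("..", "..", ".agents", "skills", name)

    # if-absent keys on the user's OWN copy, not on the symlink.
    if not own.is_dir():
        own.parent.mkdir(parents=True, exist_ok=True)
        shutil.copytree(vendored / name, own)

    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink():
        if os.readlink(link) == relative_target:
            return "unchanged"  # steady state: nothing to do
        link.unlink()  # stale or cross-user symlink: heal it
        os.symlink(relative_target, link)
        return "healed"
    if link.exists():
        return "skipped"  # a real dir squatting the name is never clobbered
    os.symlink(relative_target, link)
    return "linked"


def demo() -> list:
    """Run first install, steady state, and healing in a throwaway directory."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        skill = root / "vendored" / "grill-me"
        skill.mkdir(parents=True)
        (skill / "SKILL.md").write_text("demo skill\n")
        home = root / "home"
        home.mkdir()

        results = [install_skill(root / "vendored", home, "grill-me")]
        results.append(install_skill(root / "vendored", home, "grill-me"))

        # Simulate a symlink pointing into another user's home.
        link = home / ".claude" / "skills" / "grill-me"
        link.unlink()
        os.symlink(str(root / "someone-else" / "grill-me"), link)
        results.append(install_skill(root / "vendored", home, "grill-me"))
        return results


print(demo())
```

Keying the check on the own copy rather than on the symlink is what lets a repeat run be a no-op while still repairing a link that points somewhere else.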
+ **Onboarding state self-heals (2026-06-15):** `~/.claude.json` is a single file that ALL of a user's concurrent `claude` processes (the ttyd terminal + their `t3-serve` instance + agent/SDK sessions) read-modify-write, so a stale writer periodically drops top-level keys — including `hasCompletedOnboarding` — which bounces the next *interactive* session back to the first-run "Choose the text style" wizard even though the user is fully logged in (credentials live in the SEPARATE `~/.claude/.credentials.json`, untouched by the race; first observed for emo 2026-06-15). The launcher (`skel/start-claude.sh`) now idempotently re-asserts `hasCompletedOnboarding` (+ `lastOnboardingVersion`) in `~/.claude.json` right before it runs `claude` — merge-only, never clobbers other keys, no-op if jq is missing or the file is empty/corrupt. And since the launcher is a per-user copy that `/etc/skel` only seeds at account creation, the reconcile's new `deploy_user_launcher` step re-copies `skel/start-claude.sh` into every non-admin home (copy-if-changed) so launcher edits now reach EXISTING users within the hour — `.tmux.conf` is deliberately NOT re-copied (terminal-lobby appends its own managed section to it). **Claude Code runtime — native, per-user (2026-06-15):** `claude` is the **native** install (`~/.local/bin/claude` → `~/.local/share/claude/versions/`, self-updating; `installMethod: native`) — NOT npm-global or npx. It is the runtime for both the ttyd launcher and each `t3-serve` instance. `setup-devvm.sh` installs node ONLY for the `t3` CLI (not claude); per-user native claude is provisioned by the reconcile's `install_user_claude_native` (covers terminal + t3, idempotent, skip-if-present) and self-bootstrapped by `start-claude.sh` on first launch — both via the official `https://claude.ai/install.sh`. 
The legacy machine-wide `npm install -g @anthropic-ai/claude-code` bootstrap and the launcher's `npx` fallback were removed; existing users had already auto-migrated to native, and the npm-global dir was empty. **PATH (`~/.local/bin`, where the native binary lives):** ensured three ways — `/etc/profile.d/10-local-bin.sh` for login shells (machine-wide, fresh-user-safe), `start-claude.sh` itself (the launcher runs in tmux's non-login env that skips the user's shell rc), and `t3-serve@.service` (`Environment=PATH=…:/home/%i/.local/bin`). +**Claude authentication — per-user, self-renewing, Vault-recoverable (2026-06-20):** every roster user logs in with their OWN Enterprise identity; shared `CLAUDE_CODE_OAUTH_TOKEN` injection was removed because environment auth outranks local login and collapses identity/audit/quota. Claude owns access-token refresh in `~/.claude/.credentials.json`. A system template timer (`claude-auth-sync@.timer`, every 6h) renews a dedicated 32-day periodic Vault token, validates Claude with real non-persistent Haiku inference (`auth status` can lie during a 401), backs up only `claudeAiOauth` to `secret/workstation/claude-users/`, and performs one atomic Vault restore/retry on failure while preserving `mcpOAuth`. Vault policy `workstation-claude-` isolates every path; the roster generates policies for present and future users. A hard refresh-token revocation still requires the affected person to complete SSO—there is no supported noninteractive bypass. Loki alert `WorkstationClaudeAuthInvalid` surfaces exhausted recovery. Runbook: `../runbooks/claude-auth-renew-workstation.md`. + **Per-user browser MCP — playwright, reproducible from git (2026-06-16):** every user (incl. 
the admin) gets their OWN isolated `@playwright/mcp` server so their concurrent Claude sessions don't fight over tabs (`--isolated` → a fresh browser context per MCP connection), wired into Claude in **every directory** via a user-scope `~/.claude.json` entry (`playwright → http://localhost:/mcp`). Mechanism: **system-level template units** `playwright-mcp@.service` + `playwright-snapshot-refresh@.{service,timer}` (`User=%i`, sourced from `scripts/workstation/playwright/`, installed by `setup-devvm.sh` §9e — system manager, so NO systemd --user / linger). `roster_engine.py` allocates a sticky per-user `PLAYWRIGHT_PORT` (`PLAYWRIGHT_BASE_PORT=8931`); the reconcile's `install_playwright()` writes it, seeds the chrome-service snapshot token if-absent (staged from Vault `secret/chrome-service` to `/etc/t3-serve/chrome-service-token` by `setup-devvm.sh` §8c, since the hourly root reconcile has no Vault token), wires `~/.claude.json` by running `claude mcp add --scope user` AS the user (clobber-proof + if-absent, so it fixes existing/new/admin without rewriting a populated config), and `enable --now`s the instances (idempotent — never restarts a running server). The `@playwright/mcp` version is **pinned** in the unit (the `@latest`-silently-rolls-the-fleet footgun — see `T3_PIN`). Replaced the earlier hand-made `~/.config/systemd/user/playwright-*` units (one-time idle-gated migration; pre-migration emo/anca had servers running but never wired into their `.claude.json`). Cookie-warming pipeline + ops: `../runbooks/chrome-service-snapshot.md`. **Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. 
Its location depends on the per-user `code_layout` in `roster.yaml`: `single` (default) puts the clone AT `~/code`; `workspace` makes `~/code` a plain directory of per-project clones — the infra clone at `~/code/infra` plus each roster `repos` entry cloned from Forgejo `viktor/` **as the user** (their PAT authenticates, so private repos work; clone failures WARN and retry next hour). Flipping a user to `workspace` auto-migrates their existing `~/code` clone to `~/code/infra` (local branches/dirty state survive; running processes follow the moved inode). ancamilea = workspace + `tripit` since 2026-06-10. The provisioner clones infra anonymously from the public GitHub mirror; **contribute access is wired per-user on top** (see below). The apply boundary still holds (`scripts/tg apply` needs an admin Vault token + cluster RBAC), but **pushing `master` is NOT inert** — the Forgejo→Woodpecker webhook fires `.woodpecker/default.yml` (`event: push, branch: master`, `require_approval: forks` only), which terragrunt-applies changed stacks. `master` is **branch-protected on Forgejo** (force-push disabled for everyone — history is append-only; push + merge whitelists = `viktor` + explicitly granted users, deploy keys allowed). **Allow-then-audit (Viktor, 2026-06-10):** `ebarzin` (emo) is on the whitelist and pushes straight to `master` — no PR gate. The tracking burden moves to: (a) **commit messages that record what + why** (the agent instructions in AGENTS.md and the managed claudeMd require the body to paraphrase the user's request), (b) the **`notify-nonadmin-push` Slack audit step** in `.woodpecker/default.yml` — every master push by a non-admin author is posted to Slack (admin pushes are not), and (c) non-admins **never use `[ci skip]`** so every change fires the pipeline (and thus the audit feed). Users NOT on the whitelist fall back to `/` branches + PRs. 
**Clones stay fresh automatically** (2026-06-10): the hourly `t3-provision-users` reconcile runs `refresh_user_clone` over every managed clone — the infra clone and any workspace repos (fetch all remotes + fast-forward `master`, ONLY when on master with a clean tree and an upstream — dirty trees and local commits are left alone with a WARN) — and also `wire_forgejo_remote`, which idempotently adds the documented `forgejo` remote + `forgejo/master` upstream to infra clones that predate that contract. `start-claude.sh` does the same freshen at session launch (10s fetch cap per repo so an offline remote never stalls the session; workspace layouts freshen each repo under `~/code`). @@ -561,7 +567,7 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1 **Web-terminal session persistence (2026-06-10):** the tmux-based web terminal's named sessions (each running one Claude conversation) survive devvm reboots — `tmux-persist-save.timer` (5-min) snapshots every terminal user's sessions (name, cwd, conversation uuid from argv or the cwd-slug transcript dir) to `/var/lib/tmux-persist/.tsv`, and `tmux-persist-restore.service` recreates missing sessions at boot with `claude --resume ` (per-session idempotent; also handles partial loss). The web terminal also exposes an **on-demand "Restore sessions" button** (terminal-lobby: `tmux-api` `POST /restore` → the validated root `tmux-restore-user` wrapper → `tmux-persist restore `, a single-user mode of the same script): the boot-only restore service never fires when an **OOM kills a user's tmux server *without* a reboot** (the common case under multi-user memory pressure), so the button covers that gap. 
This is a **tmux/terminal-surface** feature, deliberately outside the t3 namespace: the t3 chat surface persists its own threads (`~/.t3` state, plus the daily `t3-backup-state` dump), and Claude conversations themselves were always durable (`~/.claude/projects/`) — what this adds is the volatile tmux wiring. -**Status (2026-06-10):** built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, per-user OIDC kubeconfig + the `oidc-power-user-readonly` ClusterRole + emo's `k8s_users` entry (applied + impersonation-verified), the Authentik `T3 Users` edge gate, **the emo Phase-5 cutover (own clone + launcher repoint + `code-shared` removal, completed 2026-06-10) and emo's contribute access (`ebarzin` write collaborator + PAT + protected `master`)**, and **per-user `code_layout` with the ancamilea workspace cutover (infra → `~/code/infra`, `tripit` alongside, 2026-06-10)**. Per the live `/etc/skel` design, non-admin `~/.claude/{rules,skills}` symlinks into the admin base are **kept** (they ARE the shared-base delivery mechanism — the plan's step to remove them is obsolete). **Remaining (held / future):** the offboarding apply-side (Phase 7), the rest of per-user MCP/auth injection (`ha` + `claude_memory` + `.credentials.json` + beads Dolt cred — **per-user playwright browser MCP done 2026-06-16**, see above), and roster-reconciled `T3 Users` membership. See `../runbooks/offboard-user.md` for deprovisioning. 
+**Status (2026-06-20):** built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, per-user OIDC kubeconfig + the `oidc-power-user-readonly` ClusterRole + emo's `k8s_users` entry (applied + impersonation-verified), the Authentik `T3 Users` edge gate, **the emo Phase-5 cutover (own clone + launcher repoint + `code-shared` removal, completed 2026-06-10) and emo's contribute access (`ebarzin` write collaborator + PAT + protected `master`)**, **per-user `code_layout` with the ancamilea workspace cutover**, per-user playwright browser MCP, and per-user Claude OAuth renewal/Vault recovery. Per the live `/etc/skel` design, non-admin `~/.claude/{rules,skills}` symlinks into the admin base are **kept**. **Remaining (held / future):** the offboarding apply-side (Phase 7), per-user `ha`/`claude_memory`/beads credential injection, and roster-reconciled `T3 Users` membership. See `../runbooks/offboard-user.md` for deprovisioning. ## Related diff --git a/docs/architecture/networking.md b/docs/architecture/networking.md index e2c0ac2d..4659038a 100644 --- a/docs/architecture/networking.md +++ b/docs/architecture/networking.md @@ -4,7 +4,7 @@ Last updated: 2026-04-19 (WS E — Kea DHCP pushes dual DNS per subnet; Kea DDNS ## Overview -The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a comprehensive middleware chain including CrowdSec bot protection, Authentik forward-auth, and rate limiting. All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs. +The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. 
Traefik serves as the Kubernetes ingress controller with a middleware chain of anti-AI bot-blocking, Authentik forward-auth, rate limiting, and retry. CrowdSec IP-reputation enforcement is **out-of-band** (not a Traefik hop): banned IPs are dropped in-kernel via nftables on direct hosts and blocked at the Cloudflare edge on proxied hosts (see `docs/architecture/security.md`). All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs. ## Architecture Diagram @@ -16,12 +16,14 @@ graph TB Traefik[Traefik Ingress
3 replicas + PDB] subgraph "Middleware Chain" - CS[CrowdSec Bouncer
fail-open] + AntiAI[Anti-AI bot-block
fail-open] Auth[Authentik Forward-Auth
3 replicas + PDB] RL[Rate Limiter
429 response] Retry[Retry
2 attempts, 100ms] end + CSdrop[CrowdSec drop
nftables / CF edge
out-of-band, pre-Traefik] + subgraph "Proxmox Host (eno1)" vmbr0[vmbr0 Bridge
192.168.1.127/24] vmbr1[vmbr1 Internal
VLAN-aware] @@ -53,8 +55,9 @@ graph TB Internet -->|DNS query| CF CF -->|CNAME to tunnel| CFD CFD --> Traefik - Traefik --> CS - CS --> Auth + CSdrop -.->|banned IPs dropped before Traefik| Traefik + Traefik --> AntiAI + AntiAI --> Auth Auth --> RL RL --> Retry Retry --> Service @@ -82,7 +85,7 @@ graph TB | Cloudflare DNS | SaaS | External | ~50 public domains under viktorbarzin.me | | Cloudflared | Container | K8s (3 replicas) | Tunnel ingress, replaces port forwarding | | Traefik | Helm chart | K8s (3 replicas + PDB) | Ingress controller, HTTP/3 enabled | -| CrowdSec | Helm chart | K8s (LAPI: 3 replicas) | Bot protection, fail-open bouncer | +| CrowdSec | Helm chart | K8s (LAPI: 3 replicas) | IP reputation. Out-of-band enforcement: `cs-firewall-bouncer` DaemonSet (in-kernel nftables drop, direct hosts) + Cloudflare edge WAF rule (proxied hosts). Fail-open | | Authentik | Helm chart | K8s (3 replicas + PDB) | SSO, forward-auth middleware | | MetalLB | v0.15.3 Helm chart | K8s | LoadBalancer IPs (10.0.20.200-10.0.20.220), all services on 10.0.20.200 | | Registry Cache | Container | 10.0.20.10 | Pull-through for docker.io:5000, ghcr.io:5010 | @@ -208,24 +211,31 @@ VMs tag traffic on vmbr1 to isolate workloads. pfSense bridges VLAN 20 to the up ### Ingress Flow +CrowdSec is **not** a step in this chain — banned IPs are dropped before the +request ever reaches Traefik (Cloudflare edge WAF rule on proxied hosts; host +nftables on direct hosts). The flow below is for a request that survives that +out-of-band gate. 
+ ```mermaid sequenceDiagram participant Client - participant Cloudflare + participant CFedge as Cloudflare (edge WAF: crowdsec_ban block) participant Cloudflared participant Traefik - participant CrowdSec + participant AntiAI participant Authentik participant RateLimit participant Retry participant Service participant Pod - Client->>Cloudflare: HTTPS request to blog.viktorbarzin.me - Cloudflare->>Cloudflared: Forward via tunnel (QUIC) + Client->>CFedge: HTTPS request to blog.viktorbarzin.me + Note over CFedge: banned IP → blocked here (proxied hosts) + CFedge->>Cloudflared: Forward via tunnel (QUIC) Cloudflared->>Traefik: HTTP to LoadBalancer IP - Traefik->>CrowdSec: Apply bouncer middleware - CrowdSec->>Authentik: If allowed, check auth (protected=true) + Note over Traefik: on direct hosts, banned IPs already dropped in-kernel (nftables forward hook) + Traefik->>AntiAI: anti-AI bot-block (fail-open) + AntiAI->>Authentik: If allowed, check auth (protected=true) Authentik->>RateLimit: If authenticated, check rate limit RateLimit->>Retry: If within limit, continue Retry->>Service: Forward to Service @@ -234,24 +244,27 @@ sequenceDiagram Service-->>Retry: Response Retry-->>RateLimit: Response RateLimit-->>Authentik: Response (strip auth headers) - Authentik-->>CrowdSec: Response - CrowdSec-->>Traefik: Response + Authentik-->>AntiAI: Response + AntiAI-->>Traefik: Response Traefik-->>Cloudflared: Response - Cloudflared-->>Cloudflare: Response via tunnel - Cloudflare-->>Client: HTTPS response + Cloudflared-->>CFedge: Response via tunnel + CFedge-->>Client: HTTPS response ``` ### Middleware Chain -Every ingress created by the `ingress_factory` module follows this chain: +CrowdSec IP-reputation enforcement is **not** in this chain — it is out-of-band +(host nftables on direct hosts; the Cloudflare edge WAF `crowdsec_ban` rule on +proxied hosts), so banned IPs never reach the chain and there is no per-request +CrowdSec hop. 
Every ingress created by the `ingress_factory` module follows this +Traefik chain: -1. **CrowdSec Bouncer**: Checks IP against threat database. **Fail-open** mode — if LAPI is unreachable, traffic passes through to prevent outages. +1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`). 2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend. 3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads) and ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load). 4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors). Additional middleware: -- **Anti-AI**: On by default via `ingress_factory`. Blocks common AI crawler user-agents. - **HTTP/3 (QUIC)**: Enabled globally on Traefik. 
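To make the rate-limit numbers above concrete, here is a simplified token-bucket model of the ActualBudget case: roughly 70 requests arriving at once against `average 10, burst 50` versus `50/300`. It is a sketch under stated assumptions, not Traefik's implementation: the `TokenBucket` class and `count_rejected` helper are illustrative names, and all requests are assumed to arrive at the same instant from one source IP.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    average: float       # tokens refilled per second
    burst: int           # bucket capacity
    tokens: float = 0.0
    last: float = 0.0

    def __post_init__(self):
        # Start full, so a fresh client can spend the whole burst at once.
        self.tokens = float(self.burst)

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.average)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False     # the caller would answer 429 here


def count_rejected(average: float, burst: int, requests: int, at: float = 0.0) -> int:
    """Number of requests (all arriving at time `at`) that get rejected."""
    bucket = TokenBucket(average, burst)
    return sum(0 if bucket.allow(at) else 1 for _ in range(requests))
```

In this model the default profile rejects the last 20 of 70 simultaneous requests, while the `50/300` profile admits all of them, which matches the stalled-tail symptom described for ActualBudget.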
### Entrypoint Transport Timeouts @@ -348,7 +361,7 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac | pfSense | `stacks/pfsense/` | VM + cloud-init config | | Technitium | `stacks/technitium/` | Deployment, Service, PVC | | Traefik | `stacks/platform/` (sub-module) | Helm release, IngressRoute CRDs | -| CrowdSec | `stacks/platform/` (sub-module) | Helm release, LAPI + bouncer | +| CrowdSec | `stacks/crowdsec/` (+ edge in `stacks/rybbit/`) | Helm release, LAPI + agent; `cs-firewall-bouncer` DaemonSet (nftables, direct hosts) + Cloudflare edge sync (proxied hosts) | | Authentik | `stacks/authentik/` | Helm release, ingress, OIDC configs | | MetalLB | `stacks/platform/` (sub-module) | Helm release, IPAddressPool | | Cloudflared | `stacks/cloudflared/` | Deployment (3 replicas), tunnel config; runs `--no-autoupdate` (in-place self-updates exited the pods and severed all tunnel WebSockets, 2026-06-09/10) | @@ -436,13 +449,30 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac **Decision**: Technitium handles internal `.lan` domains with near-zero latency. Cloudflare handles public domains with global DNS. K8s nodes use Technitium as primary, which forwards non-.lan queries to Cloudflare. -### Why Fail-Open on CrowdSec Bouncer? +### Why CrowdSec Enforcement Is Out-of-Band (and Fails Open) -**Alternatives considered**: -1. **Fail-closed**: Maximum security, but LAPI downtime blocks all traffic. -2. **Redundant LAPI**: Already scaled to 3 replicas, but resource pressure can still cause outages. +CrowdSec used to enforce inline as a Traefik middleware (the +`crowdsec-bouncer-traefik-plugin`). On Traefik 3.7.5 the Yaegi plugin handler was +never invoked, so it enforced nothing; the plugin was removed and enforcement +moved off the request path entirely (full history in +`docs/architecture/security.md`). It now runs on two surfaces: -**Decision**: Availability > strict bot blocking. 
CrowdSec LAPI is scaled to 3 replicas for resilience, but during cluster-wide resource exhaustion (e.g., memory pressure), bouncer falls back to allowing traffic. This prevents a complete service outage due to a security add-on. +- **Direct hosts** → `cs-firewall-bouncer` DaemonSet drops banned IPs in the host + nftables, in **both the `input` and `forward` hooks**. The `forward` hook is + the load-bearing one: with Traefik on a dedicated LB IP at + `externalTrafficPolicy=Local`, client packets are DNAT'd to the Traefik **pod** + and transit the node's `forward` chain (not `input`) — which is exactly why the + ingress must preserve the **real client IP** end-to-end (ETP=Local + PROXY-v2 + for IPv6; see the Traefik LB IP and IPv6 ingress notes above). Without the real + client IP the firewall-bouncer (and the CF edge rule) would have nothing to + match on. +- **Proxied hosts** → a Cloudflare edge WAF rule (`ip.src in $crowdsec_ban`) fed + by the `crowdsec-cf-sync` CronJob. + +Both **fail open**: if LAPI is unreachable, the firewall-bouncer simply stops +receiving new decisions (existing drops persist) and the CF sync skips a run — +neither ever blocks legitimate traffic. Availability > strict bot blocking, and +out-of-band enforcement adds **zero per-request latency** (no Traefik hop). ### Why HTTP/3 (QUIC)? @@ -473,9 +503,10 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac **Symptoms**: All ingress routes return 503, Traefik dashboard shows no backends available. -**Diagnosis**: Middleware chain is blocking traffic. Check: -1. Authentik status: `kubectl get pod -n authentik` -2. CrowdSec LAPI status: `kubectl get pod -n crowdsec` +**Diagnosis**: Middleware chain is blocking traffic. (CrowdSec is **not** in the +chain — a CrowdSec/LAPI outage cannot cause 503s; it only stops new bans.) Check: +1. Authentik status: `kubectl get pod -n authentik` (ForwardAuth fails closed if the auth server is unreachable) +2. 
`bot-block-proxy` status: `kubectl get pod -n traefik -l app=bot-block-proxy` (anti-AI ForwardAuth target — also fails closed if down) 3. Traefik logs: `kubectl logs -n kube-system deploy/traefik` **Fix**: If Authentik is down and ingress uses forward-auth, pods won't pass health checks. Scale Authentik to 3 replicas or temporarily disable forward-auth middleware. diff --git a/docs/architecture/security.md b/docs/architecture/security.md index a832113b..7d3043ea 100644 --- a/docs/architecture/security.md +++ b/docs/architecture/security.md @@ -2,40 +2,50 @@ ## Overview -The homelab implements defense-in-depth security at the application layer (L7) using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). All security components operate in graceful degradation mode (fail-open) to prevent cascading failures. Security policies are deployed in audit mode first, then selectively enforced after validation. +The homelab implements defense-in-depth security using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). CrowdSec enforcement is **out-of-band** (not a per-request Traefik hop — see the CrowdSec section): banned IPs are dropped in-kernel via nftables on direct hosts, and blocked at the Cloudflare edge on proxied hosts, so enforcement adds **zero per-request latency**. All security components fail open (a CrowdSec outage stops new bans but never blocks legitimate traffic). Security policies are deployed in audit mode first, then selectively enforced after validation. ## Architecture Diagram +CrowdSec enforcement is out-of-band (NOT an inline Traefik middleware hop). 
The +Traefik request chain is anti-AI → Authentik ForwardAuth → rate-limit → retry; +CrowdSec drops banned IPs *before* (direct hosts) or *off* (proxied hosts) that +chain entirely. + ```mermaid -graph LR +graph TB Internet[Internet] - CF[Cloudflare WAF] + + subgraph "Proxied hosts (orange-cloud)" + CFedge[Cloudflare edge
WAF rule: ip.src in $crowdsec_ban → block] + end + subgraph "Direct hosts (grey-cloud / internal)" + NFT[Host nftables
table crowdsec/crowdsec6
drop in input + forward] + end + Tunnel[Cloudflared Tunnel] - CrowdSec[CrowdSec Bouncer
Traefik Plugin] - AntiAI[Anti-AI Check
poison-fountain] - ForwardAuth[Authentik ForwardAuth] - RateLimit[Rate Limit Middleware] - Retry[Retry Middleware
2 attempts, 100ms] + Traefik[Traefik
anti-AI → Authentik → rate-limit → retry] Backend[Backend Service] LAPI[CrowdSec LAPI
3 replicas] - Agent[CrowdSec Agent] + Agent[CrowdSec Agent
parses Traefik logs] + FWB[cs-firewall-bouncer
DaemonSet, every node] + CFsync[crowdsec-cf-sync
CronJob, every 2 min] - Internet -->|1| CF - CF -->|2| Tunnel - Tunnel -->|3| CrowdSec - CrowdSec -.->|Query| LAPI - Agent -.->|Report| LAPI - CrowdSec -->|4. Pass/Block| AntiAI - AntiAI -->|5. Human/Bot| ForwardAuth - ForwardAuth -->|6. Authenticated| RateLimit - RateLimit -->|7. Under Limit| Retry - Retry -->|8. Success/Retry| Backend + Internet -->|proxied| CFedge + Internet -->|direct| NFT + CFedge -->|allowed| Tunnel + Tunnel --> Traefik + NFT -->|allowed| Traefik + Traefik --> Backend - style CrowdSec fill:#f9f,stroke:#333 - style AntiAI fill:#ff9,stroke:#333 - style ForwardAuth fill:#9f9,stroke:#333 - style RateLimit fill:#99f,stroke:#333 + Agent -.->|report| LAPI + LAPI -.->|all decisions incl. CAPI| FWB + FWB -.->|program drop rules| NFT + LAPI -.->|ban/captcha decisions, CAPI excluded| CFsync + CFsync -.->|push IP list| CFedge + + style CFedge fill:#f9f,stroke:#333 + style NFT fill:#f9f,stroke:#333 ``` ## Components @@ -44,7 +54,8 @@ graph LR |-----------|---------|----------|---------| | CrowdSec LAPI | Pinned | `stacks/crowdsec/` | Local API, threat intelligence aggregation (3 replicas) | | CrowdSec Agent | Pinned | `stacks/crowdsec/` | Log parser, scenario detection | -| CrowdSec Traefik Bouncer | Plugin | Traefik config | Plugin-based IP reputation check | +| cs-firewall-bouncer | v0.0.34 | `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf` | In-kernel nftables drop on every node (DIRECT hosts). Bouncer key `firewall` | +| crowdsec-cf-sync | — | `stacks/rybbit/crowdsec_edge.tf` | LAPI→Cloudflare-IP-List sync CronJob (PROXIED hosts). 
Bouncer key `kvsync` | | Kyverno | Pinned chart | `stacks/kyverno/` | Policy engine for K8s admission control | | poison-fountain | Latest | `stacks/poison-fountain/` | Anti-AI bot detection and tarpit service | | cert-manager/certbot | - | `stacks/cert-manager/` | TLS certificate management | @@ -54,11 +65,15 @@ graph LR ### Request Security Layers -Every incoming request passes through 6 security layers: +CrowdSec IP-reputation enforcement happens **before** a request reaches the +Traefik chain (banned IPs are dropped in-kernel on direct hosts, or blocked at +the Cloudflare edge on proxied hosts — see CrowdSec Threat Intelligence below). +A request that survives that out-of-band gate then passes through the Traefik +middleware chain: -1. **Cloudflare WAF** - DDoS protection, bot detection, firewall rules (external) -2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP -3. **CrowdSec Bouncer** - IP reputation check against LAPI (fail-open on error) +1. **Cloudflare WAF / edge** - DDoS protection, bot detection, firewall rules incl. the CrowdSec `crowdsec_ban` block rule (proxied hosts only) +2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP (proxied hosts) +3. **CrowdSec out-of-band drop** - nftables on direct hosts; *not* a Traefik hop (zero per-request latency) 4. **Anti-AI Scraping** - 3-layer bot defense (optional per service, updated 2026-04-17) 5. **Authentik ForwardAuth** - Authentication check (if `protected = true`) 6. 
**Rate Limiting** - Per-source IP rate limits (returns 429 on breach) @@ -80,11 +95,71 @@ CrowdSec operates in a hub-and-agent model: - Reports malicious IPs to LAPI - Shares threat intel with CrowdSec community (anonymized) -**Traefik Bouncer Plugin**: -- Integrated as Traefik middleware -- Queries LAPI for IP reputation on each request -- **Fail-open mode**: If LAPI unreachable, allows traffic (graceful degradation) -- Blocks IPs on ban list, allows others +Enforcement is split across **two out-of-band surfaces**, neither of which adds +any per-request latency. (See "Why the Traefik bouncer plugin was removed" below +for the supersession history — there is no longer an inline Traefik bouncer.) + +**Surface 1 — DIRECT (non-Cloudflare-proxied) hosts → in-kernel nftables drop** +(`cs-firewall-bouncer` DaemonSet, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`): +- Runs on **every node** (no nodeSelector). Programs the HOST nftables — `table ip + crowdsec` / `table ip6 crowdsec6` — with drop rules in **both the `input` AND + the `forward` hooks**. The `forward` hook is required because Traefik is a + LoadBalancer with `externalTrafficPolicy=Local`: client traffic is DNAT'd to the + Traefik **pod** and transits the node's `forward` hook (not `input`) with the + real client IP preserved. Chains use `policy accept` (only set members drop — + it can never blackhole normal traffic). +- Pulls **all** decisions from LAPI, **including the CAPI community blocklist + (~31k IPs)**. Packets from banned IPs are dropped **in-kernel before reaching + Traefik** → zero per-request hops, no Traefik involvement at all. +- **Packaging**: cs-firewall-bouncer publishes no container image, so the + **v0.0.34** static binary is fetched at runtime by an initContainer onto a + `debian:bookworm-slim` runtime container. Needs `hostNetwork` + + `NET_ADMIN`/`NET_RAW` to talk netlink directly. Registered bouncer key: + **`firewall`**. 
+- **Fail-open**: if LAPI is unreachable it just stops receiving new decisions + (existing drop rules persist); it never blocks legitimate traffic. + +**Surface 2 — PROXIED (Cloudflare orange-cloud) hosts → Cloudflare edge block** +(`stacks/rybbit/crowdsec_edge.tf` + `lapi_kv_sync.py`): +- Proxied hosts terminate at the Cloudflare edge, so a host-level nftables drop + would never see them. Enforcement is instead a single Cloudflare Rules List + **`crowdsec_ban`** + a zone-scoped WAF custom rule `(ip.src in $crowdsec_ban)` + → **block** action, which covers every proxied host in the zone. +- Fed by the **`crowdsec-cf-sync` CronJob** (namespace `rybbit`, every 2 min, + pure-stdlib Python in a ConfigMap). It pulls local **ban/captcha ip-scoped** + decisions and pushes them into the CF list, but **EXCLUDES the ~31k CAPI + community blocklist** — that set is far too large for a CF Rules List (the CF + account hard-limits to **one** list), and CAPI is already covered in-kernel on + direct hosts and by Cloudflare's own managed protections on proxied hosts. + Registered bouncer key: **`kvsync`**. +- **Block-only**: the single-list limit precludes a separate + captcha/managed-challenge list, so both ban and captcha decisions are enforced + as a plain block at the edge. +- **Auth carve-out:** the WAF rule excludes `authentik.viktorbarzin.me` + + `public-auth.viktorbarzin.me` (`… and not (http.host in {…})`). A CrowdSec hit + must never wall a user out of the login / WebAuthn flow they authenticate + through; auth keeps `traefik-rate-limit` for brute-force protection. + +**Whitelist** (`stacks/crowdsec/whitelist.yaml`): a CrowdSec whitelist covers +RFC1918 + the tailnet + internal CIDRs (plus one specific external IP), so +internal users are never enforced. Internal access uses split-horizon DNS +straight to Traefik, and direct internal clients are RFC1918 — both whitelisted. 
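The split between the two surfaces can be sketched as a filter plus a rule check. This is a hedged illustration only, not the contents of `lapi_kv_sync.py`: the decision dictionaries and their `origin` / `type` / `scope` / `value` keys are assumed shapes, `edge_ban_list` and `edge_blocks` are hypothetical names, and the IPs are documentation addresses.

```python
# Hosts carved out of the edge WAF rule, as listed in the text above.
AUTH_HOSTS = {"authentik.viktorbarzin.me", "public-auth.viktorbarzin.me"}


def edge_ban_list(decisions):
    """Select what would be pushed to the `crowdsec_ban` list.

    Keeps ban/captcha decisions that are ip-scoped and not from CAPI
    (assumed field names; the in-kernel surface takes every decision instead).
    """
    return {
        d["value"]
        for d in decisions
        if d["type"] in {"ban", "captcha"}
        and d["scope"].lower() == "ip"
        and d["origin"] != "CAPI"
    }


def edge_blocks(src_ip, host, ban_list):
    """Model of `(ip.src in $crowdsec_ban) and not (http.host in {...})`."""
    return src_ip in ban_list and host not in AUTH_HOSTS


decisions = [
    {"origin": "CAPI", "type": "ban", "scope": "Ip", "value": "198.51.100.7"},
    {"origin": "crowdsec", "type": "ban", "scope": "Ip", "value": "203.0.113.10"},
    {"origin": "crowdsec", "type": "captcha", "scope": "Ip", "value": "203.0.113.11"},
    {"origin": "cscli", "type": "ban", "scope": "Range", "value": "203.0.113.0/28"},
]
ban_list = edge_ban_list(decisions)
```

Because the edge list is block-only, the captcha decision ends up in the same set as the ban; the CAPI entry and the range-scoped entry are left out of this sketch's edge list.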
+ +#### Why the Traefik bouncer plugin was removed + +Enforcement used to run as an inline Traefik middleware — the +`crowdsec-bouncer-traefik-plugin` (Yaegi/Lua), which queried LAPI on every +request and could serve a Cloudflare Turnstile captcha for soft remediations. +On **Traefik 3.7.5 the Yaegi handler was never invoked**, so the bouncer was +registered but enforced **nothing** despite appearing healthy. Rather than chase +the Yaegi runtime, the whole plugin path was **removed** (2026-06): the plugin +static config + initContainer download, the `crowdsec` Middleware CRD, the +`captcha.html` template + its ConfigMap and volume mount, and the Cloudflare +Turnstile widget (`cloudflare_turnstile_widget.crowdsec_captcha`). It was +replaced by the two out-of-band surfaces above, which add zero per-request +latency and fail open. (The earlier `crowdsec-cf-sync` cursor-pagination / +IP-List-capacity issues are also moot now that CAPI is excluded from the edge +list and dropped in-kernel instead.) **Metabase** (disabled by default): - Dashboard for CrowdSec analytics @@ -330,10 +405,12 @@ Beads: `code-8ywc` W1.6 + W1.7. 
**Status: planned.** | Path | Purpose | |------|---------| -| `stacks/crowdsec/` | CrowdSec LAPI, agent, bouncer config | +| `stacks/crowdsec/` | CrowdSec LAPI, agent config + `whitelist.yaml` | +| `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf` | cs-firewall-bouncer DaemonSet (in-kernel nftables drop, direct hosts) | +| `stacks/rybbit/crowdsec_edge.tf` + `lapi_kv_sync.py` | Cloudflare IP-List + WAF block rule + LAPI→CF sync CronJob (proxied hosts) | | `stacks/kyverno/` | Kyverno deployment + policies | | `stacks/poison-fountain/` | Anti-AI service + CronJob | -| `stacks/platform/modules/traefik/middleware.tf` | Security middleware definitions | +| `stacks/traefik/modules/traefik/middleware.tf` | Security middleware definitions (no longer includes a CrowdSec bouncer) | | `stacks/platform/modules/ingress_factory/` | Per-service security toggles | ### Vault Paths @@ -443,7 +520,11 @@ spec: **Fix**: 1. Check LAPI decisions: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions list` 2. Remove ban: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions delete --ip ` -3. Whitelist if needed: Add to `stacks/crowdsec/whitelist.yaml` + — the in-kernel drop clears as soon as `cs-firewall-bouncer` reconciles (direct + hosts); for proxied hosts the `crowdsec-cf-sync` CronJob removes it from the + `crowdsec_ban` CF list within ~2 min. +3. Whitelist if needed: Add to `stacks/crowdsec/whitelist.yaml` (RFC1918 + tailnet + + internal CIDRs are already whitelisted, so internal clients are never banned). ### Kyverno Policy Blocking Deployment diff --git a/docs/plans/2026-06-07-multi-user-workstation-design.md b/docs/plans/2026-06-07-multi-user-workstation-design.md index 8e54fa95..4d80eae4 100644 --- a/docs/plans/2026-06-07-multi-user-workstation-design.md +++ b/docs/plans/2026-06-07-multi-user-workstation-design.md @@ -110,7 +110,7 @@ The Config base / machine-wide managed layer is **secret-free**. 
Everything carr | Auth / token | Lives in (per-user, `0600`) | New-user provisioning (from Vault) | |---|---|---| -| **Claude OAuth** | `~/.claude/.credentials.json` (or `CLAUDE_CODE_OAUTH_TOKEN`) | the shared Enterprise token (earlier decision) **or** own interactive login; emo keeps his own | +| **Claude OAuth** | `~/.claude/.credentials.json` + isolated Vault backup | own Enterprise SSO login; Claude refreshes locally and `claude-auth-sync@.timer` validates/backs up/recovers `claudeAiOauth` at `secret/workstation/claude-users/`; shared token injection is forbidden | | **`claude_memory` MCP** | `~/.claude.json` mcpServers + `MEMORY_API_KEY` in `settings.json` env | **DEFERRED — not a risk now (Viktor, 2026-06-08).** Per-user memory isolation needs a service-side `_key_to_user` map edit + redeploy (claude-memory-mcp, GHA repo 78), not just a Vault write — NOT built now. For now a new user gets a simple key or omits memory; revisit if isolation becomes a concern. | | **`ha` MCP** (token-in-URL) | `~/.claude.json` | shared `ha_sofia_mcp_url` from Vault `secret/openclaw` (one HA instance; shared secret, per-user file) — only if HA-eligible | | **`playwright` MCP** | per-user systemd unit (own port) + localhost entry | existing per-user playwright pattern (id=4015); non-secret | diff --git a/docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md b/docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md new file mode 100644 index 00000000..64a28d1c --- /dev/null +++ b/docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md @@ -0,0 +1,243 @@ +# External Secrets Operator: 0.12.1 → 2.6.0 Migration (v1beta1 → v1) — Design Doc + +> **Status:** ✅ **COMPLETE (2026-06-22).** ESO at chart/app **2.6.0**; all 104 ExternalSecrets + 2 ClusterSecretStores on `external-secrets.io/v1`; 109 ESs SecretSynced (2 pre-existing dead); compat-gate now returns `OK: cluster is safe to upgrade to 1.35.6` (EXIT 0) — the last k8s-1.35 blocker is cleared. 
Executed Phase 1 (climb to 0.16.2) → Phase 2 (v1 rewrite, validated GC-survival on tandoor) → Phase 3 (climb 0.16.2→2.6.0 across the 0.17 cutoff, ES sync held at 109 every hop). Side-finding fixed: repo-wide stale `.terraform.lock.hcl` files (missing gavinbunney/kubectl + telmate/proxmox from the generated providers.tf) had broken `terragrunt apply` for ~28 stacks (this is what failed CI pipeline 332) — reconciled via `init -upgrade` + committed. +> **Scope:** Upgrade the ESO Helm chart `0.12.1` (app `v0.12.1`) to `2.6.0` (app `v2.6.0`) and migrate every `external-secrets.io/v1beta1` custom resource to `external-secrets.io/v1`. +> **Owner:** Viktor Barzin. **Author:** Claude (research + design only — no changes applied). +> +> **EXECUTION CORRECTION + STATUS (2026-06-21 — "let's do the ESO migration"):** The cluster is already on **k8s 1.34.9** (all 7 nodes), NOT ≤1.31 as §4.3 assumed. ESO 0.12 runs fine on 1.34 (the support-matrix bands are conservative *tested* ranges, not hard limits). **The entire ESO climb 0.12→2.6 therefore happens on k8s 1.34 — there is NO k8s interleave; IGNORE the "advance k8s to 1.32/1.33" steps in §4.3 / Phase 1 / Phase 3.** Only AFTER ESO reaches 2.x does the nightly version-check chain take k8s 1.34→1.35 (gate clears). Exact hop sequence (latest patch per minor): **0.13.0 → 0.14.4 → 0.15.1 → 0.16.2** [rewrite all 104 CRs to `v1` here] → **0.17.0 → 0.18.2 → 0.19.2 → 0.20.4 → 1.0.0 → 1.1.1 → 1.2.1 → 1.3.2 → 2.0.1 → 2.1.0 → 2.2.0 → 2.3.0 → 2.4.1 → 2.5.0 → 2.6.0**. Pre-flight done: CRD `storedVersions` are `["v1beta1"]` only (no v1alpha1 patch needed). 
+> +> **EXECUTION LOG:** +> - **✅ Phase 1 DONE (2026-06-21):** ESO climbed 0.12.1 → 0.13.0 → 0.14.4 → 0.15.1 → **0.16.2**, one hop at a time, each applied + verified (controller healthy; 108 live ExternalSecrets stayed SecretSynced; 2 pre-existing dead — `instagram-poster/instagram-poster-secrets` False since 2026-05-10, `payslip-ingest/payslip-ingest-secrets` False since 2026-04-25, both missing Vault data, untouched). Added `atomic=true` + `timeout=600` to the helm_release. At 0.16.2 **both `v1beta1` and `v1` are served** (110 each) and `storedVersions = ["v1beta1","v1"]`. Committed (`eso: Phase 1 …`); state auto-committed per hop by `scripts/tg`. +> - **⏳ Phase 2 PENDING — findings confirmed (decisive for execution):** (a) bumping a `kubernetes_manifest` ExternalSecret's apiVersion v1beta1→v1 **forces a REPLACE** (verified live on instagram-poster: `-/+ must be replaced`), NOT in-place. (b) Our ExternalSecrets use **`creationPolicy=Owner`** (default; confirmed on nextcloud) → target Secrets carry an ownerReference, so the replace's delete step can **cascade-GC the Secret** before ESO recreates it. → **Phase 2 must be done carefully, NOT a blind bulk apply:** (1) snapshot ALL target Secrets first (backstop); (2) **empirically validate on the FIRST live stack** — migrate one ES while watching its target Secret; ESO re-syncs the identical spec fast and should re-adopt before GC, but confirm before proceeding; (3) then the per-stack two-phase `-target`-then-full apply (the 15 plan-time-coupled stacks need `-target` first). If validation shows GC wins, pivot to `state rm` + `import {}` (adopts the already-v1-served object with zero delete → zero GC). Repo is clean at v1beta1 (the lone test edit was reverted, never applied). +> - **Phase 3 PENDING:** hops 0.17.0 → 0.18.2 → 0.19.2 → 0.20.4 → 1.0.0 → 1.1.1 → 1.2.1 → 1.3.2 → 2.0.1 → 2.1.0 → 2.2.0 → 2.3.0 → 2.4.1 → 2.5.0 → 2.6.0 (all on k8s 1.34, CRs already v1). Crossing **0.17 is the point of no return**. 
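The hop sequence above can be sanity-checked mechanically. The sketch below is an illustration, not part of the migration tooling: `skipped_minors` and `rewrite_checkpoint` are hypothetical helpers, and "one minor at a time, with a new major starting at `.0`" is this sketch's reading of the plan.

```python
START = "0.12.1"
CUTOFF = "0.17.0"  # first release past the point of no return
HOPS = [
    "0.13.0", "0.14.4", "0.15.1", "0.16.2",
    "0.17.0", "0.18.2", "0.19.2", "0.20.4",
    "1.0.0", "1.1.1", "1.2.1", "1.3.2",
    "2.0.1", "2.1.0", "2.2.0", "2.3.0", "2.4.1", "2.5.0", "2.6.0",
]


def parse(version):
    return tuple(int(part) for part in version.split("."))


def skipped_minors(start, hops):
    """Return (previous, hop) pairs where a hop jumps more than one minor."""
    problems = []
    prev = start
    for hop in hops:
        p, c = parse(prev), parse(hop)
        next_minor = c[0] == p[0] and c[1] == p[1] + 1
        next_major = c[0] == p[0] + 1 and c[1] == 0
        if not (next_minor or next_major):
            problems.append((prev, hop))
        prev = hop
    return problems


def rewrite_checkpoint(hops, cutoff):
    """Last hop below the cutoff: where every CR must already be on v1."""
    before = [h for h in hops if parse(h) < parse(cutoff)]
    return before[-1]
```

With the list as written, no minor is skipped and the checkpoint is `0.16.2`, the release where both `v1beta1` and `v1` are served.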
+ +--- + +## 1. Goal & why + +ESO is the **last remaining compatibility gate blocking the autonomous k8s 1.35 upgrade** (Kyverno was cleared to 1.18.1 earlier today). The installed ESO `0.12.x` supports only Kubernetes **1.19 → 1.31** ([support matrix](https://external-secrets.io/latest/introduction/stability-support/)); the k8s-version-check chain will refuse to advance the cluster past 1.31 while ESO sits at 0.12. The `2.x` series supports **k8s 1.34–1.35**, which clears the gate. + +The hard part is not the chart bump itself — it is that **ESO removed the `external-secrets.io/v1beta1` API**, and every one of our ExternalSecret / ClusterSecretStore resources is currently declared `v1beta1`. If we upgrade past the removal version without first rewriting the manifests to `v1`, ESO stops reconciling and synced Secrets go stale (apps keep their last-good Secret, but rotations and new secrets break). + +**Downtime tolerance:** brief, recoverable downtime of the ESO *controller* is acceptable. What must NOT happen is loss/corruption of the downstream Kubernetes `Secret` objects that apps mount (DB creds, API keys). Those must survive continuously. + +--- + +## 2. Current state + +### 2.1 Versions +| Component | Current | Target | +|---|---|---| +| Helm chart `external-secrets` | **0.12.1** | **2.6.0** | +| App / controller image | **v0.12.1** | **v2.6.0** | +| API version of all CRs | **`external-secrets.io/v1beta1`** | **`external-secrets.io/v1`** | +| Repo: `https://charts.external-secrets.io` | (unchanged) | (unchanged) | + +ESO stack: `stacks/external-secrets/main.tf`. `helm_release.external_secrets` pins `version = "0.12.1"`, namespace `external-secrets` (separate `kubernetes_namespace` resource, not `create_namespace`), and the **only** chart value set is `installCRDs = true` (via `yamlencode({ installCRDs = true })`). No webhook/replica/resource overrides. 
+ +### 2.2 Inventory (live, from `stacks/`) +| Kind | Count | apiVersion | Where | +|---|---|---|---| +| **ExternalSecret** (`kubernetes_manifest`) | **104** | all `v1beta1` (0 mismatches) | 73 `.tf` files | +| **ClusterSecretStore** (definitions) | **2** | both `v1beta1` | `stacks/external-secrets/main.tf` | +| SecretStore | 0 | — | — | +| PushSecret | 0 | — | — | +| ClusterExternalSecret | 0 | — | — | + +- **Only ONE apiVersion string exists in the whole tree:** `external-secrets.io/v1beta1` (106 occurrences = 104 ExternalSecret + 2 ClusterSecretStore). Zero `v1`, zero `v1alpha1`. → a clean single-target rewrite. +- **`secretStoreRef` split:** 78 ExternalSecrets → `vault-kv`, 26 → `vault-database` (78 + 26 = 104). The `kind = "ClusterSecretStore"` string also appears inside every `secretStoreRef`, so a naive `grep 'kind = "ClusterSecretStore"'` returns 106 — only **2** are real store definitions. +- **21 files carry >1 ExternalSecret** (max: `stacks/fire-planner/main.tf` = 5; then wealthfolio / real-estate-crawler / phpipam / payslip-ingest / n8n / job-hunter / ebooks = 3 each; 13 files = 2; 1 + 7 + 13 = 21). The 104-vs-73 gap (31) is these multi-secret files: 4 + 7×2 + 13×1 = 31 extra ExternalSecrets. +- **Nested-module ExternalSecrets** (easy to miss when scripting the bump): `stacks/instagram-poster/modules/instagram-poster/main.tf`, `stacks/postiz/modules/postiz/main.tf`, `stacks/technitium/modules/technitium/main.tf`, `stacks/mailserver/modules/mailserver/main.tf`, `stacks/monitoring/modules/monitoring/grafana.tf`, `stacks/proxmox-csi/modules/proxmox-csi/main.tf`. +- **Docs are STALE:** `.claude/CLAUDE.md` says "43 ExternalSecrets + 9 DB-creds". Live count is **104 ExternalSecrets / 73 files / 26 db-refs**. Fix in the migration PR.
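The apiVersion inventory above can be reproduced with a small counting helper. This is a sketch under stated assumptions: the function name `count_api_version` is hypothetical, and it assumes each apiVersion appears in the `.tf` files as a double-quoted string such as `"external-secrets.io/v1beta1"`, which is how the manifests are described here.

```shell
# count_api_version ROOT VERSION
# Hypothetical helper: counts occurrences of the quoted apiVersion string
# external-secrets.io/VERSION" in every *.tf file under ROOT, including
# nested module directories. The trailing double quote in the pattern keeps
# a search for v1 from also matching v1beta1 or v1alpha1.
count_api_version() {
  local root="$1" version="$2"
  grep -rhoF --include='*.tf' "external-secrets.io/${version}\"" "$root" \
    | wc -l \
    | tr -d '[:space:]'
}
```

Run from the repo root as `count_api_version stacks v1beta1` before the rewrite and again afterwards, alongside `count_api_version stacks v1`. Counting occurrences rather than matching files matters here because of the multi-secret files noted above.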
+ +### 2.3 The two ClusterSecretStores (`stacks/external-secrets/main.tf`) +Both `kubernetes_manifest`, both `external-secrets.io/v1beta1`, both `depends_on = [helm_release.external_secrets]`: +- **`vault-kv`** → Vault KV **v2** at `path = "secret"`, server `http://vault-active.vault.svc.cluster.local:8200`, auth `kubernetes` mount `kubernetes`, role `eso`, SA `external-secrets/external-secrets`. +- **`vault-database`** → identical except `path = "database"`, **`version = "v1"`** (Vault DB engine, KV-v1-style). + +ESO's Vault auth role `eso` (`stacks/vault/main.tf:486-511`): policy `eso-reader` (`secret/data/*` read+list, deny `secret/data/vault`, `database/static-creds/*` read), `token_ttl = token_period = 864000` (10d, periodic/auto-renew). + +### 2.4 Tier-0 / state +ESO is **Tier-0 (bootstrap)** (`.claude/CLAUDE.md` "Terraform State — Two-Tier Backend"; root `terragrunt.hcl` `tier0_stacks = ["infra","platform","cnpg","vault","dbaas","external-secrets"]`). Tier-0 ⇒ **local SOPS-encrypted state in git** (`state/stacks/external-secrets/terraform.tfstate`), NOT the PG backend. Workflow: `git pull` → `scripts/tg plan` → `scripts/tg apply` → `git push`; SOPS decrypt via Vault Transit (primary) → age fallback. **Tier-0 must apply before PG is reachable**, so the ESO upgrade cannot depend on PG. + +### 2.5 Provider versions (`stacks/external-secrets/providers.tf`) +- `required_providers` declares **only** `vault = hashicorp/vault, ~> 4.0`. +- `provider "kubernetes"` and `provider "helm"` are declared **without version constraints** (resolve from root / `.terraform.lock.hcl`). The `helm` block already uses the **v3-style nested `kubernetes = {…}` argument** (not the legacy `kubernetes {}` block) ⇒ helm provider is **v3.x or v4.x** in the lockfile. **No `kubectl` provider** in this stack. No `required_version` pinned here. 
+- ⚠️ **Verify the resolved helm provider version** in `.terraform.lock.hcl` before starting — the prompt referenced `~> 4.0` for helm; the *stack* only pins that for `vault`. Either way the v3-syntax helm block + an SDK-v3 provider is compatible with the chart (see §4.5). + +### 2.6 Plan-time coupling (the cross-cutting risk) +**15 stacks read ESO-created Secrets at plan time** via `data "kubernetes_secret"` (avoids a Vault dependency at plan): `actualbudget, affine, changedetection, coturn, ebooks, fire-planner, freedify, freshrss, grampsweb, k8s-dashboard (dashboard_injector.tf), navidrome, owntracks, real-estate-crawler, servarr, technitium (modules/technitium)`. + +The documented **first-apply gotcha** (`.claude/CLAUDE.md`, `docs/architecture/secrets.md:360`, `stacks/fire-planner/main.tf:574`): the Secret must exist before the `data "kubernetes_secret"` plans, so on first creation you must `terragrunt apply -target=kubernetes_manifest.` first, then full apply. **Why this matters for the migration:** the `kubernetes_manifest` provider treats `apiVersion` as part of resource identity, so bumping `v1beta1`→`v1` **forces a replace** of all 104 ExternalSecrets. During replace there is a window where the new CR (and thus the synced Secret) may not yet be materialized when the same stack's `data "kubernetes_secret"` plans → the two-phase `-target` apply is needed **fleet-wide for the v1 rewrite step, not just fire-planner.** + +### 2.7 Vault DB rotation (rotation interplay) +`stacks/vault/main.tf`: **25 `vault_database_secret_backend_static_role`, every one `rotation_period = 604800` (7 days)** (8 MySQL + 17 PostgreSQL static roles). ESO syncs these via `vault-database` → `remoteRef.key = "static-creds/"`. Apps reading a rotated secret only at startup carry a Stakater Reloader annotation. 
**Implication:** any ESO controller downtime longer than the gap to the next rotation could leave a Secret stale across a rotation; keep controller downtime short and re-sync promptly. + +### 2.8 git-crypt landmine (adjacent, not in ESO stack) +`.claude/CLAUDE.md:146` + `docs/architecture/ci-cd.md:108` + `stacks/kyverno/modules/kyverno/tls-secret-sync.tf`: on a **git-crypt-locked clone**, `kubernetes_secret.tls_secret` reads `secrets/fullchain.pem`/`privkey.pem` via `file()` which returns **ciphertext**, corrupting the wildcard TLS secret Kyverno clones cluster-wide. **The ESO stack itself has NO `file()` reads of git-crypt secrets** — so this landmine does not bite the ESO upgrade directly. It is listed here only as a guardrail: do not piggyback unrelated kyverno applies during this work, and run all applies from an **unlocked** checkout. + +--- + +## 3. Target + +- Helm chart **`external-secrets` 2.6.0** (app **v2.6.0**), repo `https://charts.external-secrets.io`. +- All ExternalSecret + ClusterSecretStore CRs on **`external-secrets.io/v1`**. +- Cluster ESO compatible with **k8s 1.34–1.35** ⇒ unblocks the autonomous 1.35 upgrade. + +--- + +## 4. Key findings (the decisive facts) + +> Sourced from ESO official docs + GitHub release notes; verbatim quotes below. + +### 4.1 Chart version == app version (premise check) +The chart version and app version are released **in lockstep and are the same number**. `Chart.yaml`: `version: 0.12.1 / appVersion: v0.12.1`; `version: 2.6.0 / appVersion: v2.6.0`. The app series ran `…0.20.4 → 1.0.0 → … → 2.0.0 → … → 2.6.0`. **Crucially, the `v1.0.0` and `v2.0.0` APP releases are NOT the `external-secrets.io/v1` API** — `v1.0.0` is just "continuation after 0.20.4" (release diff `v0.20.4...v1.0.0`, no API change), and `v2.0.0`'s only breaking change is removing the unmaintained **Alibaba + Device42** providers (we use neither — only Vault). The API migration happened back at **0.16/0.17**. 
Source: [v1.0.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v1.0.0) · [v2.0.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v2.0.0). + +### 4.2 Version path: **NO skipping minors — step one minor at a time** +Official policy, verbatim ([stability-support](https://external-secrets.io/latest/introduction/stability-support/)): +> "**Upgrade version by version** — We strongly recommend upgrading one minor version at a time (e.g., 0.18.x → 0.19.x → 0.20.x) rather than skipping versions." + +Maintainer (issue [#4785](https://github.com/external-secrets/external-secrets/issues/4785), @gusfcarvalho): *"We are pre release… Every minor bump should be treated as a major bump until we go 1.0."* ⇒ **You CANNOT helm-upgrade 0.12.1 → 2.6.0 directly.** You must step each minor: `0.12 → 0.13 → 0.14 → 0.15 → 0.16 → 0.17 → 0.18 → 0.19 → 0.20 → 1.x → 2.x`. + +### 4.3 k8s ↔ ESO must advance roughly in lockstep +Each ESO release targets a **narrow** k8s band ([support matrix](https://external-secrets.io/latest/introduction/stability-support/)): + +| ESO | k8s band | +|---|---| +| 0.12.x | 1.19 → 1.31 | +| 0.16.x | 1.32 | +| 0.17.x | 1.33 | +| 2.0 – 2.5 | 1.34 – 1.35 | +| 2.6 (latest) | (matrix row not yet appended; 2.x band is consistently 1.34–1.35 — see Open Questions) | + +**This is the single most important sequencing constraint.** ESO doesn't "support only ≤ its max k8s" in a wide range — older ESO may not run cleanly on a *much newer* k8s either. The bands imply the ESO upgrade and the k8s upgrade need to be **interleaved**, not "finish ESO, then bump k8s in one jump." Practical reading: the cluster is currently on k8s ≤1.31 (ESO 0.12 blocks past it). The 0.16/0.17 steps want k8s 1.32/1.33; the 2.x steps want 1.34/1.35. So this is a **coordinated ESO+k8s climb**, e.g. ESO→0.16 alongside k8s→1.32, ESO→0.17 alongside k8s→1.33, then ESO→2.x alongside k8s→1.34→1.35. 
(The k8s climb is itself sequential via the version-check chain; this doc focuses on the ESO half but flags the coupling — see Open Questions for who drives the interleave.) + +### 4.4 API migration: **must rewrite manifests to `v1` FIRST — there is NO v1beta1→v1 conversion webhook** +- **`external-secrets.io/v1` promoted to STORAGE version: v0.16.0.** v0.16.0 release notes "BREAKING CHANGES": *"Promotion of ExternalSecret/v1 and SecretStore/v1 and their cluster counterparts"* and *"Removal of Conversion Webhooks and …/v1alpha1…"*. From 0.16, **etcd stores `v1`**. Source: [v0.16.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v0.16.0). +- **`external-secrets.io/v1beta1` STOPS BEING SERVED (hard cutoff): v0.17.0.** Verbatim ([v0.17.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v0.17.0)): + > "v0.17.0 Stops serving `v1beta1` apis. You need to update your manifests from `v1beta1` to `v1` prior to updating from `v0.16` to `v0.17`. The only change needed is upgrading your manifests to `v1` (i.e. removing the `beta1` from `v1beta1`). … Be sure to do that to all your manifests prior to bumping to `v0.17.0`! `v0.16.2` already supports `v1` so this process should be smooth." +- **No v1beta1→v1 conversion webhook.** The only conversion webhook that ever existed was v1alpha1→v1beta1, **removed in 0.16**. Maintainer (issue [#5478](https://github.com/external-secrets/external-secrets/issues/5478), @gusfcarvalho): the post-0.16 "drift" is simply that etcd now stores v1 — *"This isn't really a conversion issue."* ⇒ **old v1beta1 manifests do NOT keep working past 0.17 via any auto-conversion.** + - **Verdict: MUST-REWRITE-FIRST.** Rewrite all CRs to `v1` while on **0.16.x** (which serves *both* v1beta1 and v1), then upgrade to 0.17. Real-world confirmation (issue [#4785](https://github.com/external-secrets/external-secrets/issues/4785), @Dutchy-): *"I was able to change v1beta1 to v1 on 0.16 without issues. 
After that I was able to upgrade to 0.17."* + - There is a deprecated escape hatch in chart 2.6.0 — `unsafeServeV1Beta1: true` re-enables v1beta1 serving for stragglers — but its own values comment says *"This flag will be removed on 2026.05.01"* (i.e. **already past**, do not rely on it). +- **Schema change is a PURE apiVersion string bump — ZERO field changes.** CRD `openAPIV3Schema` diff (v0.16.2 bundle, which serves both): ExternalSecret / SecretStore / ClusterSecretStore / ClusterExternalSecret have **byte-identical** spec field sets between v1beta1 and v1 (`{data, dataFrom, refreshInterval, refreshPolicy, secretStoreRef, target}` for ExternalSecret). Maintainer (issue #4785, @Skarlso): *"Just change your manifests to be v1 and upgrade… We don't have anything fancy that you need to do."* PushSecret only ever had `v1alpha1` (no v1beta1) — **unaffected** (we have 0 anyway). + +### 4.5 Helm chart values + CRD handling (0.12 → 2.6) +- **No top-level values removed or renamed.** `values.yaml` diff 0.12.1↔2.6.0 is **additive only** (new keys: `enableHTTP2, extraInitContainers, genericTargets, grafanaDashboard, hostAliases, hostUsers, leaderElectionID, livenessProbe, openshiftFinalizers, processClusterGenerator, processClusterPushSecret, processSecretStore, readinessProbe, strategy, systemAuthDelegator, vault`). Our single value `installCRDs = true` survives. +- **`installCRDs` still works** in 2.6.0 (defaults `true`, "install and upgrade CRDs through helm chart"). CRDs are **templated into the single `external-secrets` chart** and **upgraded by `helm upgrade`** automatically — there is **no separate CRDs subchart**, and no manual `kubectl apply` of CRDs is required by default. (Out-of-band bundle, if ever needed, lives at `deploy/crds/bundle.yaml` per release tag.) The only CRD-value change: `crds.conversion.enabled` defaults `true` in 0.12.1 (for the old v1alpha1 webhook) → `false` in 2.6.0 ("we stopped supporting v1alpha1"). 
We don't set it, so the new default is fine. +- **CRD storedVersions bookkeeping (the one real pre-flight check):** v0.16.0 notes warn to ensure no CRD still lists `v1alpha1` in `.status.storedVersions` before/at 0.16, with a `kubectl patch` to set it to `["v1","v1beta1"]` if needed. This is CRD metadata hygiene, NOT secret deletion. +- **Helm provider:** `Chart.yaml apiVersion: v2` (Helm 3 chart) in both 0.12.1 and 2.6.0; **no minimum Helm version declared** (only `kubeVersion: ">= 1.19.0-0"`). The Terraform helm provider on Helm SDK v3 (v3.x/v4.x) is compatible. **The 2.x chart does NOT require a newer helm provider than 0.12 did** — the v3-style helm block in `providers.tf` already satisfies it. (Still: pin/verify the resolved version in the lockfile; see Open Questions.) + +### 4.6 Data migration: **downstream Secrets survive** +The synced Kubernetes `Secret` objects are **not deleted or force-resynced** by these upgrades. The change is an apiVersion bump on the *custom resources*, whose `spec` is schema-identical, so the controller keeps reconciling the same target Secrets. A controller restart triggers a normal **reconcile (re-assert, not delete)**. Caveat: no release note says verbatim "synced Secrets are preserved"; the conclusion is from (a) schema identity, (b) maintainers calling it "100% compatible" (issue #5478), (c) absence of any "secrets recreated/deleted" note. **Standard caution: snapshot/back up all ESO-created Secrets before the 0.16→0.17 step** (see §8 verification). Unrelated watch-item: v0.14.0 flagged a stateful-**generators** change — we use no generators, so N/A. + +--- + +## 5. Migration strategy (ordered, do-this-then-that) + +> **Pre-reqs every step:** run from an **unlocked** infra checkout (git-crypt unlocked); `vault login -method=oidc`; ESO is **Tier-0** so use `scripts/tg plan` / `scripts/tg apply` against `stacks/external-secrets` and **`git push`** after each apply (SOPS state). 
Claim presence before each apply: `~/code/scripts/presence claim stack:external-secrets --purpose "ESO 0.12→2.x migration step N"`. Wait for the controller `Deployment` to roll out healthy before the next hop. + +### Phase 0 — Pre-flight (no changes) +1. Confirm cluster k8s version and the version-check chain's current target; **coordinate with the k8s climb** (see §4.3 / Open Questions). Decide who drives the interleave. +2. `kubectl get crd | grep external-secrets.io` and for each: `kubectl get crd -o jsonpath='{.status.storedVersions}'` — confirm none still list `v1alpha1`. If any do, plan the `kubectl patch …/status storedVersions=["v1beta1"]` per the v0.16.0 note (do this *before* reaching 0.16). +3. **Snapshot all ESO-managed Secrets** (rollback safety net): + `kubectl get externalsecrets -A` (record the 104) and `for ns/secret in : kubectl get secret -n -o yaml > backup/-.yaml`. Keep outside git-crypt or encrypt. +4. Inspect `.terraform.lock.hcl` in `stacks/external-secrets` — record resolved `helm` + `kubernetes` provider versions. If helm provider < what 2.6.0 needs (it doesn't appear to need anything beyond SDK v3), bump the constraint as its own committed change first. +5. Read `docs/architecture/secrets.md` + the fire-planner first-apply comment to re-confirm the `-target` pattern for the v1 rewrite step. + +### Phase 1 — Climb to 0.16.x (chart bump only, NO manifest change yet) +ESO `0.16.x` is the **transition version** that serves *both* v1beta1 and v1. Climb to it one minor at a time, leaving all CRs as `v1beta1`: +6. For `v` in `0.13.0, 0.14.0, 0.15.x, 0.16.2` (use latest patch of each minor): set `helm_release.external_secrets.version = ""`, `scripts/tg plan` (expect: chart upgrade + CRD upgrade in place; **no `kubernetes_manifest` replacements** — apiVersion unchanged), `scripts/tg apply`, `git push`, wait for rollout, verify `kubectl get externalsecrets -A` all `SecretSynced=True`. 
+ - **Interleave k8s as required:** before/at 0.16 the cluster should be on **k8s 1.32** (0.16 band). Advance k8s via the normal version-check chain to 1.32 around this point. + - Watch the **0.14.0** notes (generators) — N/A for us, but eyeball the plan diff anyway. +7. **Land on 0.16.2 and STOP.** Verify both APIs are served: `kubectl get externalsecrets.v1.external-secrets.io -A` and `kubectl get externalsecrets.v1beta1.external-secrets.io -A` both work. + +### Phase 2 — Rewrite all 104 CRs + 2 stores to `v1` (while on 0.16.2) +This is the MUST-DO-FIRST API migration, done in the safe window where both versions are served. +8. **Mechanical rewrite** across `stacks/`: replace the apiVersion string `external-secrets.io/v1beta1` → `external-secrets.io/v1` in every ExternalSecret and ClusterSecretStore `kubernetes_manifest` (104 + 2 = 106 occurrences across 73 files, **including the 6 nested-module files** in §2.2). **No other field changes** (schema identical). Do this in a worktree, committed file-by-file. + - Leave `secretStoreRef.kind = "ClusterSecretStore"` (that's a kind reference, not an apiVersion — unaffected). +9. **Two-phase apply because `kubernetes_manifest` replace + plan-time `data "kubernetes_secret"`:** + a. **Stores first:** `scripts/tg apply -target='kubernetes_manifest.css_vault_kv' -target='kubernetes_manifest.css_vault_db'` in `stacks/external-secrets` (they get replaced to v1; ESO still serves v1beta1 too, so in-flight ExternalSecrets keep syncing). `git push`. + b. **ExternalSecrets, per stack:** for each of the 73 stacks, `scripts/tg apply -target=kubernetes_manifest.` FIRST (materializes the replaced v1 CR + its Secret), THEN a full `scripts/tg apply` for that stack (lets the 15 plan-time `data "kubernetes_secret"` reads resolve against the now-existing Secret). The **15 plan-time-coupled stacks** (§2.6) absolutely need the `-target` first; the rest are lower-risk but follow the same pattern for safety. 
`git push` per stack (Tier-1 stacks use PG state; ESO stack is Tier-0). + - Because the spec is identical, the *replace* re-creates an identical CR; ESO reconciles and re-asserts the same target Secret (no value change) → apps keep their Secret throughout. +10. **Verify the rewrite fully landed:** `grep -rc 'external-secrets.io/v1beta1' stacks/` returns **0**; `kubectl get externalsecrets -A -o jsonpath=…` is used to confirm all are served as `v1`; all `SecretSynced=True`; spot-check that a rotated DB cred (e.g. `nextcloud-db-creds`) is still valid. + +### Phase 3 — Cross the 0.17 cutoff, then climb to 2.6.0 +Only after Phase 2 is 100% applied (zero v1beta1 in repo AND in etcd): +11. Bump chart `0.16.2 → 0.17.x`. `scripts/tg plan` (expect chart/CRD upgrade; **no manifest replacements** — already v1), apply, push, rollout, verify all synced. **k8s should be 1.33** (0.17 band) around here. +12. Continue one minor at a time: `0.18.x → 0.19.x → 0.20.x → 1.0.0 → 1.x (latest) → 2.0.0 → … → 2.6.0`. At each: bump `version`, plan, apply, push, rollout, verify synced. **k8s reaches 1.34 then 1.35** across the 2.x steps. + - **At 2.0.0:** confirm the plan shows nothing odd from the Alibaba/Device42 provider removal (we use only Vault — should be a no-op). +13. **Land on 2.6.0.** Verify: controller image `v2.6.0`, all 104 ExternalSecrets `SecretSynced=True`, both ClusterSecretStores `Valid=True`. + +### Phase 4 — Close the gate + docs +14. Advance k8s to **1.35** via the version-check chain if not already; confirm the **compat-gate now lists ESO as compatible** and 1.35 is unblocked. +15. Update `.claude/CLAUDE.md` Secrets Management section: correct counts (**104 ExternalSecrets / 73 files / 26 db-refs**), apiVersion now `v1`. Update `docs/architecture/secrets.md`. Commit as part of the work (audit trail). + +--- + +## 6.
Risks & mitigations + +| Risk | Likelihood | Mitigation | +|---|---|---| +| **Secret-sync outage → app DB/API auth failures** during controller restarts or the replace window | Med | Spec is identical so re-sync re-asserts the same value; keep each controller restart short; do Phase-2 replaces **per stack** (small blast radius); the 15 plan-time stacks use `-target` first so the Secret exists before dependents plan. Pre-step Secret snapshot (Phase 0.3) for instant restore. | +| **Crossing 0.17 with any CR still v1beta1** → ESO stops reconciling those, secrets go stale | High if rushed | Phase 2 gate: `grep -rc v1beta1 stacks/` **must be 0** AND `kubectl get …v1beta1…` returns nothing live before Phase 3. Do not skip 0.16. | +| **CRD removal/replace by helm dropping data** | Low | Chart manages CRDs in-place via `installCRDs=true` (upgrade, not delete-recreate); CRs are the data and they're untouched by a CRD *upgrade*. Snapshot anyway. Never `helm uninstall` (that can GC CRDs). | +| **No conversion webhook safety net** (must-rewrite-first) | Certain (by design) | Whole strategy is built on rewriting at 0.16. The deprecated `unsafeServeV1Beta1` is already past its 2026-05-01 removal — do NOT rely on it. | +| **`kubernetes_manifest` forces replace on apiVersion bump** → transient gap + plan-time read failures | High | Two-phase `-target` apply fleet-wide (Phase 2.9); identical spec ⇒ replacement CR is equivalent. | +| **Vault 7-day DB rotation lands mid-migration** → a Secret stale across rotation if controller down | Med | Keep controller downtime < rotation gap; re-sync immediately after each hop; Reloader annotations already re-roll pods on Secret change; if a rotation is imminent, sequence the affected db stacks last and verify those creds explicitly. | +| **git-crypt tls-secret-sync landmine** | Low (not in ESO stack) | ESO stack has no `file()` git-crypt reads; run from an **unlocked** checkout; do **not** piggyback kyverno applies during this work. 
| +| **helm/k8s provider in lockfile too old for 2.x chart** | Low | Phase 0.4 verify; bump constraint as a separate committed change if needed (chart needs only Helm SDK v3, already satisfied). | +| **k8s/ESO band mismatch** (e.g. ESO 0.12 on k8s 1.33) | Med | Interleave the climbs per §4.3; don't jump k8s far ahead of ESO or vice-versa. | +| **Many small applies = long, error-prone session** | Med | Script the per-stack `-target`-then-full loop; checkpoint with `kubectl get externalsecrets -A` after each; the rewrite itself is a single `sed`-class change so low semantic risk. | + +--- + +## 7. Rollback plan (per hop) + +- **During Phase 1 (chart climb, still v1beta1):** revert `version` to the previous minor in `stacks/external-secrets/main.tf`, `scripts/tg apply`, `git push`. Helm rolls the controller back; CRs unchanged. Clean. +- **During Phase 2 (v1 rewrite, on 0.16.2):** 0.16.2 serves both APIs, so you can `git revert` the apiVersion-bump commits and re-apply — the CRs flip back to v1beta1 cleanly (both served). Secrets unaffected (identical spec). This is the **last point of easy rollback**. +- **After Phase 3 (≥0.17, v1beta1 no longer served):** **rollback is HARD** — once etcd stores v1-only and the controller is ≥0.17, downgrading cannot re-serve v1beta1 and v1 objects can't be auto-converted back ([general guidance + maintainer position](https://github.com/external-secrets/external-secrets/issues/5478)). Treat **crossing 0.17 as the point of no return.** If you must recover: re-install 0.16.2 (serves both), restore CRs from the Phase-0 manifest snapshot, and restore Secrets from the Secret snapshot. This is a disaster-recovery path, not a routine rollback — hence the Phase-2 gate must be airtight. +- **Always available:** the Phase-0.3 Secret backups let you `kubectl apply` the last-good Secret to keep an app authenticating while you fix ESO. + +--- + +## 8. 
Verification + +**Per hop:** +- `kubectl -n external-secrets get deploy,po` healthy; controller image tag == target. +- `kubectl get externalsecrets -A` → all 104 `STATUS=SecretSynced` / `READY=True`. +- `kubectl get clustersecretstores` → `vault-kv` + `vault-database` `Valid=True`. + +**After Phase 2 (v1 rewrite):** +- `grep -rc 'external-secrets.io/v1beta1' stacks/` → **0**. +- `kubectl get externalsecrets.v1beta1.external-secrets.io -A` → still served on 0.16 (sanity), but `kubectl get externalsecrets.v1.external-secrets.io -A` is the real check. +- Spot-check a rotated DB cred end-to-end: e.g. `nextcloud-db-creds` value matches `vault read database/static-creds/mysql-nextcloud` and the app authenticates. + +**Final (2.6.0):** +- Controller image `v2.6.0`; all ExternalSecrets synced; both stores valid. +- Diff a sample of the 104 target Secrets against the Phase-0 backups → values unchanged (continuity proof). +- App health: spot-check 3–4 high-value consumers (nextcloud, immich, grafana, a `vault-database` consumer) — pods running, no auth errors in logs. +- **Compat-gate:** run the upgrade-state / k8s-version-check audit — ESO no longer flagged as a 1.35 blocker; k8s 1.35 upgrade proceeds. + +--- + +## 9. Open questions + +1. **k8s/ESO interleave ownership.** §4.3 shows narrow per-version k8s bands (0.16→1.32, 0.17→1.33, 2.x→1.34-1.35). The cluster is currently ≤1.31. **Who drives the interleave** — does this migration also advance k8s step-by-step, or does the autonomous version-check chain advance k8s and we time ESO hops to it? Need the exact current k8s version and the chain's behavior when ESO is the only gate. (Decisive for sequencing Phases 1/3.) +2. **2.6.0 ↔ k8s 1.35 explicit support.** The support matrix table currently ends at **2.5** (k8s 1.34-1.35). 2.6.0 exists on GitHub but the matrix row isn't appended yet; the whole 2.x band is consistently 1.34-1.35, so 2.6 on 1.35 is a *strong inference* not a quoted row. 
Confirm via `Chart.yaml` `kubeVersion` of 2.6.0 or a 2.6 release note before relying on it. ([matrix](https://external-secrets.io/latest/introduction/stability-support/)) +3. **Resolved helm provider version.** The stack only pins `vault ~> 4.0`; helm/k8s are unpinned (lockfile-resolved). Confirm the lockfile version and whether to pin it explicitly as part of this work. (Chart needs only Helm SDK v3 — likely a no-op, but verify.) +4. **Intermediate-minor patch selection.** Use latest patch of each minor (0.13.x, 0.14.x, 0.15.x). Confirm 0.16.**2** specifically (the note says 0.16.2 already supports v1) vs a later 0.16 patch. +5. **Per-stack apply automation.** 73 stacks × (target + full) apply is large. Acceptable to script a loop, or prefer manual per-stack with checkpoints? Some stacks have other in-flight drift that a full apply would also push — needs a clean-plan check per stack first. +6. **Stateful generators / advanced features.** Confirmed we use none (0 SecretStore/PushSecret/ClusterExternalSecret/generators), so the v0.14 generator and v2.0 provider-removal breaking changes are N/A — but re-confirm no generator usage crept in before Phase 3. + +--- + +## 10. 
Sources (decisive facts) + +- Skip-version policy + k8s support matrix: [stability-support](https://external-secrets.io/latest/introduction/stability-support/) +- `v1` promoted to storage version (0.16.0): [v0.16.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v0.16.0) +- `v1beta1` removed / "rewrite manifests to v1 first" (0.17.0): [v0.17.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v0.17.0) +- No conversion webhook / "not a conversion issue" (#5478): [#5478](https://github.com/external-secrets/external-secrets/issues/5478) +- v1beta1↔v1 schema identical / "nothing fancy" (#4785): [#4785](https://github.com/external-secrets/external-secrets/issues/4785) +- App v1.0.0 ≠ API v1: [v1.0.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v1.0.0) +- v2.0.0 only removes Alibaba/Device42: [v2.0.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v2.0.0) +- Chart 2.6.0 on ArtifactHub: diff --git a/docs/plans/2026-06-21-t3-idle-migrate-design.md b/docs/plans/2026-06-21-t3-idle-migrate-design.md new file mode 100644 index 00000000..46c43bfa --- /dev/null +++ b/docs/plans/2026-06-21-t3-idle-migrate-design.md @@ -0,0 +1,140 @@ +# t3 idle-migrate — graceful overnight restart of deferred t3-serve instances — design + +- **Date:** 2026-06-21 +- **Status:** implemented 2026-06-21 (branch `wizard/t3-idle-migrate`; deployed + timer enabled on devvm, first overnight drain pending) +- **Owner:** Viktor (wizard) +- **Builds on:** the gated nightly tracker `t3-autoupdate` (re-enabled 2026-06-16, `scripts/t3-autoupdate.{sh,service,timer}`; design history in `docs/runbooks/t3-version-bump.md` + post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`) and the per-user `t3-serve@` systemd instances (`scripts/t3-serve@.service`). + +## Goal + +When `t3-autoupdate` **defers** a user's `t3-serve` restart because that user has an active agent at the daily 04:00–05:00 window, the user's running server keeps executing its start-time t3 version indefinitely — their client (which tracks the freshly-installed global binary) then shows *"Client and server versions differ."* For a user who is busy at every daily window (wizard: long-lived/AFK sessions overnight), the deferral never resolves and the skew persists for days.
+ +Add a **small, idle-gated overnight job that drains those deferrals**: restart a deferred `t3-serve@` onto the current binary **only when nothing is actively working** in that instance, so the migration happens during a genuine quiet gap rather than killing in-flight agent turns. + +## Background — why the skew persists (root cause, verified 2026-06-21) + +- All `t3-serve@` instances share ONE global `/usr/bin/t3` (→ `/usr/lib/node_modules/t3`). `t3-autoupdate` installs a new nightly to that single binary, health-gates it against a **copy** of wizard's populated `state.sqlite`, then **canary-restarts idle instances one at a time**, verifying pairing after each (`scripts/t3-autoupdate.sh` step 6). +- Its idle check is coarse — `unit_busy()`: + ```sh + pid=$(systemctl show -p MainPID --value "$unit") + pgrep -aP "$pid" | grep -qiE 'claude|codex|opencode' + ``` + i.e. "does the server have any `claude`/`codex`/`opencode` **child**?" But `t3 serve` keeps one such child alive per **open** session, even one idle awaiting input. Live snapshot 2026-06-21: wizard had **5 `running` provider sessions** (= 5 `claude` children) but only **3 mid-turn**, plus **89 `ready` (open-idle)** threads. So `unit_busy` is true whenever any tab is open → wizard is deferred at every window. +- The job runs **once daily** (`OnCalendar=*-*-* 04:00:00`, `RandomizedDelaySec=1h`, `Persistent` deliberately omitted) and **only acts on a version bump** (exits early if `installed == target`). So once the binary is already current, nothing re-triggers a restart of a still-stale running server until the *next* new nightly — and only if the user happens to be idle then. +- Confirmed in the logs: `t3-autoupdate: deferring t3-serve@wizard.service (active agent) — migrates on its next idle restart` on **both** Jun 20 and Jun 21 windows; wizard's server has been up since Jun 20 06:17 on `…20260620.605` while the binary + client are on `…20260621.613`. + +## Decisions (from brainstorm 2026-06-21) + +1. 
**"Safe to restart" = no turn in flight AND a quiet buffer.** Not "zero open sessions" (that would essentially never fire for wizard). Open-but-idle tabs are acceptable to drop — t3 persists thread history in `state.sqlite` and the client reconnects/resumes (the daily job already restarts idle instances routinely; restart→resume is the exercised path). To verify during implementation: the user-facing resume after a server restart. +2. **Cadence: overnight window only.** Frequent checks within a fixed overnight window; never disconnects tabs during the working day. Migrates within ~1 night of a build landing. +3. **Scope: all `t3-serve@`, self-limiting.** The job restarts only an instance that actually *owes* a migration (a deferral marker exists). Users already migrated at the daily window have no marker → no-op. No hardcoded per-user logic. +4. **Approach C: extract a shared safe-restart helper, reuse from both jobs.** One audited copy of the dangerous backup→restart→verify→recover logic; the new job adds only *scheduling + gating*. + +## Constraints (load-bearing) + +1. **The binary is global; migrations are forward-only and per-user-DB.** You cannot keep one user on the old version while others run the new one. A real-user forward-migration failure therefore means the build is unsafe for a real user → the only consistent recovery is the daily job's existing one (restore that user's DB + roll the **global** binary back to last-good + freeze + alert). This is a rare tail (the build was already migration-gated against a copy of wizard's real DB at install time), but the idle path must not invent a weaker recovery. +2. **Per-user secret boundary.** A user's `~/.t3/userdata/state.sqlite` is mode 600 and may not be read as another user. The job runs as root (system service) but reads each user's DB **as that user** via `runuser -u <user> -- sqlite3 …` (the pattern `backup_all` already uses), read-only (`mode=ro`) so it never locks the live WAL. +3.
**Fail closed.** Any uncertainty about whether an instance is safe to restart (DB locked/busy/unreadable, query error, unparseable timestamp) → treat as *not safe*, skip this tick, retry in 20 min. Never restart on doubt. +4. **Do not change the daily job's gated-install behavior.** The step-6 extraction must be behavior-preserving; health-gate, canary, downgrade-guard, freeze, and rollback stay exactly as today. +5. **Infra-as-code via the devvm installer.** Sources live in `scripts/`; deployment is `scripts/workstation/setup-devvm.sh` (the devvm is hand-managed VM 102 — no Terraform apply). Shared-devvm deploy takes a presence claim. + +## Design + +### Components + +Four new files in `scripts/` + a one-line addition to the existing job: + +1. **`scripts/t3-safe-restart.sh`** — shared library, sourced (not executed). Holds the per-unit "dangerous" routine extracted from `t3-autoupdate.sh` step 6 as `safe_restart_unit `: + pre-restart `VACUUM INTO` backup (as the owner) → `systemctl restart` → poll `verify_pairing` (15×2s ≈ 30s) → on failure: restore that user's DB from the pre-restart backup, `rollback_binary` to last-good, `touch $FREEZE_FILE`, log+alert. The shared helpers it needs (`LOG`, `ver`, `osusers`, `ak_for`, `verify_pairing`, `prebump_of`, `rollback_binary`, `DISPATCH`/`BACKUP_DIR`/… config) move into the lib too. Installed to `/usr/local/lib/t3-safe-restart.sh`. + **Contract:** returns `0` on verified success, **non-zero** after performing recovery+freeze on failure. This is the one non-verbatim change to step-6 logic — today it `exit 1`s inline; the extracted function `return`s instead so the *caller* decides (the daily job `exit 1`s on non-zero exactly as today; the idle job `break`s). Behavior is otherwise identical. + +2. **`scripts/t3-migrate-idle.sh`** — the new job (scheduling + gating only). Installed to `/usr/local/bin/t3-migrate-idle`. Sources the lib; per tick, drains the deferral directory (control flow below). + +3. 
**`scripts/t3-migrate-idle.service`** — `Type=oneshot`, `ExecStart=/usr/local/bin/t3-migrate-idle`. (No `EnvironmentFile` needed; env-overridable knobs have defaults.) + +4. **`scripts/t3-migrate-idle.timer`** — overnight window, frequent checks (comments sit on their own lines because systemd does not support trailing `#` comments after a value): + ```ini + [Timer] + # Fires 01:00,01:20,…,05:40; none at/after 06:00. System TZ (UTC); tune the window. + OnCalendar=*-*-* 01..05:00/20 + # Never replay a missed migrate-restart at an unpredictable time. + Persistent=false + RandomizedDelaySec=120 + ``` + +5. **One-line edit to `t3-autoupdate.sh`** — in the existing defer branch, *also record* the deferral: + ```sh + LOG "deferring $unit (active agent) — migrates on its next idle restart" + mkdir -p "$DEFER_DIR" 2>/dev/null; printf '%s\n' "$target" > "$DEFER_DIR/$u" # NEW + deferred=$((deferred+1)); continue + ``` + where `DEFER_DIR=/var/lib/t3-autoupdate/deferred`. This is the *only* behavioral change to the scarred script beyond the verbatim step-6 extraction. + +### Why a deferral marker (not version-introspection) + +The marker makes "which instances owe a restart" **exact** and decouples it from the binary-is-current problem — the daily job already *knows* it deferred wizard, so it records that fact. The idle job drains the directory; the version string in the marker is informational (a restart always picks up whatever binary is current). The marker is removed only after the restart's pairing is verified.
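
The marker lifecycle described above can be sketched in a few lines of bash. This is a minimal illustration under stated assumptions, not the shipped code: `record_deferral`, `owed_users`, and `clear_deferral` are hypothetical helper names, the version `1.2.3` is made up, and `DEFER_DIR` points at a throwaway temp directory rather than `/var/lib/t3-autoupdate/deferred`.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers illustrating the deferral-marker lifecycle.
set -u
DEFER_DIR="$(mktemp -d)"   # assumption: a temp dir stands in for the real directory

# Daily job: remember that user $1 was deferred while moving to version $2.
record_deferral() { mkdir -p "$DEFER_DIR" && printf '%s\n' "$2" >"$DEFER_DIR/$1"; }

# Idle job: list the users that still owe a restart (one per line, glob order).
owed_users() {
  local m
  for m in "$DEFER_DIR"/*; do
    [ -e "$m" ] || continue   # empty directory: the glob stays literal, skip it
    basename "$m"
  done
}

# Idle job: drop the marker once the restart has been verified.
clear_deferral() { rm -f "$DEFER_DIR/$1"; }

record_deferral wizard 1.2.3
record_deferral alice 1.2.3
clear_deferral alice          # alice was migrated; wizard still owes a restart
owed_users
```

Because the marker's existence is the only signal, a user with no marker costs the idle job nothing, which is what makes the job self-limiting.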
+ +### Control flow of `t3-migrate-idle` (per tick) + +``` +for marker in $DEFER_DIR/*: # nothing deferred → no-op + user = basename(marker); unit = t3-serve@<user>.service + [ unit is an active running service ] or { rm marker; continue } # gone + if unit ActiveEnterTimestamp > mtime(marker): rm marker; continue # already restarted (manual/other) → just clear + if not safe_to_restart(user): continue # mid-turn or not quiet → try next tick + target = contents(marker) + if safe_restart_unit(unit, target): rm marker # success: verified on new binary + else: # helper already restored DB + rolled back binary + froze + alerted + break # frozen: stop draining; a human investigates +``` + +### `safe_to_restart(user)` — the gate + +Single read-only query, run as the user: + +```sh +runuser -u "$user" -- sqlite3 "file:/home/$user/.t3/userdata/state.sqlite?mode=ro" " + SELECT + (SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL), + CAST((julianday('now') + - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT) + FROM projection_thread_sessions;" +``` + +- Column 1 = **active turns**; must be `0`. (`active_turn_id` is set exactly while a turn runs — verified 2026-06-21.) +- Column 2 = **idle seconds** = now − most-recent thread activity. Must be `≥ QUIET_SECONDS` (default **900** = 15 min, env-overridable). `updated_at` is ISO-8601 `…Z`; `datetime('now')`/`julianday('now')` are UTC, so normalizing `T`/`Z` away before `julianday()` keeps the arithmetic correct without depending on a newer SQLite's `Z` parsing. +- **NULL idle** (no threads at all) ⇒ safe. **Any error / non-numeric / nonzero exit** ⇒ not safe (constraint 3). + +### Failure recovery + +Delegated entirely to `safe_restart_unit` (the extracted, already-proven path): restore the user's DB from the pre-restart backup, roll the global binary back to last-good, `touch /etc/t3-autoupdate.freeze`, log+alert.
The idle job then stops draining (the freeze halts both jobs until a human clears it) — see constraint 1 for why per-user divergence isn't an option. + +### Observability + +- Structured `logger -t t3-migrate-idle` lines; extend the existing `T3AutoUpdate*` Loki ruler/alerts to also match this tag. Success → one line: `migrated t3-serve@wizard → (idle restart; idle 47m)`. Failure → reuses the daily job's freeze+alert. +- **Recommended (optional):** a Pushgateway gauge for **deferral-marker age** + an alert if a marker survives **> 3 days** — passive visibility into "busy every night for 3 days," *not* the auto-escalation/daytime-widening that was explicitly de-scoped. + +### Delivery + +- Wire into `scripts/workstation/setup-devvm.sh` alongside the existing units: + - `install -m 0644 "$SCRIPTS/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh` + - `install -m 0755 "$SCRIPTS/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle` + - add `t3-migrate-idle.service t3-migrate-idle.timer` to the unit-install loop (→ `/etc/systemd/system/`) + - add `t3-migrate-idle.timer` to the `systemctl enable --now` list +- `homelab claim host:devvm --purpose "deploy t3-migrate-idle units"` before the install + enable on the shared devvm. +- No Terraform (hand-managed VM 102). + +## Testing + +- **TDD on the gating core (`bats`)** against fixture `state.sqlite` files: active turn → unsafe; idle-but-recent (< QUIET) → unsafe; idle + quiet → safe; empty DB → safe; locked/garbage DB / sqlite error → unsafe (fail-closed); marker drain: unit started after marker → clear+skip, before → eligible. +- **`T3_DRY_RUN=1`** mode logs `would migrate ` without acting. Roll out in dry-run first; confirm it flags wizard's server at a real overnight idle moment; then enable live. +- **Step-6 extraction is behavior-preserving** — validate the daily job's decisions are unchanged via a dry-run diff before/after the refactor. 
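
The marker-drain case in the first Testing bullet (unit started after the marker: clear and skip; before: eligible) reduces to comparing two epoch timestamps. A minimal sketch, assuming a hypothetical `marker_state` helper that receives the unit's start time and the marker's mtime as integer seconds; it is not a function from this design.

```bash
#!/usr/bin/env bash
# Sketch only: marker_state is a hypothetical helper, not part of the design.
# $1 = unit start time (epoch seconds), $2 = marker mtime (epoch seconds).
# Prints "clear" when the unit was already restarted after the deferral,
# "eligible" when the unit is still the process that was deferred.
marker_state() {
  local started="$1" mwritten="$2"
  if [ "$started" -gt "$mwritten" ]; then
    echo clear
  else
    echo eligible
  fi
}

marker_state 1700000500 1700000000   # restarted 500s after the deferral
marker_state 1700000000 1700000500   # still running its pre-deferral process
```

Keeping this comparison as a pure function means it can be exercised with fixed numbers, without a running systemd unit or a real marker file.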
+ +## Out of scope (YAGNI) + +- Daytime restarts / "around the clock" cadence (de-scoped: overnight only). +- Auto-escalation that widens to a daytime attempt after N stale nights (de-scoped; the optional marker-age alert covers visibility). +- Per-user opt-out file (not needed — the job is self-limiting via markers). +- Any change to how `t3-autoupdate` *installs/gates* a build. + +## Open questions + +None outstanding from the brainstorm. Two items to **verify during implementation** (not blockers): (a) user-facing session resume after a `t3-serve` restart; (b) the devvm's `sqlite3` parses the normalized timestamp as expected (the `replace()` normalization is the safeguard). diff --git a/docs/plans/2026-06-21-t3-idle-migrate-plan.md b/docs/plans/2026-06-21-t3-idle-migrate-plan.md new file mode 100644 index 00000000..ed75e234 --- /dev/null +++ b/docs/plans/2026-06-21-t3-idle-migrate-plan.md @@ -0,0 +1,729 @@ +# t3 idle-migrate Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Add a small idle-gated overnight job that restarts a `t3-serve@` deferred by the daily autoupdate, so a chronically-busy user's server migrates onto the current t3 binary during a real quiet gap instead of staying version-skewed for days. + +**Architecture:** Extract the daily job's per-unit "dangerous" restart routine (backup→restart→verify→recover) into a sourced shared library `t3-safe-restart.sh`; the daily `t3-autoupdate` and a new `t3-migrate-idle` job both call it. The daily job records each deferral as a marker file; the new job drains markers overnight, restarting only when `state.sqlite` shows no in-flight turn and a quiet buffer has elapsed. Self-limiting (only acts on a recorded deferral), fail-closed. 
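
Because the shared routine reports failure through its return status rather than exiting, each caller can decide what a failure means. A minimal sketch of that contract with a stub in place of the real `safe_restart_unit`; the stub, the `FAIL_UNITS` variable, and the `drain` wrapper are all hypothetical, and nothing here touches systemd or sqlite.

```bash
#!/usr/bin/env bash
# Sketch only: a stub stands in for the real safe_restart_unit.
# Assumption: the stub "fails" for any unit named in $FAIL_UNITS.
FAIL_UNITS="${FAIL_UNITS:-}"
safe_restart_unit() {
  case " $FAIL_UNITS " in *" $1 "*) return 1;; esac
  return 0
}

# Hypothetical caller: restart units in order and stop at the first failure,
# reporting how many were migrated before that point.
drain() {
  local unit migrated=0
  for unit in "$@"; do
    if safe_restart_unit "$unit"; then
      migrated=$((migrated+1))
    else
      echo "stopped at $unit (migrated=$migrated)"; return 1
    fi
  done
  echo "done (migrated=$migrated)"
}

drain t3-serve@alice.service t3-serve@bob.service
FAIL_UNITS="t3-serve@bob.service" drain t3-serve@alice.service t3-serve@bob.service t3-serve@carol.service || true
```

The second call stops at the failing unit and never reaches the third one, mirroring the "stop draining once frozen" behaviour.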
+ +**Tech Stack:** bash, systemd timers, sqlite3 (reading t3's `state.sqlite`), the existing `t3-autoupdate` machinery. Deployed via `scripts/workstation/setup-devvm.sh` on the hand-managed devvm (no Terraform). + +**Design:** `docs/plans/2026-06-21-t3-idle-migrate-design.md`. + +--- + +## File structure + +- **Create `scripts/t3-safe-restart.sh`** — sourced library: shared config defaults, `LOG`/`ver`/`osusers`/`ak_for`/`verify_pairing`/`backup_user`/`prebump_of`/`rollback_binary`, and `safe_restart_unit`. One responsibility: the audited per-unit safe restart + its recovery. +- **Modify `scripts/t3-autoupdate.sh`** — source the lib; replace the inline helpers + step-6 body with calls into it; write/clear the deferral marker. Behavior unchanged. +- **Create `scripts/t3-migrate-idle.sh`** — the new job: the idle gate (`gate_query`/`gate_is_safe`/`safe_to_restart`) + the marker-drain loop. Main logic behind a `main`-guard so it's source-safe for tests. +- **Create `scripts/t3-migrate-idle.service`** + **`scripts/t3-migrate-idle.timer`** — oneshot + overnight timer. +- **Create `tests/t3-migrate-idle-gate.test.sh`** — pure-bash TDD for the gate predicates against fixture SQLite DBs (no root, no bats). +- **Modify `scripts/workstation/setup-devvm.sh`** — install + enable the new files. +- **Modify `docs/runbooks/t3-version-bump.md`** + **`.claude/reference/service-catalog.md`** — document the new job. + +**Recovery semantics note (load-bearing):** `safe_restart_unit` is reused verbatim. In the *daily* path a canary failure happens when `last_good < target`, so its `rollback_binary` genuinely reverts the global binary (correct — a bad build is bad for everyone). In the *idle* path `last_good == installed == target` (the build was already accepted), so `rollback_binary` is a **harmless no-op reinstall** — recovery reduces to "restore the failing user's DB + freeze + alert" and does NOT downgrade other users. 
Known rare-tail limitation: if that user's forward migration genuinely fails at idle time (already gated against a copy of their real DB at install), their server may crashloop on the restored DB until a human acts on the freeze+alert. Documented, not hidden. + +--- + +## Task 1: Shared library `t3-safe-restart.sh` + +**Files:** +- Create: `scripts/t3-safe-restart.sh` + +- [ ] **Step 1: Create the library** + +```bash +#!/usr/bin/env bash +# t3-safe-restart.sh — SOURCED library (not executed). Shared by t3-autoupdate.sh +# (daily gated tracker) and t3-migrate-idle.sh (overnight deferral drainer). +# +# Holds the per-unit "dangerous" routine — backup -> restart -> verify pairing -> +# recover (restore DB + roll global binary back to last-good + freeze) — extracted +# verbatim from t3-autoupdate.sh step 6, plus the small helpers it depends on. +# The only change from the inline original: safe_restart_unit RETURNS non-zero on +# failure (after performing recovery+freeze) instead of `exit 1`, so the CALLER +# decides what to do (the daily job exits; the idle job stops draining). +# +# Callers must set, before calling safe_restart_unit: $target (version being moved +# TO, for log lines + the prebump filename) and $last_good (rollback target). +# Set $LOG_TAG before sourcing to tag syslog ("t3-autoupdate" / "t3-migrate-idle"). + +# ---- shared config defaults (override via env before sourcing) ------------------ +: "${LOG_TAG:=t3-safe-restart}" +: "${FREEZE_FILE:=/etc/t3-autoupdate.freeze}" +: "${STATE_DIR:=/var/lib/t3-autoupdate}" +: "${LAST_GOOD_FILE:=$STATE_DIR/last-good}" +: "${DEFER_DIR:=$STATE_DIR/deferred}" +: "${BACKUP_DIR:=/var/backups/t3-state}" +: "${DISPATCH:=127.0.0.1:3780}" +: "${USER_MAP:=/etc/ttyd-user-map}" +: "${T3_BACKUP_TIMEOUT:=900}" + +LOG() { logger -t "$LOG_TAG" "$*"; echo "$LOG_TAG: $*"; } +ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; } +# OS users owning a ~/.t3 (RHS of each non-comment "authentik=os_user" map line). 
+osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; } +# authentik username for an OS user (reverse map; first match) — for dispatch verify. +ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; } + +# Online consistent snapshot of ONE user's state.sqlite (run AS the owner so the +# WAL stays owned; never stops the serve). Uses global $target for the filename. +# Echoes the backup path on success; non-zero on failure. +backup_user() { + local u="$1" src out dst ts + src="/home/$u/.t3/userdata/state.sqlite"; [ -f "$src" ] || return 1 + ts="$(date +%Y%m%d-%H%M%S)" + out="$BACKUP_DIR/$u"; dst="$out/state-prebump-$target-$ts.sqlite" + install -d -o "$u" -g "$u" -m700 "$out" 2>/dev/null || mkdir -p "$out" + if runuser -u "$u" -- timeout "$T3_BACKUP_TIMEOUT" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [ -s "$dst" ]; then + printf '%s\n' "$dst"; return 0 + fi + rm -f "$dst"; return 1 +} + +# newest pre-bump backup for a user taken for the current $target (restore source). +prebump_of() { ls -1t "$BACKUP_DIR/$1/state-prebump-$target-"*.sqlite 2>/dev/null | head -1; } + +# roll the GLOBAL binary back to last-good. In the idle path last_good==installed, +# so this is a harmless no-op reinstall (does NOT downgrade other users). +rollback_binary() { + LOG "rolling back binary $target -> $last_good" + if npm i -g "t3@$last_good" >/dev/null 2>&1; then LOG "rolled back to $last_good"; return 0; fi + LOG "ROLLBACK FAILED — could not reinstall t3@$last_good (t3 may be broken; manual fix per runbook)"; return 1 +} + +# verify a user's pairing through the REAL dispatch (mint -> exchange -> cookie). 
+verify_pairing() { + local u="$1" ak out; ak="$(ak_for "$u")"; [ -n "$ak" ] || { LOG "no authentik mapping for $u — skipping dispatch verify"; return 0; } + out="$(curl -s -i --max-time 10 -H "X-authentik-username: $ak" -H 'Sec-Fetch-Dest: document' "http://$DISPATCH/" 2>/dev/null)" + printf '%s' "$out" | grep -qi '^set-cookie:[[:space:]]*t3_session=' +} + +# safe_restart_unit : restart the unit, verify pairing; on failure +# restore the user's DB from its pre-restart backup, roll the binary back, freeze. +# Assumes a pre-restart backup already exists for at the current $target +# (the daily job's backup_all, or the idle job's backup_user, takes it first). +# Returns 0 on verified success, non-zero after recovery+freeze on failure. +safe_restart_unit() { + local unit="$1" u="$2" ok=0 _ bak + systemctl restart "$unit" || LOG "WARN: systemctl restart $unit returned non-zero" + for _ in $(seq 1 15); do + if verify_pairing "$u"; then ok=1; break; fi + sleep 2 + done + if [ "$ok" = "1" ]; then + LOG "restarted $unit -> $target (pairing verified via dispatch)"; return 0 + fi + LOG "HEALTH-CHECK FAILED: $u pairing broken AFTER restart onto $target — rolling back + restoring its DB" + rollback_binary + bak="$(prebump_of "$u")" + if [ -n "$bak" ]; then + systemctl stop "$unit" 2>/dev/null + if install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite" 2>/dev/null; then + rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm" + LOG "restored $u state.sqlite from $bak" + fi + systemctl start "$unit" 2>/dev/null + fi + touch "$FREEZE_FILE" 2>/dev/null + LOG "FROZEN ($FREEZE_FILE) after $u failed on $target; last_good stays $last_good — investigate, then remove the freeze file to resume" + return 1 +} +``` + +- [ ] **Step 2: Syntax + lint check** + +Run: `bash -n scripts/t3-safe-restart.sh && (command -v shellcheck >/dev/null && shellcheck -x scripts/t3-safe-restart.sh || echo "shellcheck absent — skipped")` +Expected: no syntax 
errors. (shellcheck may warn on the intentional global `$target`/`$last_good` references — acceptable; they are documented caller-set globals.) + +- [ ] **Step 3: Source-and-define smoke test** + +Run: +```bash +bash -c 'LOG_TAG=test; . scripts/t3-safe-restart.sh; for f in LOG ver osusers ak_for backup_user prebump_of rollback_binary verify_pairing safe_restart_unit; do declare -F "$f" >/dev/null || { echo "MISSING $f"; exit 1; }; done; echo "all functions defined"' +``` +Expected: `all functions defined` (sourcing has no side effects — no exit, no output beyond the echo). + +- [ ] **Step 4: Commit** + +```bash +GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false) +git "${GC[@]}" add scripts/t3-safe-restart.sh +git "${GC[@]}" commit -m "t3-safe-restart: extract shared safe-restart library from t3-autoupdate + +Pull the per-unit backup->restart->verify->recover routine (and the small +helpers it needs) out of t3-autoupdate.sh into a sourced library, so a second +job (the upcoming idle migrator) can reuse the exact same audited recovery path +instead of forking safety-critical code. safe_restart_unit returns non-zero on +failure (after recovery+freeze) rather than exiting, so callers control flow. 
+ +Co-Authored-By: Claude Opus 4.8 " +``` + +--- + +## Task 2: Refactor `t3-autoupdate.sh` to use the library + record deferrals + +**Files:** +- Modify: `scripts/t3-autoupdate.sh` (config block 32–42, helpers 44–165, step 6 loop 194–225) + +- [ ] **Step 1: Source the library; drop the now-shared helpers** + +Replace lines 32–52 (the `T3_*` config block through the `newer()` helper) with — keep the autoupdate-only config, source the lib for the shared bits: + +```bash +# ---- autoupdate-specific config (shared config + helpers come from the lib) ----- +T3_TRACK="${T3_TRACK:-nightly}" # npm dist-tag to follow (nightly | latest) +T3_PIN="${T3_PIN:-}" # optional HARD pin to an exact version (disables tracking) +SMOKE_PORT="${T3_SMOKE_PORT:-3799}" +DRY_RUN="${T3_DRY_RUN:-0}" +TMPROOT="${T3_TMPDIR:-/var/tmp}" # health-check scratch on DISK — /tmp is a 2G tmpfs and a populated state.sqlite (~hundreds of MB) overflows it + +LOG_TAG=t3-autoupdate +# shellcheck source=scripts/t3-safe-restart.sh +. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}" + +# is $1 a strictly-newer version than $2 (version-sort)? +newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; } + +mkdir -p "$STATE_DIR" 2>/dev/null || true +``` + +(The lib now provides `FREEZE_FILE`, `STATE_DIR`, `LAST_GOOD_FILE`, `DEFER_DIR`, `BACKUP_DIR`, `DISPATCH`, `USER_MAP`, `LOG`, `ver`, `osusers`, `ak_for`, `verify_pairing`, `prebump_of`, `rollback_binary`, `backup_user`, `safe_restart_unit`.) 
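
The lib's `: "${VAR:=default}"` lines are what allow each caller to set `LOG_TAG` (or any other knob) before sourcing. A minimal self-contained sketch of that pattern, using a throwaway file in place of the real library; `demo-lib.sh`, the `show` helper, and the example paths are assumptions for illustration only.

```bash
#!/usr/bin/env bash
# Sketch only: a three-line stand-in for the shared library.
tmp="$(mktemp -d)"; trap 'rm -rf "$tmp"' EXIT
cat >"$tmp/demo-lib.sh" <<'EOF'
: "${LOG_TAG:=t3-safe-restart}"          # default applies only if the caller set nothing
: "${STATE_DIR:=/var/lib/t3-autoupdate}"
: "${DEFER_DIR:=$STATE_DIR/deferred}"    # derived default follows an overridden STATE_DIR
EOF

# Hypothetical caller: set a tag and state dir first, then source the lib.
# Runs in a subshell so each call starts from a clean slate.
show() {
  ( unset DEFER_DIR; LOG_TAG="$1"; STATE_DIR="$2"; . "$tmp/demo-lib.sh"; echo "$LOG_TAG $DEFER_DIR" )
}
show t3-migrate-idle /tmp/state

# With nothing preset, the library defaults win.
( unset LOG_TAG STATE_DIR DEFER_DIR; . "$tmp/demo-lib.sh"; echo "$LOG_TAG $DEFER_DIR" )
```

The ordering matters: a value assigned after the `.` line would not influence defaults that were already derived from it, such as `DEFER_DIR`.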
+ +- [ ] **Step 2: Simplify `backup_all` to call the shared `backup_user`** + +Replace the `backup_all()` definition (lines 90–105) with: + +```bash +ADMIN_SEED="" +backup_all() { + local u dst + for u in $(osusers); do + if dst="$(backup_user "$u")"; then + LOG "pre-bump backup: $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)" + [ "$u" = "wizard" ] && ADMIN_SEED="$dst" + else + LOG "WARN: pre-bump backup FAILED for $u (/home/$u/.t3/userdata/state.sqlite)" + fi + done + [ -n "$ADMIN_SEED" ] || ADMIN_SEED="$(ls -1t "$BACKUP_DIR"/*/"state-prebump-$target-"*.sqlite 2>/dev/null | head -1)" +} +``` + +Delete the now-duplicated standalone `prebump_of`, `rollback_binary`, and `verify_pairing` definitions (lines 107–108, 146–152, 160–165) — they come from the lib. Keep `health_check` and `unit_busy` (autoupdate-only). + +- [ ] **Step 3: Use `safe_restart_unit` + write/clear the deferral marker in step 6** + +Replace the step-6 loop body (lines 196–225) with: + +```bash +for unit in $(systemctl list-units --type=service --state=running --no-legend 't3-serve@*' 2>/dev/null | awk '{print $1}'); do + u="$(printf '%s' "$unit" | sed -n 's/^t3-serve@\(.*\)\.service$/\1/p')"; [ -n "$u" ] || continue + if unit_busy "$unit"; then + LOG "deferring $unit (active agent) — migrates on its next idle restart" + mkdir -p "$DEFER_DIR" 2>/dev/null && printf '%s\n' "$target" >"$DEFER_DIR/$u" # record for t3-migrate-idle + deferred=$((deferred+1)); continue + fi + if safe_restart_unit "$unit" "$u"; then + restarted=$((restarted+1)) + rm -f "$DEFER_DIR/$u" 2>/dev/null # now current — clear any stale marker + else + exit 1 # frozen by safe_restart_unit — preserve today's behavior + fi +done +``` + +- [ ] **Step 4: Syntax check + behavior-preserving dry-run diff** + +Run: +```bash +bash -n scripts/t3-autoupdate.sh +# Confirm the only remaining defer/restart decisions are unchanged vs HEAD~1 logic: +git -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c 
filter.git-crypt.required=false diff HEAD scripts/t3-autoupdate.sh | grep -E '^\+|^-' | grep -vE 'safe_restart_unit|backup_user|DEFER_DIR|source|\. "|LOG_TAG|^\+\+\+|^---' | head -40 +``` +Expected: no syntax errors; the diff shows only the extraction (calls replacing inline bodies) + the two marker lines — no change to install/health-gate/canary decision logic. + +- [ ] **Step 5: Commit** + +```bash +GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false) +git "${GC[@]}" add scripts/t3-autoupdate.sh +git "${GC[@]}" commit -m "t3-autoupdate: source the shared safe-restart lib + record deferrals + +Behavior-preserving refactor: the per-unit restart/recover body and small helpers +now come from t3-safe-restart.sh (one audited copy). Additionally, when a unit is +deferred for an active agent, write a marker under /var/lib/t3-autoupdate/deferred/ +so the new idle migrator can drain it later; clear the marker on a successful +restart. Install/health-gate/canary logic is unchanged. + +Co-Authored-By: Claude Opus 4.8 " +``` + +--- + +## Task 3: The idle gate (TDD) — `gate_query` + `gate_is_safe` + +**Files:** +- Create: `tests/t3-migrate-idle-gate.test.sh` +- Create (incremental): `scripts/t3-migrate-idle.sh` (gate functions only this task) + +- [ ] **Step 1: Write the failing test** + +Create `tests/t3-migrate-idle-gate.test.sh`: + +```bash +#!/usr/bin/env bash +# Pure-bash unit tests for the t3-migrate-idle gate. No root, no bats, no Docker. +# Sources t3-migrate-idle.sh (main-guarded) with the lib path pointed at the worktree. +set -uo pipefail +HERE="$(cd "$(dirname "$0")/.." && pwd)" # repo root (tests/ is one level down) +export T3_SAFE_RESTART_LIB="$HERE/scripts/t3-safe-restart.sh" +# shellcheck source=/dev/null +. 
"$HERE/scripts/t3-migrate-idle.sh" # defines functions; main-guard prevents the drain from running + +pass=0; fail=0 +ok() { if "$@"; then pass=$((pass+1)); else fail=$((fail+1)); echo "FAIL: $*"; fi; } +notok(){ if "$@"; then fail=$((fail+1)); echo "FAIL (expected non-zero): $*"; else pass=$((pass+1)); fi; } + +# --- gate_is_safe with QUIET_SECONDS=900 --- +QUIET_SECONDS=900 +ok gate_is_safe 0 1000 # idle, quiet long enough -> safe +notok gate_is_safe 1 1000 # a turn in flight -> unsafe +notok gate_is_safe 0 100 # idle but not quiet enough -> unsafe +ok gate_is_safe 0 "" # no threads at all (NULL idle) -> safe +notok gate_is_safe x 1000 # unparseable active -> unsafe +notok gate_is_safe 0 -30 # negative idle (clock skew) -> unsafe + +# --- gate_query against fixture SQLite DBs --- +TMP="$(mktemp -d)"; trap 'rm -rf "$TMP"' EXIT +mkfix() { # mkfix ; reads rows "active_turn_id|updated_at" on stdin + local f="$1"; sqlite3 "$f" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);" + while IFS='|' read -r a u; do sqlite3 "$f" "INSERT INTO projection_thread_sessions VALUES ($([ "$a" = NULL ] && echo NULL || echo "'$a'"), '$u');"; done +} +NOW="$(date -u +%Y-%m-%dT%H:%M:%S.000Z)" +OLD="$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S.000Z)" + +# active turn present -> "1|" +printf '%s\n' "abc|$NOW" "NULL|$OLD" | mkfix "$TMP/active.db" +res="$(gate_query "$TMP/active.db")"; ok test "${res%%|*}" = "1" + +# all idle, last activity 1h ago -> "0|>=3500" +printf '%s\n' "NULL|$OLD" "NULL|$OLD" | mkfix "$TMP/idle.db" +res="$(gate_query "$TMP/idle.db")"; ok test "${res%%|*}" = "0"; ok test "${res##*|}" -ge 3500 + +# empty table -> "0|" (NULL idle) +sqlite3 "$TMP/empty.db" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);" +res="$(gate_query "$TMP/empty.db")"; ok test "${res%%|*}" = "0" + +echo "PASS=$pass FAIL=$fail"; [ "$fail" -eq 0 ] +``` + +- [ ] **Step 2: Run it to verify it fails** + +Run: `bash 
tests/t3-migrate-idle-gate.test.sh` +Expected: FAIL — `scripts/t3-migrate-idle.sh` does not exist yet (source error). + +- [ ] **Step 3: Create `scripts/t3-migrate-idle.sh` with the gate functions + main-guard skeleton** + +```bash +#!/usr/bin/env bash +# t3-migrate-idle.sh — drains t3-autoupdate's deferral markers (via the overnight +# t3-migrate-idle.timer). For each deferred t3-serve@, if nothing is actively +# working in that instance (no in-flight turn + a quiet buffer), restart it onto the +# current binary using the shared safe_restart_unit, then clear the marker. +# Why this exists: t3-autoupdate defers a user with an active agent at its single +# daily window; a user busy every night never migrates and their client shows +# "Client and server versions differ". See docs/plans/2026-06-21-t3-idle-migrate-*. +set -uo pipefail + +LOG_TAG=t3-migrate-idle +# shellcheck source=scripts/t3-safe-restart.sh +. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}" + +QUIET_SECONDS="${T3_MIGRATE_QUIET_SECONDS:-900}" # required idle before a restart (15 min) +DRY_RUN="${T3_DRY_RUN:-0}" + +# pure logic: is it safe given <active_turns> and <idle_seconds>? fail closed. +gate_is_safe() { + local active="$1" idle="$2" + case "$active" in ''|*[!0-9]*) return 1;; esac # unparseable/empty active -> unsafe + [ "$active" -eq 0 ] || return 1 # a turn is running -> unsafe + [ -z "$idle" ] && return 0 # no threads at all -> safe + case "$idle" in ''|*[!0-9-]*) return 1;; esac # non-numeric -> unsafe + [ "$idle" -ge "$QUIET_SECONDS" ] # negative or < quiet -> unsafe +} + +# query a state.sqlite (path or file: URI). Echoes "<active_turns>|<idle_seconds>". +# idle_seconds is empty when there are no rows. Normalizes ISO 'T'/'Z' for julianday.
+gate_query() { + local db="$1" + sqlite3 -batch -noheader -separator '|' "$db" \ + "SELECT + (SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL), + CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT) + FROM projection_thread_sessions;" +} + +# safe_to_restart : wire runuser + the user's DB into gate_query/gate_is_safe. +safe_to_restart() { + local u="$1" db row + db="/home/$u/.t3/userdata/state.sqlite"; [ -f "$db" ] || return 1 + row="$(runuser -u "$u" -- sqlite3 -batch -noheader -separator '|' "file:$db?mode=ro" \ + "SELECT + (SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL), + CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT) + FROM projection_thread_sessions;" 2>/dev/null)" || return 1 + gate_is_safe "${row%%|*}" "${row##*|}" +} + +main() { + : # drain loop added in Task 4 +} + +# main-guard: run only when executed, not when sourced (tests source this file). +if [ "${BASH_SOURCE[0]}" = "${0}" ]; then main "$@"; fi +``` + +- [ ] **Step 4: Run the test to verify it passes** + +Run: `bash tests/t3-migrate-idle-gate.test.sh` +Expected: `PASS=10 FAIL=0` (exit 0). + +- [ ] **Step 5: Commit** + +```bash +GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false) +git "${GC[@]}" add scripts/t3-migrate-idle.sh tests/t3-migrate-idle-gate.test.sh +git "${GC[@]}" commit -m "t3-migrate-idle: idle gate (no in-flight turn + quiet buffer), TDD + +The gate reads t3's state.sqlite: safe to restart only when zero threads have an +active_turn_id AND the most-recent thread activity is older than the quiet buffer +(default 15m). Fail-closed on any parse/query error. Pure-bash unit tests cover +the boundaries against fixture DBs (no root/bats/Docker). 
+ +Co-Authored-By: Claude Opus 4.8 " +``` + +--- + +## Task 4: The marker-drain loop in `t3-migrate-idle.sh` + +**Files:** +- Modify: `scripts/t3-migrate-idle.sh` (replace the `main()` skeleton) + +- [ ] **Step 1: Implement `main()` (the drain loop)** + +Replace the `main() { : ; }` skeleton with: + +```bash +main() { + # a frozen build must not be auto-migrated (shared switch with t3-autoupdate) + if [ -e "$FREEZE_FILE" ]; then LOG "FROZEN: $FREEZE_FILE present — not draining deferrals"; exit 0; fi + [ -d "$DEFER_DIR" ] || exit 0 # nothing deferred + last_good="$(tr -d '[:space:]' <"$LAST_GOOD_FILE" 2>/dev/null)" # rollback target for the helper + + local marker u unit started mwritten migrated=0 skipped=0 + for marker in "$DEFER_DIR"/*; do + [ -e "$marker" ] || continue # empty-dir glob + u="$(basename "$marker")"; unit="t3-serve@$u.service" + if ! systemctl is-active --quiet "$unit"; then + LOG "clearing marker for $u: $unit not active"; rm -f "$marker"; continue + fi + started="$(date -d "$(systemctl show -p ActiveEnterTimestamp --value "$unit" 2>/dev/null)" +%s 2>/dev/null || echo 0)" + mwritten="$(stat -c %Y "$marker" 2>/dev/null || echo 0)" + if [ "$started" -gt "$mwritten" ]; then + LOG "clearing marker for $u: $unit already restarted $((started-mwritten))s after the deferral"; rm -f "$marker"; continue + fi + if ! safe_to_restart "$u"; then skipped=$((skipped+1)); continue; fi + + target="$(tr -d '[:space:]' <"$marker" 2>/dev/null)"; [ -n "$target" ] || target="$(ver)" + if [ "$DRY_RUN" = "1" ]; then LOG "DRY_RUN: would migrate $unit -> $target (idle gate satisfied)"; continue; fi + if ! 
backup_user "$u" >/dev/null; then + LOG "WARN: pre-restart backup failed for $u — skipping (fail closed)"; skipped=$((skipped+1)); continue + fi + if safe_restart_unit "$unit" "$u"; then + LOG "migrated $unit -> $target (idle restart)"; rm -f "$marker"; migrated=$((migrated+1)) + else + LOG "migrate FAILED for $unit — recovery+freeze handled by safe_restart_unit; stopping drain"; exit 1 + fi + done + LOG "idle-migrate pass complete (migrated=$migrated skipped=$skipped)" +} +``` + +- [ ] **Step 2: Re-run the gate tests (regression — main-guard still source-safe)** + +Run: `bash tests/t3-migrate-idle-gate.test.sh` +Expected: `PASS=10 FAIL=0` (sourcing still defines functions without running the loop). + +- [ ] **Step 3: Syntax + lint** + +Run: `bash -n scripts/t3-migrate-idle.sh && (command -v shellcheck >/dev/null && shellcheck -x scripts/t3-migrate-idle.sh || echo "shellcheck absent — skipped")` +Expected: no syntax errors. + +- [ ] **Step 4: Commit** + +```bash +GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false) +git "${GC[@]}" add scripts/t3-migrate-idle.sh +git "${GC[@]}" commit -m "t3-migrate-idle: drain deferral markers when safe + +For each /var/lib/t3-autoupdate/deferred/ marker: skip+clear if the unit is +gone or was already restarted after the deferral; otherwise, when the idle gate is +satisfied, take a pre-restart backup and restart via the shared safe_restart_unit, +clearing the marker on verified success. DRY_RUN logs decisions without acting. 
+ +Co-Authored-By: Claude Opus 4.8 " +``` + +--- + +## Task 5: systemd units + +**Files:** +- Create: `scripts/t3-migrate-idle.service`, `scripts/t3-migrate-idle.timer` + +- [ ] **Step 1: Create the service unit** + +`scripts/t3-migrate-idle.service`: +```ini +[Unit] +Description=t3 idle migrator — restart deferred t3-serve instances onto the current binary when idle +Documentation=https://forgejo.viktorbarzin.me/viktor/infra/src/branch/master/docs/plans/2026-06-21-t3-idle-migrate-design.md +After=network.target t3-dispatch.service + +[Service] +Type=oneshot +ExecStart=/usr/local/bin/t3-migrate-idle +``` + +- [ ] **Step 2: Create the timer unit** + +`scripts/t3-migrate-idle.timer`: +```ini +[Unit] +Description=Overnight drain of t3-autoupdate deferrals (idle-gated t3-serve migration) + +[Timer] +OnCalendar=*-*-* 01..05:00/20 +RandomizedDelaySec=120 +Persistent=false + +[Install] +WantedBy=timers.target +``` + +- [ ] **Step 3: Validate unit syntax** + +Run: `systemd-analyze verify scripts/t3-migrate-idle.service scripts/t3-migrate-idle.timer 2>&1 | grep -v 'Unknown\|Cannot find' || echo "units parse OK"` +Expected: no fatal parse errors (warnings about the `[Install]` of a non-installed unit / missing exec on a non-deployed path are acceptable in the worktree). + +- [ ] **Step 4: Confirm the OnCalendar expands to the intended overnight slots** + +Run: `systemd-analyze calendar '*-*-* 01..05:00/20' --iterations=5` +Expected: next elapses at 01:00/01:20/01:40/02:00/… (every 20 min, hours 01–05). 
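
As a cross-check that does not need systemd, the slots that `01..05:00/20` is expected to produce can be enumerated by hand. The sketch below is illustrative only: `expand_overnight_slots` is a hypothetical helper, not part of the repo, and it hard-codes one reading of the expression (hours 01 through 05, minutes 00, 20 and 40). `systemd-analyze calendar` remains the authority.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): list the HH:MM slots that
# OnCalendar=*-*-* 01..05:00/20 is expected to cover each night.
expand_overnight_slots() {
  local h m
  for h in 1 2 3 4 5; do          # 01..05 -> hours 1 through 5 inclusive
    for m in 0 20 40; do          # 00/20 -> start at :00, step 20, below 60
      printf '%02d:%02d\n' "$h" "$m"
    done
  done
}

expand_overnight_slots
```

Under that reading there are 15 ticks per night, the first at 01:00 and the last at 05:40.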
+ +- [ ] **Step 5: Commit** + +```bash +GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false) +git "${GC[@]}" add scripts/t3-migrate-idle.service scripts/t3-migrate-idle.timer +git "${GC[@]}" commit -m "t3-migrate-idle: systemd oneshot + overnight timer (01:00-05:40, /20) + +Co-Authored-By: Claude Opus 4.8 " +``` + +--- + +## Task 6: Wire into `setup-devvm.sh` + +**Files:** +- Modify: `scripts/workstation/setup-devvm.sh` (9a install ~line 164; 9d unit loop ~line 200; enable ~line 218) + +- [ ] **Step 1: Install the lib + the new script (section 9a)** + +After the `install -m 0755 "$SCRIPTS/t3-autoupdate.sh" /usr/local/bin/t3-autoupdate` line, add: +```bash +install -m 0644 "$SCRIPTS/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh +install -m 0755 "$SCRIPTS/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle +``` + +- [ ] **Step 2: Install the unit files (section 9d loop)** + +Add to the `for u in …` unit list (after the `t3-autoupdate.service t3-autoupdate.timer \` line): +```bash + t3-migrate-idle.service t3-migrate-idle.timer \ +``` + +- [ ] **Step 3: Enable the timer (section 9 enable line)** + +Append `t3-migrate-idle.timer` to the `systemctl enable --now` list: +```bash +systemctl enable --now t3-dispatch.service \ + t3-autoupdate.timer t3-backup-state.timer t3-provision-users.timer t3-migrate-idle.timer >/dev/null 2>&1 || \ + log "WARN: some units failed to enable (check: systemctl status t3-dispatch t3-*.timer)" +``` + +- [ ] **Step 4: Syntax check** + +Run: `bash -n scripts/workstation/setup-devvm.sh` +Expected: no syntax errors. 
+ +- [ ] **Step 5: Commit** + +```bash +GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false) +git "${GC[@]}" add scripts/workstation/setup-devvm.sh +git "${GC[@]}" commit -m "setup-devvm: install + enable t3-migrate-idle (lib, script, units, timer) + +Co-Authored-By: Claude Opus 4.8 " +``` + +--- + +## Task 7: Deploy to the devvm + validate (dry-run first) + +**Files:** none (operational). Presence-claimed, shared-host mutation. + +- [ ] **Step 1: Claim the host** + +Run: `homelab claim host:devvm --purpose "deploy t3-migrate-idle units (idle-gated t3-serve migration)"` +Expected: claim acquired (if already held by another session, defer per CLAUDE.md). + +- [ ] **Step 2: Install the artifacts (mirror setup-devvm.sh 9a/9d)** + +Run: +```bash +W=/home/wizard/code/infra/.worktrees/t3-idle-migrate/scripts +sudo install -m 0644 "$W/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh +sudo install -m 0755 "$W/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle +sudo install -m 0644 "$W/t3-migrate-idle.service" /etc/systemd/system/t3-migrate-idle.service +sudo install -m 0644 "$W/t3-migrate-idle.timer" /etc/systemd/system/t3-migrate-idle.timer +sudo systemctl daemon-reload +``` +Expected: no errors. + +- [ ] **Step 2b: Re-point the live daily job at the installed lib (it now sources it)** + +The deployed `/usr/local/bin/t3-autoupdate` is the OLD inline version until setup-devvm re-runs; install the refactored one so both jobs share the lib: +```bash +sudo install -m 0755 "$W/t3-autoupdate.sh" /usr/local/bin/t3-autoupdate +sudo /usr/local/bin/t3-autoupdate # safe: same-version run exits at "already on nightly; nothing to do" +``` +Expected: log line `already on =; nothing to do` (proves the refactored daily job sources the lib and runs clean). 
+ +- [ ] **Step 3: DRY-RUN the idle migrator against live state** + +Run: `sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle; echo "exit=$?"` +Expected: with wizard currently busy (mid-turn during the day), a `skipped` count — `idle-migrate pass complete (migrated=0 skipped=N)` — and NO restart. (If wizard happens to be idle+quiet, it logs `DRY_RUN: would migrate t3-serve@wizard …` and still does not act.) + +- [ ] **Step 4: Seed a deferral marker for the current skew + dry-run again** + +The live daily job already deferred wizard but the marker mechanism is new, so create it once to represent the existing `.605→.613` debt: +```bash +sudo install -d -m755 /var/lib/t3-autoupdate/deferred +printf '%s\n' "$(t3 --version | awk '{print $NF}' | sed 's/^v//')" | sudo tee /var/lib/t3-autoupdate/deferred/wizard >/dev/null +sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle; echo "exit=$?" +``` +Expected: the pass now considers `wizard` — either `DRY_RUN: would migrate t3-serve@wizard.service -> …613` (if idle) or counted in `skipped` (if mid-turn). Confirms marker drain + gate wiring end-to-end without acting. + +- [ ] **Step 5: Enable the timer (live)** + +Run: `sudo systemctl enable --now t3-migrate-idle.timer && systemctl list-timers t3-migrate-idle.timer --no-pager` +Expected: timer active, next elapse in the 01:00–05:40 window. + +- [ ] **Step 6: Release the claim** + +Run: `homelab release host:devvm` + +> **First live migration** happens overnight at the first idle+quiet tick. Verify next session: `journalctl -u t3-migrate-idle.service --since yesterday | grep -E 'migrated|skipped|DRY|FROZEN'` and `t3 --version` vs the running server's version. (The user-facing resume-after-restart is observed here — design open-question (a).) 
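
The skip-or-migrate decision that the dry runs above exercise can be sketched as a pure function of the two values the gate query returns. This is a minimal illustration, not the repo's `gate_is_safe`: the name `gate_is_safe_sketch` is hypothetical, the 900-second default mirrors the 15-minute quiet buffer, and treating exactly 900 seconds as quiet enough is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the real gate; fail-closed on anything non-numeric.
QUIET_SECONDS="${QUIET_SECONDS:-900}"   # 15 min quiet buffer (assumed default)

gate_is_safe_sketch() {
  local active="$1" idle="$2"
  # Empty or non-integer input (e.g. NULL from an empty table) -> not safe.
  [[ "$active" =~ ^[0-9]+$ ]] || return 1
  [[ "$idle" =~ ^[0-9]+$ ]] || return 1
  [ "$active" -eq 0 ] || return 1                 # a turn is in flight
  [ "$idle" -ge "$QUIET_SECONDS" ] || return 1    # activity too recent
  return 0
}

# Example: split a "count|seconds" row the same way safe_to_restart does.
row='0|1200'
if gate_is_safe_sketch "${row%%|*}" "${row##*|}"; then
  echo "safe"
else
  echo "not safe"
fi
```

Because every failed check returns non-zero, a malformed or empty row is counted as `skipped` and never triggers a restart.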
+ +--- + +## Task 8: Docs + +**Files:** +- Modify: `docs/runbooks/t3-version-bump.md` (add an idle-migrate section) +- Modify: `.claude/reference/service-catalog.md` (add the unit) +- Modify: `docs/plans/2026-06-21-t3-idle-migrate-design.md` (Status → implemented) + +- [ ] **Step 1: Runbook** — add a section after the autoupdate description: + +```markdown +## Idle migrator (`t3-migrate-idle.timer`) + +`t3-autoupdate` defers a user's `t3-serve` restart when they have an active agent +at the daily window, recording `/var/lib/t3-autoupdate/deferred/`. +`t3-migrate-idle` (overnight, every 20 min 01:00–05:40) drains those markers: +it restarts a deferred instance onto the current binary only when that user's +`state.sqlite` shows no in-flight turn (`active_turn_id`) and ≥15 min quiet, via +the shared `safe_restart_unit` (same backup→verify→recover as the daily canary). +- **Force a migration now:** `sudo systemctl start t3-migrate-idle.service` (still idle-gated). +- **Preview without acting:** `sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle`. +- **Stop it:** the shared `/etc/t3-autoupdate.freeze` halts both jobs. +- **Rare-tail failure:** a forward-migration failure at idle restart restores the + user's DB + freezes + alerts (the binary rollback is a no-op since the build was + already accepted); the user's server may crashloop on the restored DB until the + freeze is cleared. Investigate per the rollback section above. +``` + +- [ ] **Step 2: service-catalog** — add a row/line for `t3-migrate-idle.timer` (overnight idle-gated t3-serve migration; sources `t3-safe-restart.sh`). + +- [ ] **Step 3: design doc status** — change the header `Status:` to `implemented 2026-06-21 (commits on wizard/t3-idle-migrate)`. 
+ +- [ ] **Step 4: Commit** + +```bash +GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false) +git "${GC[@]}" add docs/runbooks/t3-version-bump.md .claude/reference/service-catalog.md docs/plans/2026-06-21-t3-idle-migrate-design.md +git "${GC[@]}" commit -m "docs: t3-migrate-idle runbook + service-catalog + design status + +Co-Authored-By: Claude Opus 4.8 " +``` + +--- + +## Task 9: Land + +- [ ] **Step 1: Merge latest master into the branch** + +Run: +```bash +GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false) +git "${GC[@]}" fetch forgejo +git "${GC[@]}" merge --no-edit forgejo/master +``` +Expected: clean merge (no conflicts; the files are new or autoupdate-only). Resolve if any. + +- [ ] **Step 2: Re-run the gate tests post-merge** + +Run: `bash tests/t3-migrate-idle-gate.test.sh` +Expected: `PASS=10 FAIL=0`. + +- [ ] **Step 3: Push to master** + +Run: `git -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false push forgejo HEAD:master` +Expected: accepted. Non-fast-forward → fetch/merge/retry. + +- [ ] **Step 4: Watch CI to completion** + +Run: `homelab ci watch` +Expected: green (infra apply pipeline — this change is scripts/docs only, no Terraform, so apply is a no-op for it). + +- [ ] **Step 5: Clean up the worktree** + +Run (from the main checkout): +```bash +git -C /home/wizard/code/infra worktree remove .worktrees/t3-idle-migrate +git -C /home/wizard/code/infra branch -d wizard/t3-idle-migrate +``` + +--- + +## Self-review + +- **Spec coverage:** marker mechanism (T2,T4) · shared safe-restart lib / approach C (T1) · idle gate active_turn_id+quiet (T3) · overnight timer (T5) · all-users self-limiting via markers (T4 loop) · failure recovery reuse (T1, note) · observability logs (LOG_TAG throughout) · delivery via setup-devvm (T6) · presence-claimed deploy (T7) · TDD on the gate (T3) · dry-run rollout (T7) · docs (T8). 
Optional Pushgateway marker-age gauge from the design is **intentionally deferred** (logged here as a follow-up, not built — keeps scope to the shipping mechanism). +- **Placeholders:** none — every file has complete content; every command has expected output. +- **Type/name consistency:** `safe_restart_unit`, `backup_user`, `prebump_of`, `gate_query`, `gate_is_safe`, `safe_to_restart`, `DEFER_DIR`, `QUIET_SECONDS`, `T3_SAFE_RESTART_LIB`, `LOG_TAG` used identically across tasks. `target`/`last_good` are documented caller-set globals consumed by lib functions. diff --git a/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md b/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md new file mode 100644 index 00000000..664869fa --- /dev/null +++ b/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md @@ -0,0 +1,131 @@ +# 2026-06-22 — devvm memory/IO overload: per-user containment + OOM backstop + +## Impact + +- devvm (VM 102, the shared multi-user Claude Code workstation) became + unresponsive under combined memory + IO pressure and had to be **hard-killed + + rebooted** by the admin on 2026-06-22 (morning). All ssh/tmux + t3 sessions for + wizard/emo/anca lost, in-flight agents killed. +- Signature on the last htop before the kill: **load avg ~60** on 32 vCPU, **RAM + 22.5/23.5G**, **swap 13.9/14.0G (full)**, a wall of **D-state** (uninterruptible + IO-wait) processes, and a single `ugrep` in emo's tmux holding **~10G RES / + 64% CPU**. Many `claude --effort max/xhigh` sessions + playwright-chrome MCP + instances across three users on top. + +## This is the "crawl" class, not the QEMU-stall class + +The 2026-06-11 post-mortem (`2026-06-11-devvm-qemu-io-stall.md`) fixed a +*different* failure mode — a QEMU-userspace block-path wedge on the legacy LSI +controller. That fix shipped (verified 2026-06-22: the guest now boots on +`virtio_scsi`, `scsihw: virtio-scsi-single + iothread`). 
Its post-mortem +explicitly deferred **this** class: + +> The recurring *crawl* class (agent storms → swap-thrash; journald +> watchdog-killed 3× on 2026-06-10) is a separate failure mode — ssh/tmux +> sessions remain memory-uncontained by **explicit decision (swap-only, +> 2026-06-10)**. + +That explicit decision is the root cause closed here. + +## Root cause + +Work on the devvm lives in **two independent cgroup-v2 trees per user**, and only +one was capped: + +| Tree | cgroup | Cap before today | +|---|---|---| +| t3 web sessions | `system.slice/system-t3\x2dserve.slice/t3-serve@` | `MemoryHigh=12G MemoryMax=16G MemorySwapMax=0 OOMPolicy=continue` ✓ | +| **ssh/tmux sessions** | `user.slice/user-.slice` | **`MemoryMax=infinity`, swap unlimited** ✗ | + +The uncapped `user-.slice` was the hole. A runaway there (the 10G `ugrep`; +stacked max-effort agents) grew unbounded, spilled into the **14G disk swap**, and +swap-thrashed the **host-mbps-throttled (60/60 MB/s) virtual disk**. That is the +overload chain: + +``` +uncapped tmux growth → disk-swap thrash on a throttled spindle + → IO storm (D-state pileup) → load ~60 → box unresponsive → hard kill +``` + +i.e. **memory pressure becomes the IO storm**. There was also **no global OOM +backstop** (no systemd-oomd / earlyoom) to shed the worst offender before the +kernel OOM or the thrash-wedge. And even the existing t3 caps don't sum safely +(3 users × 16G = 48G > 32G RAM) — nothing reasoned about the *whole box*. + +## Fix (`setup-devvm.sh` §10, applied live 2026-06-22) + +Design decisions (interviewed with the admin via `/grill-me`): **soft-generous +per-user caps + a hard ceiling + a kill-the-worst backstop**, maximising +single-user utilisation while making a box-wide wedge impossible. (The backstop +was first built on systemd-oomd, then switched to earlyoom mid-rollout when oomd +proved inert with `swap=0` — see Verification + Lessons.) 
+ +| Layer | What | +|---|---| +| **Per-user caps, BOTH trees** | `user-.slice.d` drop-in gives every `user-.slice` the same `MemoryHigh=12G / MemoryMax=16G / MemorySwapMax=0` the t3 tree already had. A user is now bounded in whichever surface they work in. | +| **No disk swap for work** | `MemorySwapMax=0` on every work cgroup → a spike OOMs **locally** at the ceiling instead of thrashing the throttled disk. Kills the IO-storm-via-swap mechanism at the source. The 14G swapfile stays for system cold pages only. | +| **earlyoom backstop (free-RAM threshold)** | New package — used **instead of systemd-oomd** (which is inert with `swap=0`; see Lessons). Watches `MemAvailable%` and SIGTERMs the biggest task at **5%**, SIGKILL at **3%**, swap ignored (`-s 100`). `--avoid` keeps sshd/systemd/dockerd/containerd/t3-dispatch/tmux off the victim list (**the admin's way in always survives**); `--prefer` targets the agent/browser hogs (python3/node/chrome/…). Swap-independent and reliable, where oomd's pressure-kill was not. | +| **Fair-share CPU/IO** | `CPUWeight`/`IOWeight` per slice (system.slice 200, users + docker 100 each). Work-conserving — a lone user still gets all 32 cores + the full IO budget when others idle; weights only bite under contention. No hard CPU/IO caps. | +| **Docker containment** | Containers previously landed in `system.slice` — uncapped. Now `cgroup-parent: docker.slice` in `daemon.json` routes every container into a capped (`MemoryMax=8G`, swap 0) slice, so a runaway container is cgroup-OOM'd locally instead of escaping into the uncapped `system.slice`. | + +Durable in `setup-devvm.sh` (survives a VM rebuild); `earlyoom` added to +`packages.txt`. The numbers are tunable — `MemoryHigh=12G` will throttle a *lone* +heavy user between 12–16G even with RAM free; bump to 16/20 if that bites. 
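
For orientation, the per-user cap described in the table could be expressed as a slice drop-in along these lines. This is a sketch, not the contents of `setup-devvm.sh` §10: the helper name `render_user_slice_dropin` is hypothetical, the `[Slice]` section placement follows standard systemd resource-control usage, and only the memory values and the per-user weight of 100 come from the table above.

```bash
#!/usr/bin/env bash
# Hypothetical renderer: prints a drop-in body with the caps from the table.
render_user_slice_dropin() {
  local high="${1:-12G}" max="${2:-16G}"
  cat <<EOF
[Slice]
MemoryHigh=${high}
MemoryMax=${max}
MemorySwapMax=0
CPUWeight=100
IOWeight=100
EOF
}

render_user_slice_dropin            # defaults: 12G soft cap, 16G hard cap
```

Calling it as `render_user_slice_dropin 16G 20G` would produce the looser variant suggested above for the case where the soft cap throttles a lone heavy user.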
+ +## Verification (live, 2026-06-22) + +- **Caps live on running cgroups**: all three `user-.slice` report + `memory.high=12G memory.max=16G memory.swap.max=0`; `docker.slice` `memory.max=8G`; + daemon.json kept buildkit/nvidia/insecure-registries; paperless-mcp recovered + under `docker.slice`. +- **Stress test A (hard cap)** — the PRIMARY guard: a 2G-capped, swap=0 balloon was + killed at exactly 2G by the cgroup-local OOM (`constraint=CONSTRAINT_MEMCG`) with + **swap flat at 0MB throughout** — no thrash. Same mechanism protects every user + slice (16G) and `docker.slice` (8G). +- **Soft cap observed**: a balloon pushed past `MemoryHigh` sat at ~220M / 99% + memory.pressure, throttled to a crawl, making no progress and harming nothing — + a runaway is throttled, not just killed. +- **systemd-oomd disproven, then dropped**: a self-policed balloon held + `memory.pressure full avg10 = 96–99%` (≫ its 20% limit) for >70s but oomd never + killed it — `Pgscan: 0`. oomd's pressure-kill only acts on cgroups doing active + reclaim, which a `swap=0` anon workload never does. oomd was purged. +- **earlyoom backstop** — verified via `--dryrun`: at the threshold it logs + `low memory! … mem 90% swap 100%` (fires on RAM alone, swap ignored) and selects + `SIGTERM … "chrome"` (a `--prefer` hog), never an `--avoid`'d daemon. Live + earlyoom v1.7 confirms `SIGTERM mem<=5% / SIGKILL mem<=3%, swap<=100%`. + +## Out of scope / follow-ups + +- **Alerting** (tracked, fast-follow bead): `DevvmDown` (closes the 90-min + detection gap the 2026-06-11 PM flagged), sustained-memory-PSI/swap pressure + early-warning, and an "earlyoom-killed-something" alert (earlyoom logs each kill; + `-N /script` can push a metric). devvm node-exporter is already scraped + (`job=devvm`, `10.0.10.10:9100`), so only alert *rules* are new (a + monitoring-stack Terraform change). +- **zram cushion**: considered, deferred. 
Could let work cgroups absorb spikes in + compressed RAM instead of OOMing at the ceiling; not needed for the wedge fix. +- **Per-user docker isolation**: containers share one `docker.slice` budget, not + per-user. Fine for current usage (krr + short-lived tools). +- **Host-side IO**: the 60/60 mbps cap + the shared `sdc` HDD IO domain are + host-level (bead `code-oflt`); unchanged here. + +## Lessons + +- **"Swap as the safety valve" is an IO-storm amplifier on a throttled disk.** + Leaving ssh/tmux memory-uncontained (the 2026-06-10 decision) traded a clean + local OOM for a box-wide swap-thrash wedge. `MemorySwapMax=0` + a hard cap turns + the failure back into a contained, local kill. +- **Cap the box, not one surface.** t3 sessions were capped for months while the + same user's tmux was unbounded — and the caps that existed didn't sum to < RAM. + Containment has to reason about every tree and the aggregate. +- **A backstop must protect the operator's way in.** earlyoom `--avoid`s + sshd/systemd/dockerd/containerd/t3-dispatch/tmux, so the box always stays + reachable to recover; only the agent/browser hogs are eligible victims. +- **systemd-oomd is the wrong backstop for a no-swap box — verify, don't assume.** + oomd's memory-pressure killer only fires on cgroups doing active reclaim + (`pgscan` rising). With `MemorySwapMax=0` + anonymous memory there is nothing to + reclaim, so a cgroup sat at 99% `memory.pressure` indefinitely and oomd never + acted (proven with `oomctl` + a balloon). The very `swap=0` that kills the IO + storm also neuters oomd. earlyoom (free-RAM threshold, swap-independent) is the + correct pairing. A famous tool that "does OOM" still has to be proven to fire + under *your* configuration. 
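
The aggregate check that the second lesson calls for is simple arithmetic. The sketch below is illustrative: `check_cap_sum` is a hypothetical helper, and the figures in the example call are the ones quoted in the root-cause section (three 16G caps against 32G of RAM).

```bash
#!/usr/bin/env bash
# Hypothetical helper: sum hard caps (in G) and compare against RAM (in G).
check_cap_sum() {
  local ram="$1" total=0 cap
  shift
  for cap in "$@"; do
    total=$((total + cap))
  done
  if [ "$total" -gt "$ram" ]; then
    echo "oversubscribed: caps=${total}G ram=${ram}G"
    return 1
  fi
  echo "fits: caps=${total}G ram=${ram}G"
}

check_cap_sum 32 16 16 16 || true   # prints: oversubscribed: caps=48G ram=32G
```

Oversubscription is not necessarily wrong, since the chosen design keeps per-user caps generous and relies on the backstop, but computing the sum makes that trade-off explicit.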
diff --git a/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md new file mode 100644 index 00000000..e6b11816 --- /dev/null +++ b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md @@ -0,0 +1,97 @@ +# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24) + +> Filename kept for inbound links. The originally-suspected cause (kubeadm-config +> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC +> drift was a real *separate* latent bug fixed in the same change. + +**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached +the master control-plane phase for the first time — preflight passed, etcd +snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the +kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute +static-pod-hash window across all internal retries, then auto-rolled-back to +v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but +the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**. +No data loss; no user-facing outage (the master carries control-plane taints, so +no workloads were displaced). + +**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the +first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane +static pods, i.e. the first time the upgrade pushes real write-IO at etcd. + +## Root cause — etcd IO starvation on the shared HDD + +The new kube-apiserver could not establish/keep a working connection to etcd +during the upgrade because **etcd was IO-starved**. 
etcd's surviving container log +from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows: + +- **1,180** `apply request took too long` warnings in 16 minutes; +- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms), + clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying + to bring the new apiserver up. + +A reproduced 1.35.6 apiserver with no etcd dies with +`F instance.go:233 Error creating leases: error creating storage factory: context +deadline exceeded` — the same failure mode a multi-second etcd produces. etcd +lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on +shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto +that spindle: + +1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected); +2. kubeadm dumping a full **~400MB etcd DB backup** to + `/etc/kubernetes/tmp/kubeadm-backup-etcd-/` (on the same HDD) before the + etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never + cleans them up), pushing master root fs to **73%**, above the 70% kubelet + image-GC threshold, so image GC churned during the drain too; +3. master-drain pod evictions. + +### Correction — it was NOT the OIDC flag swap + +`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps +`--authentication-config` (structured multi-issuer OIDC) back to legacy +single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That +was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with +those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly +(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test +etcd. So the auth swap does **not** crash the apiserver; it was a red herring for +the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full +were also ruled out. 
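
The warning count quoted in the root-cause section can be reproduced from a saved copy of the etcd container log. A minimal sketch, assuming the log has been copied to a local file; `count_slow_applies` is a hypothetical helper and the sample lines are synthetic, not taken from the incident.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count etcd slow-apply warnings in a log file.
count_slow_applies() {
  local log="$1"
  # grep -c exits 1 when the count is 0, so mask that to keep callers simple.
  grep -c 'apply request took too long' "$log" || true
}

# Synthetic demo log (not real incident data).
demo="$(mktemp)"
printf '%s\n' \
  'W | apply request took too long | took=1.0s' \
  'I | compaction finished' \
  'W | apply request took too long | took=2.0s' >"$demo"
count_slow_applies "$demo"   # prints 2
rm -f "$demo"
```

Capturing this number before forming a hypothesis fits the first lesson below: the failing component's own log is the primary evidence.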
+ +## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift + +apiserver auth is configured in three places that must agree: +(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes` ++ `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest +(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM — +which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates +the manifest from (3), so it would have reverted structured auth → **dashboard + +kubectl SSO break after a successful upgrade** (recoverable: the chain's +post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash. + +## Resolution + +1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%. +2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps. +3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run). + +## Prevention (landed in this change) + +| Gap | Fix | +|-----|-----| +| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. 
| +| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. | +| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. | + +## Lessons + +- **Capture the failing component's own logs before concluding.** The `kubeadm + upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second + applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is + "what config changes," not "why it crashed." +- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm + 2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB + backup copy + drain) onto that spindle. code-oflt is the real fix. +- **Tools that leave per-operation scratch must be reaped.** kubeadm's + `/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never + GC'd; 28GB had silently accumulated. +- **Out-of-band control-plane edits must be written back to kubeadm-config** — else + `kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags). diff --git a/docs/runbooks/claude-auth-renew-workstation.md b/docs/runbooks/claude-auth-renew-workstation.md new file mode 100644 index 00000000..f5ce6625 --- /dev/null +++ b/docs/runbooks/claude-auth-renew-workstation.md @@ -0,0 +1,95 @@ +# Workstation Claude authentication renewal + +## Scope + +Every roster user authenticates Claude Code with their own Enterprise identity. 
+Credentials are never shared between OS users. Claude refreshes its normal OAuth +access token; `claude-auth-sync@.timer` verifies that refresh using real +inference every six hours and backs up only the `claudeAiOauth` object to: + +```text +secret/workstation/claude-users/ +``` + +The user's unrelated `mcpOAuth` credentials never leave their home directory. +Each renewal service has a distinct 32-day periodic Vault token, mode `0600`, at +`~/.config/claude-auth-sync/vault-token`. Its policy can access only that user's +path. The service renews the Vault token on every run. + +## Normal lifecycle + +1. Add the user to `scripts/workstation/roster.yaml` and apply the Vault stack. +2. Run `scripts/workstation/setup-devvm.sh` as root with the admin Vault token. + Its foreground provisioner mints the isolated periodic token and enables the + user's timer. Routine hourly provisioning never needs an admin token. +3. The user completes one initial Enterprise login: + + ```bash + claude auth login --claudeai --sso --email + ``` + +4. Start the first sync immediately instead of waiting for the timer: + + ```bash + systemctl start claude-auth-sync@.service + systemctl status claude-auth-sync@.service + ``` + +Success writes no secrets to the journal. The user's private log records `OK` in +`~/.local/state/claude-auth-sync/sync.log`; journald receives the same status with +`identifier=claude-auth-sync` for Loki alerting. + +## Automatic recovery + +`claude auth status` is not a sufficient health check: it can report logged in +while inference returns HTTP 401. The service therefore runs a minimal Haiku +inference with no session persistence. On failure it: + +1. reads the user's latest OAuth object from Vault; +2. atomically merges it into `.credentials.json`, preserving MCP OAuth state; +3. retries inference once; +4. stores the newly refreshed OAuth object back in Vault on success. 
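
The recovery steps above can be sketched as a small control-flow function. This is an illustration under stated assumptions, not the service's code: `sync_with_recovery` and the three hook names are hypothetical, and the real inference probe, Vault read and credential merge are replaced by caller-supplied commands.

```bash
#!/usr/bin/env bash
# Hypothetical control-flow sketch. The three hooks are caller-supplied
# commands; none of these names come from the real service.
#   probe   - succeeds when a minimal inference call works
#   restore - pulls the latest OAuth object from Vault and merges it locally
#   store   - writes the current OAuth object back to Vault
sync_with_recovery() {
  local probe="$1" restore="$2" store="$3"
  if ! "$probe"; then
    # Recovery path: restore once, retry once; no cycling through old versions.
    "$restore" && "$probe" || { echo "FAIL"; return 1; }
  fi
  "$store" || { echo "FAIL"; return 1; }
  echo "OK"
}

# Demo with stubs: the first probe fails, the retry after restore succeeds.
attempts=0
probe_stub()   { attempts=$((attempts + 1)); [ "$attempts" -ge 2 ]; }
restore_stub() { return 0; }
store_stub()   { return 0; }
sync_with_recovery probe_stub restore_stub store_stub   # prints OK
```

Keeping the retry to a single attempt means a revoked refresh token surfaces quickly as `FAIL` and is not hidden behind repeated restores.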
+ +Vault KV version history remains available for audit, but the service deliberately +does not cycle through old refresh tokens: providers commonly invalidate rotated +refresh tokens, so replaying old versions can make recovery less deterministic. + +## Recovery requiring a person + +If both local state and the latest Vault copy fail, the refresh token was revoked, +invalidated, or the Enterprise session requires reauthorization. Run the login as +the affected OS user, then rerun the service: + +```bash +claude auth login --claudeai --sso --email +systemctl start claude-auth-sync@$(id -un).service +``` + +If the scoped Vault token expired or drift protection rejected it, rerun the root +provisioner with an admin Vault token after confirming the matching policy exists: + +```bash +export VAULT_ADDR=https://vault.viktorbarzin.me +export VAULT_TOKEN="$(cat /home/wizard/.vault-token)" +sudo --preserve-env=VAULT_ADDR,VAULT_TOKEN /usr/local/bin/t3-provision-users +``` + +Never copy another user's `.credentials.json` or scoped Vault token. Never restore +the old shared `CLAUDE_CODE_OAUTH_TOKEN`; environment credentials outrank per-user +login and would silently collapse all users onto one identity. + +## Verification + +```bash +systemctl list-timers 'claude-auth-sync@*' +systemctl status claude-auth-sync@.service +journalctl -t claude-auth-sync --since today +``` + +Inspect Vault metadata, not secret values: + +```bash +vault kv metadata get secret/workstation/claude-users/ +``` + +Alert `WorkstationClaudeAuthInvalid` fires when any renewal agent logs `FAIL`. 
diff --git a/docs/runbooks/forgejo-open-signups.md b/docs/runbooks/forgejo-open-signups.md new file mode 100644 index 00000000..5a00d15a --- /dev/null +++ b/docs/runbooks/forgejo-open-signups.md @@ -0,0 +1,168 @@ +# Runbook: Forgejo open self-service signups + +Last updated: 2026-06-19 + +`forgejo.viktorbarzin.me` allows **open native self-registration** (anyone can +create a local Forgejo account from the web form), gated against bots by two +layers: + +1. **Cloudflare Turnstile** captcha on the registration form. +2. **Mandatory email confirmation** — a new account stays inactive until the + user clicks an activation link emailed to the address they registered with. + +One external login source also works alongside local accounts: the pre-existing +**Sign in with GitHub** OAuth2 login (the **Authentik OAuth2 source is now DISABLED**; see the GitHub +section below). Opening local signups was additive — it did not touch SSO. + +Most of this is Terraform-managed in `stacks/forgejo/`. The one exception is the +OAuth2 login *sources* (Authentik, GitHub), which live in Forgejo's own DB and +are added via `forgejo admin auth` — there is no clean Terraform resource for +them (their secrets are mirrored to Vault for recovery). + +## What is configured (and where) + +All on the `kubernetes_deployment.forgejo` container env in +`stacks/forgejo/main.tf` (Forgejo reads `app.ini` keys from `FORGEJO__
__` +env vars): + +| Setting | Value | Effect | +|---|---|---| +| `service.DISABLE_REGISTRATION` | `false` | Registration is enabled | +| `service.ALLOW_ONLY_EXTERNAL_REGISTRATION` | `false` | Native local sign-up allowed (was `true` = OAuth-only) | +| `service.ENABLE_CAPTCHA` | `true` | Captcha required on the signup form | +| `service.CAPTCHA_TYPE` | `cfturnstile` | Cloudflare Turnstile | +| `service.CF_TURNSTILE_SITEKEY` | widget id | Public; rendered in the page | +| `service.CF_TURNSTILE_SECRET` | from `forgejo-turnstile` Secret | Server-side verification | +| `service.REGISTER_EMAIL_CONFIRM` | `true` | Account inactive until email is confirmed | +| `mailer.*` | see below | Sends the activation email | +| `oauth2_client.ENABLE_AUTO_REGISTRATION` | `true` | First GitHub (OAuth2) sign-in auto-creates the account | + +Captcha guards **registration only** — `REQUIRE_CAPTCHA_FOR_LOGIN` is left at the +default `false`, so existing users are not captcha'd on every login. + +## Cloudflare Turnstile widget — `turnstile.tf` + +- The widget is a Terraform resource: `cloudflare_turnstile_widget.forgejo_signup` + (mode `managed`, domain `forgejo.viktorbarzin.me`), created with the CF Global + API Key already wired in `cloudflare_provider.tf`. The account id is resolved + via `data.cloudflare_accounts`. +- `.id` is the **public sitekey** (passed as a plain env value). `.secret` is the + **secret key**, stored in the `forgejo-turnstile` K8s Secret and injected via + `secret_key_ref`. The secret also lives in TF state (Tier-1 PG, encrypted at + rest) — same trust level as the CF API key already in state. +- Forgejo is **non-proxied** (direct A record to Traefik), but Turnstile is a + client-side JS widget served from `challenges.cloudflare.com`, so proxy status + is irrelevant — the widget works regardless. + +**Rotate the widget secret** (e.g. 
if it leaks): +``` +cd stacks/forgejo && vault login -method=oidc +../../scripts/tg apply --non-interactive -replace=cloudflare_turnstile_widget.forgejo_signup +``` +This mints a new sitekey+secret, updates the `forgejo-turnstile` Secret, and (via +the Reloader annotation) rolls the Forgejo pod. Verify the new sitekey appears in +the `/user/sign_up` HTML afterwards. + +## Mailer — `email-secret.tf` + `[mailer]` env + +- Forgejo sends as **`noreply@viktorbarzin.me`** via **`mail.viktorbarzin.me:587`** + with `PROTOCOL=smtp+starttls`. This reuses the same mailserver SASL account + Authentik uses (`stacks/authentik/email-secret.tf`) — one credential, one + rotation point. +- **The host MUST be `mail.viktorbarzin.me`, not `mailserver.mailserver.svc`**: + the mailserver serves the `*.viktorbarzin.me` wildcard cert, which does not + cover the `.svc` DNS name, so STARTTLS cert verification would fail. + `mail.viktorbarzin.me` resolves in-cluster (→ `10.0.20.1`) and matches the cert. +- The password is synced from Vault `secret/authentik` → `smtp_password` by the + `forgejo-email` ExternalSecret (ESO `ClusterSecretStore vault-kv`) into the + `forgejo-email` K8s Secret (key `PASSWD`), referenced by `FORGEJO__mailer__PASSWD`. +- The deployment carries `reloader.stakater.com/auto: "true"`, so a rotation of + either secret rolls the pod automatically. + +## GitHub sign-in (OAuth2 source) + +People can **sign up / sign in with GitHub** — the active Forgejo OAuth2 source. GitHub sign-up is **zero-click** (auto-registration creates the account on first login). + +> **Authentik is DISABLED on purpose** (2026-06-19). `ENABLE_AUTO_REGISTRATION` is GLOBAL across OAuth sources, and Authentik's `preferred_username` claim is the user's **email** — invalid as a Forgejo username, which 500'd auto-create. Viktor's Forgejo email (`me@viktorbarzin.me`) does not match his Authentik email (`vbarzin@gmail.com`), so account-linking can't bridge it. 
Per his directive GitHub was prioritised; the Authentik source was deactivated via `UPDATE login_source SET is_active=0 WHERE name='Authentik'` in the forgejo MySQL DB. **Re-enable** with `is_active=1` after fixing Authentik's username claim. + +- **Source** (Forgejo DB, *not* Terraform — added via CLI, same as Authentik): + ``` + forgejo admin auth add-oauth --name github --provider github --key --secret + ``` + The source **name must stay `github`** — it is part of the callback URL + (`/user/oauth2/github/callback`) registered on the GitHub side, so renaming it + breaks the callback. `forgejo admin auth list` shows it (ID 2). +- **GitHub OAuth App**: a classic OAuth App under the ViktorBarzin GitHub account + (Settings → Developer settings → OAuth Apps). Homepage + `https://forgejo.viktorbarzin.me`, callback + `https://forgejo.viktorbarzin.me/user/oauth2/github/callback`. GitHub has **no + API to create OAuth Apps** — creating it is a browser-only step. +- **Credentials**: Vault `secret/viktor` → `forgejo_github_oauth_client_id` / + `forgejo_github_oauth_client_secret` (kept for recovery; the live values are in + Forgejo's DB). +- **Auto-registration**: `FORGEJO__oauth2_client__ENABLE_AUTO_REGISTRATION=true` + (`main.tf`) makes a first GitHub login create the account directly. The GitHub + identity is the trust gate for this path — the Turnstile captcha + email + confirmation only apply to the **native** signup form, not OAuth. + +**Rotate the GitHub client secret** — generate a new one in the GitHub OAuth App, then: +``` +vault kv patch secret/viktor forgejo_github_oauth_client_secret= +POD=$(kubectl -n forgejo get pod -l app=forgejo -o jsonpath='{.items[0].metadata.name}') +kubectl -n forgejo exec "$POD" -- su-exec git forgejo admin auth update-oauth --id 2 --secret +``` +(Source id from `forgejo admin auth list`.) 
+ +**Recreate after a Forgejo DB loss**: the source is not in Terraform, so after a +from-scratch restore, re-run the `add-oauth` command above with the Vault creds. + +## Re-closing / tightening signups + +Edit `stacks/forgejo/main.tf` and `scripts/tg apply` (or commit + push — CI +applies): + +- **OAuth-only again** (revert this change): set + `FORGEJO__service__ALLOW_ONLY_EXTERNAL_REGISTRATION` back to `"true"`. +- **No new accounts at all** (admins create them): set + `FORGEJO__service__DISABLE_REGISTRATION` to `"true"`. +- **Require admin approval per signup** (strongest, instead of email confirm): + set `REGISTER_MANUAL_CONFIRM=true` **and** `REGISTER_EMAIL_CONFIRM=false` + (Forgejo makes the two mutually exclusive). New accounts then queue under Site + Administration → Identity & Access → Accounts until an admin activates them. + +## Handling spam / abuse accounts + +A signup that clears Turnstile + email confirmation is still a real, low-privilege +Forgejo user. To deal with abuse: +- **Ban/delete** via Site Administration → Identity & Access → Accounts, or + `forgejo admin user delete --username ` inside the pod + (`kubectl -n forgejo exec deploy/forgejo -- forgejo admin user ...`). +- New users get Forgejo defaults (they can create repos/orgs). If abuse warrants, + tighten with `[service].DEFAULT_ALLOW_CREATE_ORGANIZATION=false` and/or + `[repository].MAX_CREATION_LIMIT` (add as env vars; out of scope for the initial + open-signups change). + +## Operational notes + +- The Forgejo deployment is **single-replica with `Recreate` strategy**, so any + config apply briefly restarts the pod (git remote + OCI registry unavailable for + a few seconds). Expected, not an incident. +- The signup page is **not** behind Cloudflare's bot-fight (Forgejo is + non-proxied) — Turnstile + email confirmation are the bot gate. CrowdSec + + Traefik rate limiting still front the host. 
+ +## Verify it's working + +``` +POD=$(kubectl -n forgejo get pod -l app=forgejo -o jsonpath='{.items[0].metadata.name}') +# Env present: +kubectl -n forgejo exec "$POD" -- env | grep -E 'ALLOW_ONLY_EXTERNAL|ENABLE_CAPTCHA|CAPTCHA_TYPE|CF_TURNSTILE_SITEKEY|REGISTER_EMAIL_CONFIRM|mailer__ENABLED' +# Turnstile widget rendered on the form: +kubectl -n forgejo exec "$POD" -- wget -qO- http://localhost:3000/user/sign_up | grep -oE 'cf-turnstile|data-sitekey="[^"]*"' +# Secrets healthy: +kubectl -n forgejo get externalsecret forgejo-email +kubectl -n forgejo get secret forgejo-email forgejo-turnstile +``` +A full real-world check is to register a throwaway account and confirm the +activation email arrives. The mailer transport (server/port/cert/cred) is shared +with Authentik, which is already in production for external user enrollment. diff --git a/docs/runbooks/k8s-version-upgrade.md b/docs/runbooks/k8s-version-upgrade.md index 5439a498..021c588f 100644 --- a/docs/runbooks/k8s-version-upgrade.md +++ b/docs/runbooks/k8s-version-upgrade.md @@ -2,9 +2,9 @@ ## Overview -Kubernetes component versions (`kubeadm`/`kubelet`/`kubectl`) on the 5 K8s -VMs are upgraded automatically by a weekly detection CronJob that seeds a -chain of small phase Jobs. Each Job is **pinned to a node that is NOT its +Kubernetes component versions (`kubeadm`/`kubelet`/`kubectl`) on the 7 K8s +nodes (k8s-master + k8s-node1..6) are upgraded automatically by a nightly +detection CronJob that seeds a chain of small phase Jobs. Each Job is **pinned to a node that is NOT its drain target** — so no pod in the chain can preempt itself. 
The chain (23:00 UTC nightly): @@ -36,14 +36,17 @@ envsubst on /template/job-template.yaml | kubectl apply -f - ▼ Job 0 — preflight (pinned: k8s-node1) + ├── compat-gate: addon/API/containerd support for target (else BLOCK+alert) ├── All nodes Ready + no Mem/Disk pressure ├── halt-on-alert (kured-style ignore-list) ├── 24h-quiet baseline (no Ready transitions <24h ago) - ├── kubeadm upgrade plan matches target + ├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume) + ├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? → Slack WARN (recoverable; not a block) + ├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups) ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s) ├── Trigger backup-etcd Job, wait, verify snapshot byte count ├── SSH master: containerd skew fix (if master < workers) - ├── SSH all 5 nodes: apt repo URL rewrite (only kind=minor) + ├── SSH all 7 nodes: apt repo URL rewrite (only kind=minor) └── spawn_next → k8s-upgrade-master- ▼ @@ -87,6 +90,59 @@ Job 6 — postflight (no pinning) **adding a node needs no change** — the chain upgrades every worker still off-target, then runs postflight. SSH uses node InternalIPs (no DNS needed). 
+### Auto-upgrade compat gate + +The chain now attempts **patch AND minor** upgrades autonomously — but before any +mutation, `phase_preflight` runs `compat-gate.py` **FIRST** and **REFUSES (blocks) +the upgrade** if any of these hold for the detected target: + +- a **critical addon's running version doesn't support the target k8s minor** + (running version > the addon's highest-supported minor in the compat matrix), +- an **in-use deprecated API is removed at/before the target** — measured live + from `apiserver_requested_deprecated_apis` (something is still calling a + group/version that the target k8s drops), or +- a **node's containerd is below the target's floor** (the minimum containerd the + target k8s requires). + +The addon check is **scoped to minor jumps**: a target **at or below the running +k8s minor** (a patch) crosses into no new minor, so the running cluster is itself +proof the installed addons work there — `compat-gate.py` skips the addon ceilings +when `target_minor <= running_minor`. (Without this a conservative ceiling such as +ESO 0.12 → 1.31 would false-block a 1.34.x **patch** on a cluster already running +1.34 — fixed 2026-06-20.) The deprecated-API and containerd checks are naturally +inert for a patch (no API removal or containerd floor occurs inside a minor). + +This is the **"auto-upgrade when we can, halt + alert when we can't"** contract. + +**On a block**, the gate: +- pushes `k8s_upgrade_blocked=1` to Pushgateway (→ the `K8sUpgradeBlocked` + Prometheus alert), +- Slacks the **specific reasons** (which addon/API/node, current vs required), and +- **halts the chain** — it exits **non-fatal** (the upgrade simply isn't safe yet, + this is not a failure). Because the block happens **before any mutation, no + rollback is involved**; nothing was changed. 
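The addon part of the gate described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `compat-gate.py`: the function names are hypothetical, the matrix is assumed to be a flat `addon -> "major.minor"` map, and the block condition is read as "target minor is above the addon's highest supported minor".

```python
def parse_minor(version):
    """Turn 'v1.34.9' or '1.34' into the comparable tuple (1, 34)."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def addon_blockers(running, target, ceilings):
    """Return human-readable block reasons; an empty list means 'proceed'.

    ceilings: assumed shape of addon-compat.json, e.g. {"some-addon": "1.31"}.
    """
    running_minor = parse_minor(running)
    target_minor = parse_minor(target)

    # Patch (or same-minor) target: the running cluster already proves the
    # installed addons work on this minor, so the ceilings are skipped.
    if target_minor <= running_minor:
        return []

    reasons = []
    for addon, ceiling in sorted(ceilings.items()):
        # Block only when the target crosses above the addon's ceiling.
        if target_minor > parse_minor(ceiling):
            reasons.append(f"{addon}: supports up to {ceiling}, target is {target}")
    return reasons
```

With a ceiling of `1.31` for a hypothetical `external-secrets` entry, a `1.34.8` to `1.34.9` patch yields no blockers, while a `1.34.9` to `1.35.0` minor jump reports that addon as the reason to halt.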
+ +**To clear a block**: upgrade the named addon (or migrate the API caller off the +deprecated group/version, or bump containerd on the named node) so the offending +condition no longer holds. The **next nightly run then proceeds automatically** — +no manual chain restart needed. + +The **compat matrix** lives in +`stacks/k8s-version-upgrade/scripts/addon-compat.json` — a map of `addon → highest +supported k8s minor`, populated from each addon's own compatibility docs. **Keep +it current**; the gate reads it on every run. Gate logic: +`stacks/k8s-version-upgrade/scripts/compat-gate.py`. + +> **Both** detector probes against `pkgs.k8s.io` follow the 302 redirect via `-L`: +> the next-minor *availability* probe (`HEAD .../v/deb/Release`) **and** +> the next-minor *patch* probe (`GET .../v/deb/Packages`, which resolves +> the exact `X.Y.Z`). The Packages probe lacked `-L` until 2026-06-20 — `pkgs.k8s.io` +> 302-redirects every request, so without it curl returned an empty body, +> `NEXT_MINOR_PATCH` came back empty, and the detector silently fell through to +> "No upgrade needed". That is why the **2026-06-19 nightly run no-op'd** instead of +> resolving the 1.35 target. With both probes on `-L`, **minor versions are detected** +> and gated behind the compat check above before the chain acts on them. + ## Components ### Shared resources (one-time, Terraform-managed) @@ -117,8 +173,26 @@ Pushed by upgrade-step.sh during phase execution; observed by the - **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout. - **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently. - **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor. 
-- **`K8sUpgradeChainJobFailed`** — `kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). -- All four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade. +- **`K8sUpgradeChainJobFailed`** — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). The `unless k8s_upgrade_blocked == 1` clause (added 2026-06-21) excludes a preflight that failed because the **compat gate deliberately refused** the target — that's owned by `K8sUpgradeBlocked` and was double-firing here; a genuine wedge exits without setting the blocked gauge, so it still fires. +- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). A k8s **auto-upgrade was refused** by the compat gate because a critical addon, an in-use deprecated API, or a node's containerd is too old for the detected target. 
The **specific reasons are in Slack**; clear it by upgrading the named addon / migrating the API caller / bumping containerd, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert. +- The first four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade. + +### Nightly upgrade report (Slack) + +CronJob `k8s-upgrade-nightly-report` (k8s-upgrade ns, `var.report_schedule`, +default `7 6 * * *` = 06:07 UTC — after the 23:00 chain, before the 08:00 London +alert-digest) posts ONE Slack summary each morning of the previous night's run: +running version, detector freshness, detected target + kind, the outcome +(⚪ no upgrade needed / 🔴 blocked + live blocker reasons / 🟢 upgraded / +🟡 in progress / ⚠️ detector stale), and recent chain jobs. Read-only — it reads +the Pushgateway gauges + live nodes/jobs and re-runs `compat-gate.py` for fresh +blocker reasons; reuses the chain's SA + `slack_webhook` + scripts ConfigMap. +Logic + unit tests: `scripts/nightly-report.py`, `scripts/test_nightly_report.py`. +This is the day-to-day visibility layer (it does NOT replace the alerts above — +those fire on problems; this reports the outcome every night). Manual run: +`kubectl -n k8s-upgrade create job --from=cronjob/k8s-upgrade-nightly-report nightly-report-test` +(name it WITHOUT a `k8s-upgrade-{phase}-` prefix so a failure can't trip +`K8sUpgradeChainJobFailed`). 
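The naming advice above follows from the `job_name` matcher in `K8sUpgradeChainJobFailed`. A small sketch of that matcher in isolation, assuming only the regex quoted in the alert; the job names used below are made-up examples, and the alert's namespace and `reason` matchers are not modelled here.

```python
import re

# Regex copied from the alert's job_name matcher.
CHAIN_JOB_RE = r"k8s-upgrade-(preflight|master|worker|postflight)-.*"


def trips_chain_alert(job_name):
    """True if job_name would be selected by the alert's job_name matcher."""
    # Prometheus anchors =~ matchers at both ends, so fullmatch is the
    # closest Python equivalent.
    return re.fullmatch(CHAIN_JOB_RE, job_name) is not None
```

Under this check, a manual job named `nightly-report-test` is not selected, and neither is a CronJob-spawned name that starts with `k8s-upgrade-nightly-report-`, because `nightly` is not one of the four phase names.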
### CoreDNS is NOT upgraded by kubeadm here @@ -150,27 +224,54 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names ## Common Operations -### Post-upgrade: restore apiserver OIDC (REQUIRED after any control-plane bump) +### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24) `kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml` -and drops the `--authentication-config` flag**, silently disabling apiserver -OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get -401). This is not auto-detected (the `rbac` stack's `null_resource` trigger is a -content hash that doesn't change). After any control-plane upgrade, re-apply: +from kubeadm-config**. apiserver auth uses a structured multi-issuer +`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to +still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade +reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does +NOT crash on this — verified by isolated repro; it's recoverable via the restore +script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue — +etcd IO starvation**, not this drift; post-mortem: +`docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`. + +**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now +**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting +`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of +its remote script. So kubeadm regenerates a **correct** manifest and the apiserver +upgrades with a pure image bump — `kubeadm upgrade diff ` shows only the +image change. Zero live impact (the CM is read only during an upgrade). 
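The extraArgs rewrite described above (drop `--oidc-*`, add `--authentication-config`) can be sketched as a pure function. This is a hedged illustration rather than the logic in `apiserver-oidc.tf`: it assumes `apiServer.extraArgs` is in map form with flag names stored without the leading dashes, and `reconcile_extra_args` is a hypothetical name. The auth-config path is the one this runbook names for the provisioner.

```python
# Path taken from this runbook's description of the provisioner.
AUTH_CONFIG_PATH = "/etc/kubernetes/pki/auth-config.yaml"


def reconcile_extra_args(extra_args):
    """Return a new extraArgs map without legacy oidc-* flags.

    Assumes map-form extraArgs keyed by flag name without leading dashes.
    """
    # Drop every legacy single-issuer flag; keep all unrelated flags.
    reconciled = {k: v for k, v in extra_args.items() if not k.startswith("oidc-")}
    # Point the apiserver at the structured multi-issuer config instead.
    reconciled["authentication-config"] = AUTH_CONFIG_PATH
    return reconciled
```

The function is idempotent, which mirrors why re-running the reconciliation from both the `rbac` apply and the post-upgrade restore should be harmless: a second pass over an already reconciled map returns the same map.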
+ +**Backstops:** +- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does + NOT block — the drift only breaks SSO, which is recoverable) if + `--authentication-config` would still be dropped. +- The `rbac` stack still publishes its restore script to the + `kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on + master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with + auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also* + re-reconciles kubeadm-config. Self-skips when master is already at target. + +**Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the +chain logged `WARN: --authentication-config absent after re-apply`: ```bash cd stacks/rbac TF_VAR_ssh_private_key="$(cat ~/.ssh/id_ed25519)" \ VAULT_ADDR=https://vault.viktorbarzin.me ../../scripts/tg apply \ - --non-interactive -target=module.rbac.null_resource.apiserver_oidc_config + --non-interactive -target=module.rbac.null_resource.apiserver_oidc_config \ + -replace=module.rbac.null_resource.apiserver_oidc_config ``` -(`ssh_private_key` must be a key authorized for `wizard@`; it is not yet -wired from Vault.) The provisioner re-writes `/etc/kubernetes/pki/auth-config.yaml` -(both `kubernetes` + `k8s-dashboard` issuers), re-adds the flag, and -health-gates `/livez` with auto-rollback. Verify: `curl -sk -https://localhost:6443/livez` on the master = `ok`, and the apiserver manifest -contains `--authentication-config`. See `docs/plans/2026-06-04-k8s-dashboard-sso-design.md`. +(`-replace` is **required** — the `null_resource` trigger is a content hash that +doesn't change, so a plain `-target` apply is a no-op. `ssh_private_key` must be a +key authorized for `wizard@`.) The provisioner re-writes +`/etc/kubernetes/pki/auth-config.yaml` (both `kubernetes` + `k8s-dashboard` +issuers), re-adds the flag, and health-gates `/livez` with auto-rollback. 
Verify: +`curl -sk https://localhost:6443/livez` on the master = `ok`, and the apiserver +manifest contains `--authentication-config`. See +`docs/plans/2026-06-04-k8s-dashboard-sso-design.md`. ### Verify the pipeline is healthy ```bash @@ -356,6 +457,13 @@ kill %1 ## Past Incidents +### 2026-06-18 — Preflight gate-4 wedged a partial (master-ahead) chain +- A prior 1.34.9 run upgraded k8s-master + k8s-node1, then stopped; node2-6 stayed on 1.34.8. +- Every nightly preflight then aborted at the **kubeadm-plan-target gate**: `kubeadm upgrade plan` runs on k8s-master, already on 1.34.9, so it emitted no `kubeadm upgrade apply vX.Y.Z` line → empty `plan_target` → `'' != '1.34.9'` → `exit 1`. Deterministic, not transient (gates 1-3 all green; no critical alert was firing). The failed preflight self-cleaned each night (2026-06-17 retry-on-failure) but re-failed identically. +- The two `in_flight`-based alerts stayed blind (preflight aborts pre-metric); `K8sUpgradeChainJobFailed` (warning) surfaced it. +- **Collateral**: the earlier master bump had also dropped apiserver `--authentication-config` (SSO broke); restored separately via the `rbac` stack's `apiserver_oidc_config`. +- **Mitigation**: `phase_preflight` now **skips the kubeadm-plan-target gate when k8s-master is already on TARGET_VERSION** (mirrors the at-target self-skip already in `phase_master`/`phase_worker`). Remaining workers are validated by their own phases; the detector's apt-cache probe already confirmed the target is installable. + ### 2026-05-11 — Self-preemption (agent → Job-chain rewrite) - The v1 agent ran inside the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was scheduled to k8s-node4. - During Stage 6 (first worker drain) the agent ran `kubectl drain k8s-node4` — evicting itself. 
@@ -369,6 +477,8 @@ kill %1 |------|-------| | Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `infra/stacks/k8s-version-upgrade/main.tf` | | Universal phase body | `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh` | +| Compat gate (addon/API/containerd block logic) | `infra/stacks/k8s-version-upgrade/scripts/compat-gate.py` | +| Compat matrix (addon → highest supported k8s minor) | `infra/stacks/k8s-version-upgrade/scripts/addon-compat.json` | | Job template | `infra/stacks/k8s-version-upgrade/job-template.yaml` | | Per-node upgrade script | `infra/scripts/update_k8s.sh` | | Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") | diff --git a/docs/runbooks/t3-version-bump.md b/docs/runbooks/t3-version-bump.md index a16d65bf..cf8359e5 100644 --- a/docs/runbooks/t3-version-bump.md +++ b/docs/runbooks/t3-version-bump.md @@ -37,6 +37,19 @@ logs every outcome (`paired user=.. endpoint=.. fallback=..`, plus `mint/pairing `T3AutoUpdateRolledBack` / `T3AutoUpdateRollbackFailed` / `T3AutoUpdateFrozen` → Alertmanager → Slack. +## Idle migrator — draining deferrals (`scripts/t3-migrate-idle.sh`) + +Step 5 DEFERS any instance with an active agent, recording `/var/lib/t3-autoupdate/deferred/` (= the target version). Without a drainer, a user busy at every 04:00 window never migrates and their client shows *"Client and server versions differ"* for days. `t3-migrate-idle.timer` (overnight, every 20 min 01:00–05:40) drains those markers: + +- Per marker: skip + clear if the unit is gone or was already restarted *after* the deferral; otherwise restart the still-stale `t3-serve@` onto the current binary **only when that user is idle** — `state.sqlite` shows zero `active_turn_id` (no in-flight turn) AND ≥ `T3_MIGRATE_QUIET_SECONDS` (default 900 = 15 min) since the last thread activity — then verify pairing and clear the marker. **Fail-closed:** any query/parse doubt → skip, retry next tick. 
+- It restarts via the SAME `safe_restart_unit` the daily canary uses (sourced `t3-safe-restart.sh`: backup → restart → verify → recover). The shared `/etc/t3-autoupdate.freeze` halts it too. +- **Force / preview:** + ```bash + sudo systemctl start t3-migrate-idle.service # run a drain pass now (still idle-gated) + sudo env T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle # log decisions, act on nothing + ``` +- **Rare-tail failure:** if a deferred user's forward migration fails at idle restart (already gated against a copy of their real DB at install), `safe_restart_unit` restores their DB + freezes + alerts. The binary rollback is a no-op (the build was already accepted, so other users are unaffected), but that user's serve may crashloop on the restored DB until the freeze is cleared and the build investigated (manual rollback below). + ## Operations **Freeze / revert (stop tracking right now — the fast "make it stop"):** diff --git a/docs/runbooks/tripit-external-signup.md b/docs/runbooks/tripit-external-signup.md deleted file mode 100644 index 0172c9b1..00000000 --- a/docs/runbooks/tripit-external-signup.md +++ /dev/null @@ -1,226 +0,0 @@ -# Runbook — TripIt external user self-signup (email + passkey) - -Implements ADR-0020 (tripit repo): people outside the homelab self-register to -TripIt with **email + a passkey** (no password), are auto-tagged into the -**`TripIt External`** Authentik group, and are fenced to `tripit.viktorbarzin.me` -only. Audience: people Viktor knows; open public registration. - -> **Safety model.** Containment is two-layered. (1) **Forward-auth apps** — the -> branch in `stacks/authentik/admin-services-restriction.tf` admits `TripIt -> External` to `tripit.viktorbarzin.me` and denies every other `auth="required"` -> host. (2) **OIDC apps** — the branch does NOT cover OIDC (it bypasses -> forward-auth); External users are contained because every sensitive OIDC app -> already requires a trusted group they do not hold (audit below). 
The no-lockout -> guarantee is that the group is created **empty**, so the new branch matches -> zero existing users on day one. - -## OIDC app authorization audit (2026-06-15, read-only) - -A parentless `TripIt External` user holds NONE of these groups, so: - -| OIDC app | Requires | External user | -|---|---|---| -| Immich, Grafana, Linkwarden, Cloudflare Access | `Home Server Admins` | DENIED ✓ | -| Forgejo | `Task Submitters` / `Forgejo Users` | DENIED ✓ | -| Headscale | `Headscale Users` | DENIED ✓ | -| wrongmove | `Wrongmove Users` | DENIED ✓ | -| **Vault** | **was OPEN** → bound to `Allow Login Users` in Step 3 | DENIED after Step 3 | -| Kubernetes, Kubernetes Dashboard | OPEN | harmless — apiserver rejects OIDC tokens (idle) | -| TripIt App, Public | OPEN | by design (TripIt's own provider / guest) | - -Vault's JWT `default` role grants only Vault's built-in `default` policy (token -self-management, cubbyhole — **no** secret access), so the pre-fix exposure was a -near-powerless token; Step 3 closes it anyway. - ---- - -## Pre-flight gates (STOP if any fails) - -1. **`TripIt External` is net-new / empty** (no-lockout precondition): - ``` - kubectl -n authentik exec -i deploy/goauthentik-server -- ak shell <<'PY' - from authentik.core.models import Group - g = Group.objects.filter(name="TripIt External").first() - print("exists:", bool(g), "members:", g.users.count() if g else 0) - PY - ``` - Expect `exists: False`. If it exists with members → STOP. -2. **Authentik image pin matches live (B5)** — the policy edit auto-applies the - whole `authentik` stack; a stale pin re-triggers the 2026-06-10 downgrade - boot-storm: - ``` - kubectl -n authentik get deploy -o custom-columns=N:.metadata.name,IMG:.spec.template.spec.containers[0].image - ``` - Every `goauthentik`/`ak-outpost` image tag MUST equal - `stacks/authentik/modules/authentik/values.yaml` `global.image.tag` - (currently `2026.2.4`). If they differ → refresh the pin first. 
- ---- - -## Step 1 — Terraform (group + fence branch) - -Already written on this branch: -- `stacks/authentik/tripit-external.tf` — the empty, parentless group. -- `stacks/authentik/admin-services-restriction.tf` — the prepended fence branch. - -**Local plan gate (B4 — CI auto-applies on push with `-auto-approve`, so there is -NO human plan review in the apply path; do it here):** -``` -vault login -method=oidc -cd stacks/authentik && ../../scripts/tg plan -``` -Confirm the plan is **exactly**: -- `+ authentik_group.tripit_external` (create) -- `~ authentik_policy_expression.admin_services_restriction` (update in place — the - `expression` body gains ONLY the new branch; every other line byte-identical) -- **`Plan: 1 to add, 1 to change, 0 to destroy.`** - -ABORT if the plan shows any destroy/replace, any `authentik_provider_*` / -`authentik_outpost` / `authentik_flow*` / `helm_release`, or any other expression -change. - -**Apply** (presence-claim courtesy, then push = apply; land human-watched, B5): -``` -~/code/scripts/presence claim stack:authentik --purpose "ADR-0020 TripIt External group + fence branch" -# push the branch to master (this triggers CI tg apply on the authentik stack) -``` -Watch: GHA → Woodpecker `default.yml` apply → outpost stays healthy -(`kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost` = 2 -IPs; an anonymous request to any `auth=required` host still 302s to Authentik). -The branch is inert (empty group) so no access changes yet. - ---- - -## Step 2 — Authentik SMTP (B1, BLOCKER before any flow) - -Email verification is the **entire identity boundary** (TripIt trusts the -Authentik email verbatim). Authentik currently has the **default/unconfigured** -transport (`email.host = localhost`), so verification/recovery mail cannot send. 
- -Add to **both** `server.env` and `worker.env` in -`stacks/authentik/modules/authentik/values.yaml` (wire the password from a secret; -the cluster mailserver is what TripIt already relays through — -`mailserver.mailserver.svc`): -```yaml - - { name: AUTHENTIK_EMAIL__HOST, value: "mailserver.mailserver.svc" } - - { name: AUTHENTIK_EMAIL__PORT, value: "587" } - - { name: AUTHENTIK_EMAIL__USE_TLS, value: "true" } - - { name: AUTHENTIK_EMAIL__FROM, value: "noreply@viktorbarzin.me" } - - { name: AUTHENTIK_EMAIL__USERNAME, value: "" } # confirm relay creds - - { name: AUTHENTIK_EMAIL__PASSWORD, valueFrom: { secretKeyRef: { name: , key: } } } -``` -**Gate:** after apply, Authentik UI → System → Settings (or an Email stage) → -**Send test email**; it must arrive. Then prove enrollment cannot complete for an -address you do NOT control. - ---- - -## Step 3 — Bind Vault → `Allow Login Users` (close the one open OIDC gap) - -Authentik UI → Applications → **Vault** → bind an authorization policy requiring -group **`Allow Login Users`** (the base group every real homelab user inherits; -parentless `TripIt External` is excluded). This changes nothing for existing -users and denies External users at the Vault consent step. -Verify: an External test account (Step 6) cannot complete Vault OIDC login. - ---- - -## Step 4 — Build the flows (Authentik UI; UI-managed per ADR split) - -All three flows: designation as noted, no password stage. 
- -**Flow `tripit-enrollment`** (Enrollment): -| Order | Stage | Key settings | -|---|---|---| -| 5 | Captcha | reCAPTCHA **v2 checkbox** keys (v3/invisible fail — see `crowdsec-recaptcha-key-type`) | -| 10 | Identification | email only; **no** `password_stage`; `sources` optional | -| 20 | Email (verification) | activate, blocking — **before** user_write | -| 30 | WebAuthn authenticator setup | `user_verification = required`, `resident_key = required` | -| 40 | User Write | **`create_users_group` = `TripIt External`** (the keystone tag); `user_type = external` | -| 50 | User Login | session as default (`weeks=4`) | - -**Flow `tripit-login`** (Authentication, passwordless): -Identification (sets `enrollment_flow`/`recovery_flow`) → Authenticator -Validation (`device_classes = [webauthn]`, `user_verification = required`) → User -Login. Prefer routing a passkey-less email to recovery over minting a credential. - -**Flow `tripit-recovery`** (Recovery): -Identification (`pretend_user_exists = on`) → Email (recovery link) → WebAuthn -authenticator setup → User Login. Notify the account on recovery + new-passkey. - -> Do **NOT** bind the `brute-force-protection` ReputationPolicy to these flows — -> it denies anonymous users (2026-04-06 regression). The Captcha is the bot gate. - ---- - -## Step 5 — Surface "Sign up" - -Recommended: a **TripIt-scoped** signup link / share-invite rather than a global -login-screen button (narrower bot surface). Enrollment URL: -`https://authentik.viktorbarzin.me/if/flow/tripit-enrollment/`. - ---- - -## Step 6 — Verification (before/after — "all access keeps working") - -Hosts for the matrix (must be real `auth="required"` default-allow hosts, NOT -`auth="app"` apps like immich/nextcloud which bypass the catch-all): -`tripit`, `family`, `hackmd`, `health` (default-allow) + `terminal` (admin-only). 
- -**Before** (capture per user, no redirect-follow; 200=ALLOW, 302→authentik/403=DENY): -``` -COOKIE='authentik_session='; for H in tripit family hackmd health terminal; do - printf '%-10s %s\n' "$H" "$(curl -s -o /dev/null -w '%{http_code}' --max-redirs 0 -H "Cookie: $COOKIE" https://$H.viktorbarzin.me/)"; done -``` -Representative non-admin: `kadir.tugan@gmail.com` (Wrongmove-only) → tripit/family/hackmd/health ALLOW, terminal DENY. Admin `vbarzin@gmail.com` → all ALLOW. - -**After Step 1 apply — regression:** re-run identically; both users' results MUST -be unchanged (diff empty). - -**After flows — external smoke test (the security proof):** enrol a throwaway -account via the enrollment URL (email verify + passkey). Confirm it is tagged -`TripIt External`, then with its cookie: -``` -for H in tripit family hackmd health terminal frigate; do printf '%-10s %s\n' "$H" \ - "$(curl -s -o /dev/null -w '%{http_code}' --max-redirs 0 -H "Cookie: authentik_session=" https://$H.viktorbarzin.me/)"; done -``` -Expect **tripit=200, every other host DENY** (family/hackmd/health were ALLOW for -kadir — the contrast is the fence proof). Then: -- **OIDC containment:** with the external account, attempt OIDC login to Vault, - Immich, Forgejo, Grafana → each must be DENIED at the app's own login. -- **Auto-provision:** the TripIt `users` row exists (CNPG primary in ns `dbaas`: - `select id,email from tripit.users where email=''`). -- **Walling-off guard** `AuthentikWallingOffPublicPath` stays green. - -**Any 200 on a non-tripit host, or any OIDC app admitting the external account → -ROLLBACK.** - ---- - -## Step 7 — Standing regression probe (recommended) - -Add a permanent `TripIt External` identity to the `blackbox-exporter` guard -(`stacks/monitoring/.../authentik_walloff_probe.tf` pattern): assert 200 on -`tripit.viktorbarzin.me` AND DENY on `family.viktorbarzin.me`. 
This converts the -"branch stays first" and "user_write keeps the keystone tag" invariants into -automated `#security` alerts. - ---- - -## Rollback - -Revert the `admin-services-restriction.tf` expression (delete the branch) and push -(= apply); removing a prepended `if g: return …` is behaviour-preserving on -non-members, restoring prior authz. Disable/delete the throwaway external account -(with the branch gone, a tagged account falls into default-allow). The empty group -may stay (harmless). Plan-gate the revert too. - -## Operational invariants - -- `TripIt External` stays **parentless** (never under `Allow Login Users`). -- The fence branch stays **first** in `admin-services-restriction`. -- **Never** co-assign `TripIt External` to a trusted/internal user. -- The `tripit-enrollment` user_write **`create_users_group`** setting is the - keystone — re-verify after any flow edit (clearing it makes UNtagged accounts - that fall into default-allow). -- Authentik SMTP is a live dependency of enrollment + recovery. diff --git a/modules/create-template-vm/cloud_init.yaml b/modules/create-template-vm/cloud_init.yaml index 1e4fcafa..11a86b6e 100644 --- a/modules/create-template-vm/cloud_init.yaml +++ b/modules/create-template-vm/cloud_init.yaml @@ -8,6 +8,13 @@ users: sudo: ALL=(ALL) NOPASSWD:ALL ssh_authorized_keys: - ${authorized_ssh_key} + # k8s-upgrade pipeline key (matches Vault secret/k8s-upgrade/ssh_key_pub). + # The automated k8s-version-upgrade chain SSHes in as `wizard` to drain + + # upgrade each node; WITHOUT this a freshly-provisioned node is invisible + # to the upgrade pipeline (node4/5/6 hit exactly this — Permission denied — + # 2026-06-17). Hardcoded: it's a public key and the keypair is stable; if + # it's ever rotated, update this line and Vault together. 
+ - ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIElH9x76UNA8UNxrxTjREYz4hz1fbCdRwAXbOkJ5FnSM k8s-upgrade-pipeline passwd: ${passwd} lock_passwd: false # enable passwd login shell: /bin/bash diff --git a/modules/kubernetes/ingress_factory/main.tf b/modules/kubernetes/ingress_factory/main.tf index 0f239fb4..fc9bc9f5 100644 --- a/modules/kubernetes/ingress_factory/main.tf +++ b/modules/kubernetes/ingress_factory/main.tf @@ -107,10 +107,6 @@ variable "custom_content_security_policy" { type = string default = null } -variable "exclude_crowdsec" { - type = bool - default = false -} variable "full_host" { type = string default = null @@ -310,7 +306,6 @@ resource "kubernetes_ingress_v1" "proxied-ingress" { "traefik-error-pages@kubernetescrd", var.skip_default_rate_limit ? null : "traefik-rate-limit@kubernetescrd", var.custom_content_security_policy == null ? "traefik-csp-headers@kubernetescrd" : null, - var.exclude_crowdsec ? null : "traefik-crowdsec@kubernetescrd", local.effective_anti_ai ? "traefik-ai-bot-block@kubernetescrd" : null, local.effective_anti_ai ? "traefik-anti-ai-headers@kubernetescrd" : null, local.auth_middleware, diff --git a/scripts/claude-auth-sync@.service b/scripts/claude-auth-sync@.service new file mode 100644 index 00000000..3750f295 --- /dev/null +++ b/scripts/claude-auth-sync@.service @@ -0,0 +1,20 @@ +[Unit] +Description=Validate and back up Claude OAuth credentials for %i +Documentation=https://github.com/ViktorBarzin/infra/blob/master/docs/runbooks/claude-auth-renew-workstation.md +Wants=network-online.target +After=network-online.target + +[Service] +Type=oneshot +User=%i +Group=%i +Environment=HOME=/home/%i +Environment=PATH=/usr/local/bin:/usr/bin:/bin:/home/%i/.local/bin +ExecStart=/usr/local/bin/claude-auth-sync + +# Credential and Vault access are required; keep the remaining host surface narrow. 
+NoNewPrivileges=true +PrivateTmp=true +ProtectSystem=strict +ProtectHome=read-only +ReadWritePaths=-/home/%i/.claude -/home/%i/.claude.json -/home/%i/.config/claude-auth-sync -/home/%i/.local/state/claude-auth-sync diff --git a/scripts/claude-auth-sync@.timer b/scripts/claude-auth-sync@.timer new file mode 100644 index 00000000..b25f2ecd --- /dev/null +++ b/scripts/claude-auth-sync@.timer @@ -0,0 +1,12 @@ +[Unit] +Description=Keep Claude OAuth credentials valid and recoverable for %i + +[Timer] +OnBootSec=10m +OnUnitActiveSec=6h +Persistent=true +RandomizedDelaySec=10m +Unit=claude-auth-sync@%i.service + +[Install] +WantedBy=timers.target diff --git a/scripts/t3-autoupdate.sh b/scripts/t3-autoupdate.sh index bdd26b87..a3928211 100644 --- a/scripts/t3-autoupdate.sh +++ b/scripts/t3-autoupdate.sh @@ -21,7 +21,7 @@ # - canary rollout: restart idle instances ONE AT A TIME, verifying pairing # through the real dispatch after each, and roll back (binary + that user's DB) # + self-freeze on the first failure — active-agent instances are deferred, -# never killed; +# never killed (deferred instances are recorded for t3-migrate-idle to drain); # - rollback target is the recorded LAST-GOOD build, not "whatever was installed". # Detection backstop (real-user pairing failure/fallback) lives in the dispatch # logs + Loki alerts (T3PairingBroken / T3PairFallbackHigh / T3AutoUpdate*). @@ -29,24 +29,17 @@ # Full procedure + manual rollback: docs/runbooks/t3-version-bump.md. 
set -uo pipefail +# ---- autoupdate-specific config (shared config + helpers come from the lib) ----- T3_TRACK="${T3_TRACK:-nightly}" # npm dist-tag to follow (nightly | latest) T3_PIN="${T3_PIN:-}" # optional HARD pin to an exact version (disables tracking) -FREEZE_FILE="${T3_FREEZE_FILE:-/etc/t3-autoupdate.freeze}" -STATE_DIR="${T3_STATE_DIR:-/var/lib/t3-autoupdate}" -LAST_GOOD_FILE="$STATE_DIR/last-good" -BACKUP_DIR="${T3_BACKUP_DEST:-/var/backups/t3-state}" SMOKE_PORT="${T3_SMOKE_PORT:-3799}" -DISPATCH="${T3_DISPATCH:-127.0.0.1:3780}" -USER_MAP="${T3_USER_MAP:-/etc/ttyd-user-map}" DRY_RUN="${T3_DRY_RUN:-0}" TMPROOT="${T3_TMPDIR:-/var/tmp}" # health-check scratch on DISK — /tmp is a 2G tmpfs and a populated state.sqlite (~hundreds of MB) overflows it -LOG() { logger -t t3-autoupdate "$*"; echo "t3-autoupdate: $*"; } -ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; } -# OS users owning a ~/.t3 (RHS of each non-comment "authentik=os_user" map line). -osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; } -# authentik username for an OS user (reverse map; first match) — for dispatch verify. -ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; } +LOG_TAG=t3-autoupdate +# shellcheck source=scripts/t3-safe-restart.sh +. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}" + # is $1 a strictly-newer version than $2 (version-sort)? newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; } @@ -86,27 +79,21 @@ LOG "candidate: $current -> $target (track=$T3_TRACK, last_good=$last_good, dry_ # ---- helpers: backup, health-check, rollback, restart-verify -------------------- # Online consistent per-user snapshot (run AS the owner so WAL stays owned; never # stops the serve). 
Sets $ADMIN_SEED to wizard's backup for the migration health -# check. Mirrors t3-backup-state.sh. +# check. Mirrors t3-backup-state.sh. (backup_user lives in the shared lib.) ADMIN_SEED="" backup_all() { - local u src out dst ts; ts="$(date +%Y%m%d-%H%M%S)" + local u dst for u in $(osusers); do - src="/home/$u/.t3/userdata/state.sqlite"; [ -f "$src" ] || continue - out="$BACKUP_DIR/$u"; dst="$out/state-prebump-$target-$ts.sqlite" - install -d -o "$u" -g "$u" -m700 "$out" 2>/dev/null || mkdir -p "$out" - if runuser -u "$u" -- timeout "${T3_BACKUP_TIMEOUT:-900}" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [ -s "$dst" ]; then + if dst="$(backup_user "$u")"; then LOG "pre-bump backup: $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)" [ "$u" = "wizard" ] && ADMIN_SEED="$dst" else - LOG "WARN: pre-bump backup FAILED for $u ($src)"; rm -f "$dst" + LOG "WARN: pre-bump backup FAILED for $u (/home/$u/.t3/userdata/state.sqlite)" fi done [ -n "$ADMIN_SEED" ] || ADMIN_SEED="$(ls -1t "$BACKUP_DIR"/*/"state-prebump-$target-"*.sqlite 2>/dev/null | head -1)" } -# newest pre-bump backup taken THIS run for a user (for restore-on-rollback). -prebump_of() { ls -1t "$BACKUP_DIR/$1/state-prebump-$target-"*.sqlite 2>/dev/null | head -1; } - # health_check [seed_db]: start a throwaway serve (seeded with a copy of a # real populated DB if given, so the forward migration runs on real data), then do # the real mint -> credential-exchange -> t3_session pairing handshake with the @@ -143,27 +130,12 @@ health_check() { rm -rf "$dir"; return 1 } -# roll the GLOBAL binary back to last-good. Pre-restart failures need only this -# (no real DB migrated yet); post-restart failures also restore the user's DB. 
-rollback_binary() { - LOG "rolling back binary $target -> $last_good" - if npm i -g "t3@$last_good" >/dev/null 2>&1; then LOG "rolled back to $last_good"; return 0; fi - LOG "ROLLBACK FAILED — could not reinstall t3@$last_good (t3 may be broken; manual fix per runbook)"; return 1 -} - # is this t3-serve@ running an active agent (claude/codex/opencode)? never restart those. unit_busy() { local unit="$1" pid; pid="$(systemctl show -p MainPID --value "$unit" 2>/dev/null)" [ -n "$pid" ] && [ "$pid" != "0" ] && pgrep -aP "$pid" 2>/dev/null | grep -qiE 'claude|codex|opencode' } -# verify a user's pairing through the REAL dispatch (mint -> exchange -> cookie). -verify_pairing() { - local u="$1" ak out; ak="$(ak_for "$u")"; [ -n "$ak" ] || { LOG "no authentik mapping for $u — skipping dispatch verify"; return 0; } - out="$(curl -s -i --max-time 10 -H "X-authentik-username: $ak" -H 'Sec-Fetch-Dest: document' "http://$DISPATCH/" 2>/dev/null)" - printf '%s' "$out" | grep -qi '^set-cookie:[[:space:]]*t3_session=' -} - # ---- 3. 
DRY RUN: preview only (install candidate to temp prefix, gate it) ------- if [ "$DRY_RUN" = "1" ]; then LOG "DRY_RUN: would back up [$(osusers | tr '\n' ' ')]; testing candidate $target in a temp prefix (no global change, no restarts)" @@ -196,31 +168,15 @@ restarted=0; deferred=0 for unit in $(systemctl list-units --type=service --state=running --no-legend 't3-serve@*' 2>/dev/null | awk '{print $1}'); do u="$(printf '%s' "$unit" | sed -n 's/^t3-serve@\(.*\)\.service$/\1/p')"; [ -n "$u" ] || continue if unit_busy "$unit"; then - LOG "deferring $unit (active agent) — migrates on its next idle restart"; deferred=$((deferred+1)); continue + LOG "deferring $unit (active agent) — migrates on its next idle restart" + mkdir -p "$DEFER_DIR" 2>/dev/null && printf '%s\n' "$target" >"$DEFER_DIR/$u" # record for t3-migrate-idle + deferred=$((deferred+1)); continue fi - systemctl restart "$unit" || LOG "WARN: systemctl restart $unit returned non-zero" - ok=0 - for _ in $(seq 1 15); do - if verify_pairing "$u"; then ok=1; break; fi - sleep 2 - done - if [ "$ok" = "1" ]; then - LOG "restarted $unit -> $target (pairing verified via dispatch)"; restarted=$((restarted+1)) + if safe_restart_unit "$unit" "$u"; then + restarted=$((restarted+1)) + rm -f "$DEFER_DIR/$u" 2>/dev/null # now current — clear any stale marker else - LOG "HEALTH-CHECK FAILED: $u pairing broken AFTER restart onto $target — rolling back + restoring its DB" - rollback_binary - bak="$(prebump_of "$u")" - if [ -n "$bak" ]; then - systemctl stop "$unit" 2>/dev/null - if install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite" 2>/dev/null; then - rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm" - LOG "restored $u state.sqlite from $bak" - fi - systemctl start "$unit" 2>/dev/null - fi - touch "$FREEZE_FILE" 2>/dev/null - LOG "FROZEN ($FREEZE_FILE) after canary $u failed on $target; last_good stays $last_good — investigate, then remove the freeze file to resume" - 
exit 1 + exit 1 # frozen by safe_restart_unit — preserve today's behavior fi done diff --git a/scripts/t3-migrate-idle.service b/scripts/t3-migrate-idle.service new file mode 100644 index 00000000..97c28faa --- /dev/null +++ b/scripts/t3-migrate-idle.service @@ -0,0 +1,8 @@ +[Unit] +Description=t3 idle migrator — restart deferred t3-serve instances onto the current binary when idle +Documentation=https://forgejo.viktorbarzin.me/viktor/infra/src/branch/master/docs/plans/2026-06-21-t3-idle-migrate-design.md +After=network.target t3-dispatch.service + +[Service] +Type=oneshot +ExecStart=/usr/local/bin/t3-migrate-idle diff --git a/scripts/t3-migrate-idle.sh b/scripts/t3-migrate-idle.sh new file mode 100644 index 00000000..85835374 --- /dev/null +++ b/scripts/t3-migrate-idle.sh @@ -0,0 +1,86 @@ +#!/usr/bin/env bash +# t3-migrate-idle.sh — drains t3-autoupdate's deferral markers (via the overnight +# t3-migrate-idle.timer). For each deferred t3-serve@, if nothing is actively +# working in that instance (no in-flight turn + a quiet buffer), restart it onto the +# current binary using the shared safe_restart_unit, then clear the marker. +# Why this exists: t3-autoupdate defers a user with an active agent at its single +# daily window; a user busy every night never migrates and their client shows +# "Client and server versions differ". See docs/plans/2026-06-21-t3-idle-migrate-*. +set -uo pipefail + +LOG_TAG=t3-migrate-idle +# shellcheck source=scripts/t3-safe-restart.sh +. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}" + +QUIET_SECONDS="${T3_MIGRATE_QUIET_SECONDS:-900}" # required idle before a restart (15 min) +DRY_RUN="${T3_DRY_RUN:-0}" + +# pure logic: is it safe given and ? fail closed. 
+gate_is_safe() { + local active="$1" idle="$2" + case "$active" in ''|*[!0-9]*) return 1;; esac # unparseable/empty active -> unsafe + [ "$active" -eq 0 ] || return 1 # a turn is running -> unsafe + [ -z "$idle" ] && return 0 # no threads at all -> safe + case "$idle" in ''|*[!0-9-]*) return 1;; esac # non-numeric -> unsafe + [ "$idle" -ge "$QUIET_SECONDS" ] # negative or < quiet -> unsafe +} + +# query a state.sqlite (path or file: URI). Echoes "|". +# idle_seconds is empty when there are no rows. Normalizes ISO 'T'/'Z' for julianday. +gate_query() { + local db="$1" + sqlite3 -batch -noheader -separator '|' "$db" \ + "SELECT + (SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL), + CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT) + FROM projection_thread_sessions;" +} + +# safe_to_restart : wire runuser + the user's DB into gate_query/gate_is_safe. +safe_to_restart() { + local u="$1" db row + db="/home/$u/.t3/userdata/state.sqlite"; [ -f "$db" ] || return 1 + row="$(runuser -u "$u" -- sqlite3 -batch -noheader -separator '|' "file:$db?mode=ro" \ + "SELECT + (SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL), + CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT) + FROM projection_thread_sessions;" 2>/dev/null)" || return 1 + gate_is_safe "${row%%|*}" "${row##*|}" +} + +main() { + # a frozen build must not be auto-migrated (shared switch with t3-autoupdate) + if [ -e "$FREEZE_FILE" ]; then LOG "FROZEN: $FREEZE_FILE present — not draining deferrals"; exit 0; fi + [ -d "$DEFER_DIR" ] || exit 0 # nothing deferred + last_good="$(tr -d '[:space:]' <"$LAST_GOOD_FILE" 2>/dev/null)" # rollback target for the helper + + local marker u unit started mwritten migrated=0 skipped=0 + for marker in "$DEFER_DIR"/*; do + [ -e "$marker" ] || continue # empty-dir glob + u="$(basename "$marker")"; unit="t3-serve@$u.service" 
+ if ! systemctl is-active --quiet "$unit"; then + LOG "clearing marker for $u: $unit not active"; rm -f "$marker"; continue + fi + started="$(date -d "$(systemctl show -p ActiveEnterTimestamp --value "$unit" 2>/dev/null)" +%s 2>/dev/null || echo 0)" + mwritten="$(stat -c %Y "$marker" 2>/dev/null || echo 0)" + if [ "$started" -gt "$mwritten" ]; then + LOG "clearing marker for $u: $unit already restarted $((started-mwritten))s after the deferral"; rm -f "$marker"; continue + fi + if ! safe_to_restart "$u"; then skipped=$((skipped+1)); continue; fi + + target="$(tr -d '[:space:]' <"$marker" 2>/dev/null)"; [ -n "$target" ] || target="$(ver)" + if [ "$DRY_RUN" = "1" ]; then LOG "DRY_RUN: would migrate $unit -> $target (idle gate satisfied)"; continue; fi + if ! backup_user "$u" >/dev/null; then + LOG "WARN: pre-restart backup failed for $u — skipping (fail closed)"; skipped=$((skipped+1)); continue + fi + if safe_restart_unit "$unit" "$u"; then + LOG "migrated $unit -> $target (idle restart)"; rm -f "$marker"; migrated=$((migrated+1)) + else + LOG "migrate FAILED for $unit — recovery+freeze handled by safe_restart_unit; stopping drain"; exit 1 + fi + done + LOG "idle-migrate pass complete (migrated=$migrated skipped=$skipped)" +} + +# main-guard: run only when executed, not when sourced (tests source this file). 
+if [ "${BASH_SOURCE[0]}" = "${0}" ]; then main "$@"; fi diff --git a/scripts/t3-migrate-idle.timer b/scripts/t3-migrate-idle.timer new file mode 100644 index 00000000..0c847fa6 --- /dev/null +++ b/scripts/t3-migrate-idle.timer @@ -0,0 +1,10 @@ +[Unit] +Description=Overnight drain of t3-autoupdate deferrals (idle-gated t3-serve migration) + +[Timer] +OnCalendar=*-*-* 01..05:00/20 +RandomizedDelaySec=120 +Persistent=false + +[Install] +WantedBy=timers.target diff --git a/scripts/t3-provision-users.sh b/scripts/t3-provision-users.sh index ae1a7759..9cbc6c1e 100644 --- a/scripts/t3-provision-users.sh +++ b/scripts/t3-provision-users.sh @@ -29,6 +29,9 @@ REPO_REMOTE_BASE="${REPO_REMOTE_BASE:-https://forgejo.viktorbarzin.me/viktor}" # Per-user OIDC kubeconfig (kubelogin/PKCE; cluster server+CA copied from the admin kubeconfig). OIDC_ISSUER="${OIDC_ISSUER:-https://authentik.viktorbarzin.me/application/o/kubernetes/}" ADMIN_KUBECONFIG="${ADMIN_KUBECONFIG:-/home/wizard/.kube/config}" +# OS users (space-separated) that receive the vendored agent skills (scripts/workstation/claude-skills). +# Allowlist: install_skills no-ops for anyone not listed. Extend here to roll out to more users. +SKILL_USERS="${SKILL_USERS:-emo}" log() { echo "[t3-provision] $*"; } run() { if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] $*"; else "$@"; fi; } @@ -251,23 +254,50 @@ env_set() { chmod 600 "$file" } -# Share the admin's Claude subscription with a non-admin: inject CLAUDE_CODE_OAUTH_TOKEN -# (the staged long-lived token) into their t3-serve env — ONLY if they have neither their -# own ~/.claude/.credentials.json (own login) nor an existing token. Never clobbers. The -# agent picks it up when its t3-serve@ instance (re)starts. 
-install_user_claude_token() { - local user="$1" home envf tok - local token_file="${CLAUDE_TOKEN_FILE:-/etc/t3-serve/claude-oauth-token}" +env_unset() { + local file="$1" key="$2" + [[ -f "$file" ]] || return 0 + grep -q "^${key}=" "$file" || return 0 + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] unset $key -> $file"; return 0; fi + sed -i "/^${key}=.*/d" "$file" + chmod 600 "$file" + log "removed legacy shared $key -> $(basename "$file")" +} + +# Install one user's isolated Claude credential renewal flow. The scoped periodic +# Vault token is minted only when this reconcile has admin Vault access (normal +# onboarding/deployment); routine token renewal is performed by the user service. +install_claude_auth_sync() { + local user="$1" home cfg token_file token policy home="$(getent passwd "$user" | cut -d: -f6)" [[ -z "$home" ]] && return 0 - [[ -f "$home/.claude/.credentials.json" ]] && return 0 # has own login -> leave it - [[ -r "$token_file" ]] || return 0 - envf="${ENVDIR:-/etc/t3-serve}/$user.env" - grep -q '^CLAUDE_CODE_OAUTH_TOKEN=' "$envf" 2>/dev/null && return 0 # already shared - if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] share Claude token -> $envf"; return 0; fi - tok="$(cat "$token_file")" - env_set "$envf" CLAUDE_CODE_OAUTH_TOKEN "$tok" - log "shared Claude token -> $user (t3-serve env; restart needed to take effect)" + cfg="$home/.config/claude-auth-sync" + token_file="$cfg/vault-token" + policy="workstation-claude-$user" + + # The service sandbox makes the rest of $HOME read-only. Pre-create every + # writable path before systemd enters that sandbox; ReadWritePaths cannot + # create a missing child beneath a read-only parent. + if [[ "$DRY_RUN" == 1 ]]; then + echo "[dry-run] ensure Claude-auth state dirs -> $user" + else + install -d -o "$user" -g "$user" -m 0700 "$cfg" "$home/.local/state/claude-auth-sync" + fi + + if [[ ! 
-s "$token_file" ]]; then + if [[ "$DRY_RUN" == 1 ]]; then + echo "[dry-run] mint scoped Claude-auth Vault token -> $user" + elif vault token lookup >/dev/null 2>&1 && \ + token="$(vault token create -orphan -period=768h -policy="$policy" \ + -display-name="devvm-claude-auth-$user" -field=token 2>/dev/null)"; then + install -d -o "$user" -g "$user" -m 0700 "$cfg" + install -o "$user" -g "$user" -m 0600 /dev/stdin "$token_file" <<<"$token" + log "minted isolated Claude-auth Vault token -> $user" + else + log "WARN: scoped Claude-auth Vault token missing for $user (run provisioner with admin VAULT_TOKEN after vault stack apply)" + fi + fi + run systemctl enable --now "claude-auth-sync@$user.timer" >/dev/null 2>&1 || true } # Re-deploy the managed per-user Claude launcher to ~/start-claude.sh. /etc/skel only @@ -354,9 +384,133 @@ install_playwright() { run systemctl enable --now "playwright-snapshot-refresh@$user.timer" >/dev/null 2>&1 || true } +# Per-user homelab-memory setup — migrate off the claude-memory MCP/plugin to the +# homelab CLI hooks (auto-recall + auto-learn + compaction backup/recovery). +# Idempotent, if-absent, ADDITIVE: never clobbers `env` (the per-user +# MEMORY_API_KEY) or other MCP servers; removes ONLY the `claude_memory` MCP. +# Reuses the user's existing key — does NOT mint one (per-user isolation stays +# deferred, design 2026-06-08). The homelab CLI (/usr/local/bin/homelab) hits the +# same remote HTTP API the MCP used. Hook scripts: $WORKSTATION_DIR/claude-hooks. 
+install_memory() { + local user="$1" home + home="$(getent passwd "$user" | cut -d: -f6)" + [[ -n "$home" && -d "$home" ]] || return 0 + local src="$WORKSTATION_DIR/claude-hooks" hooks_dst="$home/.claude/hooks" settings="$home/.claude/settings.json" + [[ -d "$src" ]] || { log "WARN: $src missing -> skip memory setup for $user"; return 0; } + + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] memory: hooks + settings wire + claude_memory MCP removal -> $user"; return 0; fi + + # (1) (re)install the 4 hook scripts, owned by the user (refreshed each reconcile so fixes land) + install -d -o "$user" -g "$user" -m 0755 "$hooks_dst" + local h + for h in homelab-memory-recall.py auto-learn.py pre-compact-backup.sh post-compact-recovery.sh; do + install -o "$user" -g "$user" -m 0755 "$src/$h" "$hooks_dst/$h" + done + + # (2) wire the hooks in settings.json, if-absent + additive. Run the helper as ROOT: + # it must read $src under the admin's hardened home (mode 700), which a + # runuser-as-$user CANNOT traverse — so chown the result back to the user and + # enforce 0600 (it holds the per-user MEMORY_API_KEY). + if python3 "$src/wire-memory-hooks.py" "$home" >/dev/null 2>&1; then + [[ -f "$settings" ]] && chown "$user:$user" "$settings" 2>/dev/null || true + log "memory hooks wired -> $user" + else + log "WARN: memory hook wiring failed for $user (retries next reconcile)" + fi + [[ -f "$settings" ]] && chmod 600 "$settings" || true + + # (2b) reuse the user's existing key; warn (do NOT mint — needs an admin vault write) if absent. + if [[ -f "$settings" ]] && ! grep -q 'MEMORY_API_KEY' "$settings"; then + log "WARN: $user has no MEMORY_API_KEY in settings.json — homelab memory no-ops until an admin mints one" + fi + + # (3) remove the now-superseded claude_memory MCP (AS the user, if-present) + the plugin dir. 
+ if runuser -u "$user" -- bash -lc 'command -v claude >/dev/null 2>&1 && claude mcp get claude_memory >/dev/null 2>&1'; then + runuser -u "$user" -- bash -lc 'claude mcp remove claude_memory >/dev/null 2>&1' && log "removed claude_memory MCP -> $user" || true + fi + if [[ -d "$home/.claude/plugins/claude-memory" ]]; then + rm -rf "$home/.claude/plugins/claude-memory" && log "removed claude-memory plugin dir -> $user" + fi + return 0 # best-effort tail must never return non-zero, else set -euo pipefail aborts the whole reconcile +} + +# Per-user agent skills, vendored from the in-repo snapshot ($WORKSTATION_DIR/claude-skills) — the +# `npx skills` upstream drifted off this exact set, so we reproduce it offline + deterministically. +# if-absent + ADDITIVE: copies a skill dir into ~/.agents/skills/ (owned by the user) and +# symlinks ~/.claude/skills/ -> ../../.agents/skills/ (the layout `skills add -g` +# produces; Claude Code reads ~/.claude/skills/). Scoped to SKILL_USERS. if-absent keys on the +# user's OWN copy, so it heals a stale/cross-user ~/.claude/skills symlink but never clobbers a real +# skill dir. Best-effort tail: must return 0 or set -euo pipefail aborts the whole reconcile. 
+install_skills() { + local user="$1" home + home="$(getent passwd "$user" | cut -d: -f6)" + [[ -n "$home" && -d "$home" ]] || return 0 + case " $SKILL_USERS " in *" $user "*) ;; *) return 0 ;; esac + local src_root="$WORKSTATION_DIR/claude-skills" + [[ -d "$src_root" ]] || { log "WARN: $src_root missing -> skip skills for $user"; return 0; } + + if [[ "$DRY_RUN" == 1 ]]; then + local d names="" + for d in "$src_root"/*/; do [[ -d "$d" ]] && names+="$(basename "$d") "; done + echo "[dry-run] vendor skills if-absent -> $user: ${names}" + return 0 + fi + + local agents_dir="$home/.agents/skills" claude_dir="$home/.claude/skills" + # own the parent ~/.agents too (install -d leaves created intermediates root-owned) + install -d -o "$user" -g "$user" -m 0755 "$home/.agents" "$agents_dir" "$claude_dir" + chown "$user:$user" "$home/.agents" || true + + local skill name dst link n=0 + for skill in "$src_root"/*/; do + [[ -d "$skill" ]] || continue + name="$(basename "$skill")" + dst="$agents_dir/$name" + link="$claude_dir/$name" + # if-absent keys on the user's OWN copy (a real dir under ~/.agents/skills), NOT on any + # pre-existing ~/.claude/skills entry — so a stale or cross-user symlink gets healed. + if [[ ! -d "$dst" ]]; then + cp -a "$src_root/$name" "$dst" || { log "WARN: copy skill $name -> $user failed"; continue; } + chown -R "$user:$user" "$dst" || true + n=$((n+1)) + fi + # point ~/.claude/skills/ at the user's own copy (replacing a stale/cross-user symlink); + # never clobber a real dir/file squatting that name. + if [[ -d "$link" && ! 
-L "$link" ]]; then + log "WARN: $claude_dir/$name is a real dir (left as-is) for $user" + elif [[ "$(readlink "$link" 2>/dev/null)" != "../../.agents/skills/$name" ]]; then + ln -sfn "../../.agents/skills/$name" "$link" && chown -h "$user:$user" "$link" || log "WARN: link skill $name -> $user failed" + fi + done + if [[ "$n" -gt 0 ]]; then log "vendored/healed $n skill(s) -> $user"; fi + return 0 # best-effort tail must never return non-zero, else set -euo pipefail aborts the reconcile +} + [[ $EUID -eq 0 ]] || { echo "t3-provision-users: must run as root" >&2; exit 1; } for bin in python3 jq; do command -v "$bin" >/dev/null || { echo "missing $bin" >&2; exit 1; }; done [[ -f "$ROSTER" && -f "$ENGINE" ]] || { echo "roster/engine not under $WORKSTATION_DIR" >&2; exit 1; } + +# 0) self-deploy: the repo is the authoring surface (like sync_managed_config / +# deploy_user_launcher below). Nothing else redeploys /usr/local/bin (only the +# manual setup-devvm.sh did) — so a committed edit silently never reached the +# hourly run until now (the homelab-memory rollout sat undeployed for a day). +# If the repo copy differs, install it and re-exec the fresh binary. Guarded: +# re-exec flag (no loop), bash -n (never deploy a broken script), DRY_RUN (no +# mutation), cmp (no churn when unchanged). +SELF_SRC="$WORKSTATION_DIR/../t3-provision-users.sh" +SELF_DST=/usr/local/bin/t3-provision-users +if [[ -z "${T3_PROVISION_SELF_DEPLOYED:-}" && -r "$SELF_SRC" ]] && ! 
cmp -s "$SELF_SRC" "$SELF_DST"; then + if [[ "$DRY_RUN" == 1 ]]; then + echo "[dry-run] self-deploy $SELF_DST from repo (changed)" + elif bash -n "$SELF_SRC" 2>/dev/null; then + install -m 0755 "$SELF_SRC" "$SELF_DST" + log "self-deployed $SELF_DST from repo (changed) — re-exec" + exec env T3_PROVISION_SELF_DEPLOYED=1 "$SELF_DST" "$@" + else + log "WARN: repo t3-provision-users.sh fails 'bash -n' — keeping deployed copy" + fi +fi + install -d -m 0755 "$ENVDIR" # 1) current sticky ports from existing .env files -> {os_user: port} @@ -421,7 +575,7 @@ while IFS=$'\t' read -r os_user tier shell groups_csv code_layout repos_csv; do log "add $os_user -> group $g"; run gpasswd -a "$os_user" "$g" >/dev/null done fi - if [[ "$tier" != admin ]]; then # non-admins: locked clone(s) (kept fresh) + kubeconfig + shared Claude token + if [[ "$tier" != admin ]]; then # non-admins: locked clone(s) (kept fresh) + kubeconfig if [[ "$code_layout" == workspace ]]; then ensure_workspace_layout "$os_user" install_locked_clone "$os_user" code/infra @@ -440,17 +594,20 @@ while IFS=$'\t' read -r os_user tier shell groups_csv code_layout repos_csv; do refresh_user_clone "$os_user" code fi install_user_kubeconfig "$os_user" - install_user_claude_token "$os_user" deploy_user_launcher "$os_user" # keep ~/start-claude.sh current (skel only seeds new accounts) fi refresh_codex_mirror "$os_user" # all tiers — mirror of the managed claudeMd install_user_claude_native "$os_user" # all tiers — per-user native claude (terminal + t3); no npm/npx + install_claude_auth_sync "$os_user" # all tiers — own Claude identity + isolated Vault recovery done < <(jq -r '.accounts[] | [.os_user, .tier, .shell, (if (.groups|length)==0 then "-" else (.groups|join(",")) end), .code_layout, (if (.repos|length)==0 then "-" else (.repos|join(",")) end)] | @tsv' "$desired_file") # 5) per-user .env (sticky port) + enable t3-serve@ while IFS=$'\t' read -r os_user port; do envf="$ENVDIR/$os_user.env" - env_set "$envf" T3_PORT 
"$port" # update-or-append; preserves CLAUDE_CODE_OAUTH_TOKEN + env_set "$envf" T3_PORT "$port" + # Per-user Enterprise login is authoritative. A legacy shared setup-token has + # higher credential precedence and would silently defeat user isolation. + env_unset "$envf" CLAUDE_CODE_OAUTH_TOKEN id "$os_user" >/dev/null 2>&1 && run systemctl enable --now "t3-serve@$os_user.service" >/dev/null 2>&1 || true done < <(jq -r '.ports | to_entries[] | [.key, .value] | @tsv' "$desired_file") @@ -464,6 +621,21 @@ while IFS=$'\t' read -r os_user pw_port; do install_playwright "$os_user" done < <(jq -r '.playwright_ports | to_entries[] | [.key, .value] | @tsv' "$desired_file") +# 5d) per-user homelab-memory (ALL users): replace the claude-memory MCP/plugin with the +# homelab CLI memory hooks. Idempotent + additive + if-absent; never touches the +# per-user MEMORY_API_KEY or other MCP servers (removes ONLY claude_memory). +while IFS=$'\t' read -r os_user; do + id "$os_user" >/dev/null 2>&1 || continue + install_memory "$os_user" +done < <(jq -r '.accounts[].os_user' "$desired_file") + +# 5e) per-user agent skills (SKILL_USERS allowlist only): vendored snapshot -> ~/.agents/skills +# + ~/.claude/skills symlinks. if-absent + additive; best-effort (never aborts the reconcile). +while IFS=$'\t' read -r os_user; do + id "$os_user" >/dev/null 2>&1 || continue + install_skills "$os_user" +done < <(jq -r '.accounts[].os_user' "$desired_file") + # 5b) machine-wide (once, not per-user): keep the t3 gated nightly TRACKER timer enabled (it # follows t3@nightly daily, gated; see t3-autoupdate.sh / docs/runbooks/t3-version-bump.md). # NEVER --now: the tracker installs a NEW build + migrates DBs + restarts serves, so firing diff --git a/scripts/t3-safe-restart.sh b/scripts/t3-safe-restart.sh new file mode 100644 index 00000000..63a6c455 --- /dev/null +++ b/scripts/t3-safe-restart.sh @@ -0,0 +1,96 @@ +#!/usr/bin/env bash +# t3-safe-restart.sh — SOURCED library (not executed). 
Shared by t3-autoupdate.sh +# (daily gated tracker) and t3-migrate-idle.sh (overnight deferral drainer). +# +# Holds the per-unit "dangerous" routine — backup -> restart -> verify pairing -> +# recover (restore DB + roll global binary back to last-good + freeze) — extracted +# verbatim from t3-autoupdate.sh step 6, plus the small helpers it depends on. +# The only change from the inline original: safe_restart_unit RETURNS non-zero on +# failure (after performing recovery+freeze) instead of `exit 1`, so the CALLER +# decides what to do (the daily job exits; the idle job stops draining). +# +# Callers must set, before calling safe_restart_unit: $target (version being moved +# TO, for log lines + the prebump filename) and $last_good (rollback target). +# Set $LOG_TAG before sourcing to tag syslog ("t3-autoupdate" / "t3-migrate-idle"). + +# ---- shared config defaults (honour the original T3_* override names) ----------- +: "${LOG_TAG:=t3-safe-restart}" +: "${FREEZE_FILE:=${T3_FREEZE_FILE:-/etc/t3-autoupdate.freeze}}" +: "${STATE_DIR:=${T3_STATE_DIR:-/var/lib/t3-autoupdate}}" +: "${LAST_GOOD_FILE:=$STATE_DIR/last-good}" +: "${DEFER_DIR:=$STATE_DIR/deferred}" +: "${BACKUP_DIR:=${T3_BACKUP_DEST:-/var/backups/t3-state}}" +: "${DISPATCH:=${T3_DISPATCH:-127.0.0.1:3780}}" +: "${USER_MAP:=${T3_USER_MAP:-/etc/ttyd-user-map}}" +: "${T3_BACKUP_TIMEOUT:=900}" + +LOG() { logger -t "$LOG_TAG" "$*"; echo "$LOG_TAG: $*"; } +ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; } +# OS users owning a ~/.t3 (RHS of each non-comment "authentik=os_user" map line). +osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; } +# authentik username for an OS user (reverse map; first match) — for dispatch verify. 
+ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; } + +# Online consistent snapshot of ONE user's state.sqlite (run AS the owner so the +# WAL stays owned; never stops the serve). Uses global $target for the filename. +# Echoes the backup path on success; non-zero on failure. +backup_user() { + local u="$1" src out dst ts + src="/home/$u/.t3/userdata/state.sqlite"; [ -f "$src" ] || return 1 + ts="$(date +%Y%m%d-%H%M%S)" + out="$BACKUP_DIR/$u"; dst="$out/state-prebump-$target-$ts.sqlite" + install -d -o "$u" -g "$u" -m700 "$out" 2>/dev/null || mkdir -p "$out" + if runuser -u "$u" -- timeout "$T3_BACKUP_TIMEOUT" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [ -s "$dst" ]; then + printf '%s\n' "$dst"; return 0 + fi + rm -f "$dst"; return 1 +} + +# newest pre-bump backup for a user taken for the current $target (restore source). +prebump_of() { ls -1t "$BACKUP_DIR/$1/state-prebump-$target-"*.sqlite 2>/dev/null | head -1; } + +# roll the GLOBAL binary back to last-good. In the idle path last_good==installed, +# so this is a harmless no-op reinstall (does NOT downgrade other users). +rollback_binary() { + LOG "rolling back binary $target -> $last_good" + if npm i -g "t3@$last_good" >/dev/null 2>&1; then LOG "rolled back to $last_good"; return 0; fi + LOG "ROLLBACK FAILED — could not reinstall t3@$last_good (t3 may be broken; manual fix per runbook)"; return 1 +} + +# verify a user's pairing through the REAL dispatch (mint -> exchange -> cookie). 
+verify_pairing() { + local u="$1" ak out; ak="$(ak_for "$u")"; [ -n "$ak" ] || { LOG "no authentik mapping for $u — skipping dispatch verify"; return 0; } + out="$(curl -s -i --max-time 10 -H "X-authentik-username: $ak" -H 'Sec-Fetch-Dest: document' "http://$DISPATCH/" 2>/dev/null)" + printf '%s' "$out" | grep -qi '^set-cookie:[[:space:]]*t3_session=' +} + +# safe_restart_unit : restart the unit, verify pairing; on failure +# restore the user's DB from its pre-restart backup, roll the binary back, freeze. +# Assumes a pre-restart backup already exists for at the current $target +# (the daily job's backup_all, or the idle job's backup_user, takes it first). +# Returns 0 on verified success, non-zero after recovery+freeze on failure. +safe_restart_unit() { + local unit="$1" u="$2" ok=0 _ bak + systemctl restart "$unit" || LOG "WARN: systemctl restart $unit returned non-zero" + for _ in $(seq 1 15); do + if verify_pairing "$u"; then ok=1; break; fi + sleep 2 + done + if [ "$ok" = "1" ]; then + LOG "restarted $unit -> $target (pairing verified via dispatch)"; return 0 + fi + LOG "HEALTH-CHECK FAILED: $u pairing broken AFTER restart onto $target — rolling back + restoring its DB" + rollback_binary + bak="$(prebump_of "$u")" + if [ -n "$bak" ]; then + systemctl stop "$unit" 2>/dev/null + if install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite" 2>/dev/null; then + rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm" + LOG "restored $u state.sqlite from $bak" + fi + systemctl start "$unit" 2>/dev/null + fi + touch "$FREEZE_FILE" 2>/dev/null + LOG "FROZEN ($FREEZE_FILE) after $u failed on $target; last_good stays $last_good — investigate, then remove the freeze file to resume" + return 1 +} diff --git a/scripts/test-claude-auth-sync.sh b/scripts/test-claude-auth-sync.sh new file mode 100755 index 00000000..10f07746 --- /dev/null +++ b/scripts/test-claude-auth-sync.sh @@ -0,0 +1,32 @@ +#!/usr/bin/env bash +set -uo 
pipefail +DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +# shellcheck source=workstation/claude-auth-sync.sh +source "$DIR/workstation/claude-auth-sync.sh" + +pass=0 fail=0 +ok() { if "${@:2}"; then pass=$((pass+1)); else fail=$((fail+1)); echo "FAIL: $1"; fi; } +no() { if "${@:2}"; then fail=$((fail+1)); echo "FAIL: $1"; else pass=$((pass+1)); fi; } +eq() { if [[ "$2" == "$3" ]]; then pass=$((pass+1)); else fail=$((fail+1)); echo "FAIL: $1"; fi; } + +tmp="$(mktemp -d)"; trap 'rm -rf "$tmp"' EXIT +valid='{"mcpOAuth":{"server":{"accessToken":"mcp-secret"}},"claudeAiOauth":{"accessToken":"access","refreshToken":"refresh","expiresAt":123,"scopes":["user:inference"]}}' +printf '%s\n' "$valid" > "$tmp/credentials.json" + +oauth="$(cas_oauth_from_credentials "$tmp/credentials.json")" +eq "extract OAuth object" 'access' "$(jq -r .accessToken <<<"$oauth")" +printf '{"claudeAiOauth":{"accessToken":"access","expiresAt":123}}\n' > "$tmp/bad.json" +no "reject missing refresh token" cas_oauth_from_credentials "$tmp/bad.json" + +replacement='{"accessToken":"new-access","refreshToken":"new-refresh","expiresAt":456}' +merged="$(cas_merge_oauth "$tmp/credentials.json" "$replacement")" +eq "replace Claude access token" new-access "$(jq -r .claudeAiOauth.accessToken <<<"$merged")" +eq "preserve MCP OAuth" mcp-secret "$(jq -r '.mcpOAuth.server.accessToken' <<<"$merged")" + +export CAS_USER=emo +ok "accept own scoped Vault token" cas_vault_identity_ok token-devvm-claude-auth-emo default,workstation-claude-emo +no "reject another user's token" cas_vault_identity_ok token-devvm-claude-auth-anca default,workstation-claude-anca +no "reject wrong policy" cas_vault_identity_ok token-devvm-claude-auth-emo default,workstation-claude-anca + +printf '\n%d passed, %d failed\n' "$pass" "$fail" +(( fail == 0 )) diff --git a/scripts/test_tg_lock_timeout.py b/scripts/test_tg_lock_timeout.py new file mode 100644 index 00000000..263e5a74 --- /dev/null +++ b/scripts/test_tg_lock_timeout.py @@ -0,0 
+1,102 @@ +#!/usr/bin/env python3 +"""Tests for scripts/tg lock-timeout injection. + +scripts/tg wraps terragrunt. Tier-1 stacks rely on terraform's pg-backend +state lock; without -lock-timeout an apply fails instantly ("Error acquiring +the state lock") whenever anything else holds the lock — a Woodpecker-killed +run whose PG advisory lock has not been reaped yet, a concurrent local apply, +or the daily drift `plan`. This was the single largest cause of infra CI +failures. These tests pin that tg injects -lock-timeout for state-locking +verbs (and still preserves -auto-approve for non-interactive applies), so a +contended lock waits rather than fails. + +Hermetic: a stub `terragrunt` on PATH records the args tg forwards; PG_CONN_STR +is pre-set so the Tier-1 Vault credential fetch is skipped (no network/Vault). +""" +import os +import shutil +import subprocess +from pathlib import Path + +import pytest + +SCRIPTS_DIR = Path(__file__).resolve().parent +TG = SCRIPTS_DIR / "tg" +AUTH_CHECK = SCRIPTS_DIR / "check-ingress-auth-comments.py" + + +def _run(tmp_path, *tg_args, env_extra=None): + """Run a copy of scripts/tg in an isolated fake repo; return forwarded args.""" + repo = tmp_path / "repo" + (repo / "scripts").mkdir(parents=True) + shutil.copy(TG, repo / "scripts" / "tg") + shutil.copy(AUTH_CHECK, repo / "scripts" / "check-ingress-auth-comments.py") + os.chmod(repo / "scripts" / "tg", 0o755) + os.chmod(repo / "scripts" / "check-ingress-auth-comments.py", 0o755) + + # Fake Tier-1 stack ("faketest" is NOT in TIER0_STACKS), no ingress auth lines. + stack = repo / "stacks" / "faketest" + stack.mkdir(parents=True) + (stack / "terragrunt.hcl").write_text("# fake\n") + (stack / "main.tf").write_text("# no ingress_factory auth lines here\n") + + # Stub terragrunt: append every forwarded arg (one per line) to a capture file. 
+ bindir = tmp_path / "bin" + bindir.mkdir() + capture = tmp_path / "tg_args.txt" + stub = bindir / "terragrunt" + stub.write_text( + "#!/usr/bin/env bash\n" + f'for a in "$@"; do echo "$a" >> "{capture}"; done\n' + "exit 0\n" + ) + os.chmod(stub, 0o755) + + env = dict(os.environ) + env["PATH"] = f"{bindir}:{env['PATH']}" + env["PG_CONN_STR"] = "postgres://stub" # skip the Tier-1 Vault cred fetch + env["TF_PLUGIN_CACHE_DIR"] = str(tmp_path / "plugin-cache") + if env_extra: + env.update(env_extra) + + proc = subprocess.run( + ["bash", str(repo / "scripts" / "tg"), *tg_args], + cwd=str(stack), + env=env, + capture_output=True, + text=True, + ) + assert proc.returncode == 0, f"tg exited {proc.returncode}\nSTDERR:\n{proc.stderr}\nSTDOUT:\n{proc.stdout}" + return capture.read_text().splitlines() if capture.exists() else [] + + +def test_apply_non_interactive_has_lock_timeout_and_auto_approve(tmp_path): + args = _run(tmp_path, "apply", "--non-interactive") + assert "apply" in args + assert "-auto-approve" in args, "non-interactive apply must keep -auto-approve" + assert "-lock-timeout=5m" in args, "apply must wait for a contended state lock" + + +def test_plan_has_lock_timeout_but_not_auto_approve(tmp_path): + args = _run(tmp_path, "plan") + assert "plan" in args + assert "-lock-timeout=5m" in args + assert "-auto-approve" not in args, "plan must never get -auto-approve" + + +@pytest.mark.parametrize("verb", ["destroy", "refresh"]) +def test_locking_verb_gets_lock_timeout(tmp_path, verb): + args = _run(tmp_path, verb) + assert "-lock-timeout=5m" in args, f"{verb} should carry -lock-timeout" + + +def test_non_locking_verb_has_no_lock_timeout(tmp_path): + # validate does not take a state lock — must not carry -lock-timeout. 
+ args = _run(tmp_path, "validate") + assert "validate" in args + assert not any(a.startswith("-lock-timeout") for a in args) + + +def test_lock_timeout_is_env_overridable(tmp_path): + args = _run(tmp_path, "plan", env_extra={"TG_LOCK_TIMEOUT": "2m"}) + assert "-lock-timeout=2m" in args diff --git a/scripts/tg b/scripts/tg index b9e9f0da..b0574f89 100755 --- a/scripts/tg +++ b/scripts/tg @@ -13,6 +13,15 @@ export TF_PLUGIN_CACHE_DIR="${TF_PLUGIN_CACHE_DIR:-$HOME/.terraform.d/plugin-cac export TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE=1 mkdir -p "$TF_PLUGIN_CACHE_DIR" +# State-lock wait window. Tier-1 stacks lock their state via terraform's pg +# backend (pg_advisory_lock); with no timeout an apply fails instantly +# ("Error acquiring the state lock") the moment anything else holds the lock — +# a Woodpecker-killed run whose lock PG hasn't reaped yet, a concurrent local +# apply, or the daily drift `plan`. Waiting a few minutes absorbs all of those +# (the holder finishes, or PG reaps the dead backend). This was the #1 cause of +# infra CI failures. Override with TG_LOCK_TIMEOUT (e.g. 0 to fail fast). +LOCK_TIMEOUT="${TG_LOCK_TIMEOUT:-5m}" + # Determine stack name from cwd (relative to stacks/) STACK_NAME="" cwd="$(pwd)" @@ -134,29 +143,30 @@ if $is_mutating && [ -n "$STACK_NAME" ] && is_tier0 "$STACK_NAME"; then fi fi -# If running apply with --non-interactive, add -auto-approve for Terraform +# Build the terragrunt invocation: +# - add -auto-approve right after `apply` for --non-interactive runs (CI) +# - add -lock-timeout for state-locking verbs (plan/apply/destroy/refresh) so +# a contended state lock WAITS instead of failing instantly (see +# LOCK_TIMEOUT above). Non-locking verbs (init/validate/output/fmt) skip it. 
args=("$@") -has_apply=false has_non_interactive=false for arg in "${args[@]}"; do case "$arg" in - apply) has_apply=true ;; --non-interactive) has_non_interactive=true ;; esac done -if $has_apply && $has_non_interactive; then - new_args=() - for arg in "${args[@]}"; do - new_args+=("$arg") - if [ "$arg" = "apply" ]; then - new_args+=("-auto-approve") - fi - done - terragrunt "${new_args[@]}" -else - terragrunt "$@" +tg_args=() +for arg in "${args[@]}"; do + tg_args+=("$arg") + if [ "$arg" = "apply" ] && $has_non_interactive; then + tg_args+=("-auto-approve") + fi +done +if $is_tf_op; then + tg_args+=("-lock-timeout=$LOCK_TIMEOUT") fi +terragrunt "${tg_args[@]}" # After mutating operations: encrypt+commit (Tier 0) or no-op (Tier 1 — PG is authoritative) if $is_mutating && [ -n "$STACK_NAME" ] && is_tier0 "$STACK_NAME"; then diff --git a/scripts/workstation/claude-auth-sync.sh b/scripts/workstation/claude-auth-sync.sh new file mode 100755 index 00000000..dc3d780d --- /dev/null +++ b/scripts/workstation/claude-auth-sync.sh @@ -0,0 +1,153 @@ +#!/usr/bin/env bash +# Keep one Workstation user's Claude subscription OAuth credentials recoverable. +# Claude owns access/refresh-token rotation in ~/.claude/.credentials.json. This +# helper validates auth with real inference, stores only the claudeAiOauth object +# in the user's isolated Vault path, and attempts one restore on failure. 
+set -euo pipefail + +CAS_USER="${CLAUDE_AUTH_USER:-$(id -un)}" +CAS_HOME="${HOME:?HOME must be set}" +CAS_CREDENTIALS="${CLAUDE_CREDENTIALS_FILE:-$CAS_HOME/.claude/.credentials.json}" +CAS_CONFIG_DIR="${CLAUDE_AUTH_CONFIG_DIR:-$CAS_HOME/.config/claude-auth-sync}" +CAS_VAULT_TOKEN_FILE="${CLAUDE_AUTH_VAULT_TOKEN_FILE:-$CAS_CONFIG_DIR/vault-token}" +CAS_VAULT_PATH="${CLAUDE_AUTH_VAULT_PATH:-secret/workstation/claude-users/$CAS_USER}" +CAS_STATE_DIR="${CLAUDE_AUTH_STATE_DIR:-$CAS_HOME/.local/state/claude-auth-sync}" +CAS_LOG="$CAS_STATE_DIR/sync.log" + +cas_log() { + mkdir -p "$CAS_STATE_DIR" + printf '%s %s\n' "$(date -Is)" "$*" >> "$CAS_LOG" + logger -t claude-auth-sync -- "user=$CAS_USER $*" 2>/dev/null || true +} + +# Print the Claude OAuth object, or fail without exposing any token material. +cas_oauth_from_credentials() { + jq -ce '.claudeAiOauth + | select((.accessToken | type) == "string" and (.accessToken | length) > 0) + | select((.refreshToken | type) == "string" and (.refreshToken | length) > 0) + | select((.expiresAt | type) == "number")' "$1" +} + +# Merge a recovered OAuth object while preserving unrelated credentials (MCP OAuth). 
+cas_merge_oauth() { + local credentials="$1" oauth="$2" + jq -ce --argjson oauth "$oauth" '.claudeAiOauth = $oauth' "$credentials" +} + +cas_vault_identity_ok() { + local display_name="$1" policies_csv="$2" + [[ "$display_name" == "token-devvm-claude-auth-$CAS_USER" ]] || return 1 + printf ',%s,' "$policies_csv" | grep -q ",workstation-claude-$CAS_USER," +} + +cas_prepare_vault() { + [[ -s "$CAS_VAULT_TOKEN_FILE" ]] || { + cas_log "FAIL missing scoped Vault token; admin must run workstation provisioning" + return 1 + } + export VAULT_ADDR="${VAULT_ADDR:-https://vault.viktorbarzin.me}" + VAULT_TOKEN="$(<"$CAS_VAULT_TOKEN_FILE")"; export VAULT_TOKEN + + local info display_name policies + info="$(vault token lookup -format=json 2>/dev/null)" || { + cas_log "FAIL scoped Vault token lookup failed" + return 1 + } + display_name="$(jq -r '.data.display_name // ""' <<<"$info")" + policies="$(jq -r '((.data.policies // []) + (.data.identity_policies // [])) | join(",")' <<<"$info")" + cas_vault_identity_ok "$display_name" "$policies" || { + cas_log "FAIL scoped Vault token drift detected; refusing foreign token" + return 1 + } + vault token renew -format=json >/dev/null 2>&1 || { + cas_log "FAIL scoped Vault token renewal failed" + return 1 + } +} + +# auth status is not authoritative: it reported loggedIn=true during a real 401 +# on 2026-06-20. A tiny, non-persistent inference is the feedback loop. +cas_live_auth_ok() { + local out + out="$(timeout 60 claude -p 'Reply with exactly AUTH_OK and nothing else.' 
\ + --model haiku --max-turns 1 --no-session-persistence --tools "" \ + --disable-slash-commands --setting-sources "" 2>/dev/null)" || return 1 + [[ "$out" == "AUTH_OK" ]] +} + +cas_backup() { + local oauth expires + oauth="$(cas_oauth_from_credentials "$CAS_CREDENTIALS")" || { + cas_log "FAIL local Claude OAuth credential is absent or malformed" + return 1 + } + expires="$(jq -r '.expiresAt' <<<"$oauth")" + vault kv put "$CAS_VAULT_PATH" \ + claude_ai_oauth_json="$oauth" \ + credential_expires_at_ms="$expires" \ + backed_up_at="$(date -Is)" >/dev/null || { + cas_log "FAIL Vault credential backup failed" + return 1 + } + cas_log "OK Claude auth valid; refreshed OAuth state backed up to Vault" +} + +cas_restore() { + local oauth base tmp + oauth="$(vault kv get -field=claude_ai_oauth_json "$CAS_VAULT_PATH" 2>/dev/null)" || { + cas_log "FAIL no recoverable Claude OAuth credential in Vault" + return 1 + } + jq -e 'select((.accessToken | type) == "string" and (.accessToken | length) > 0) + | select((.refreshToken | type) == "string" and (.refreshToken | length) > 0) + | select((.expiresAt | type) == "number")' <<<"$oauth" >/dev/null || { + cas_log "FAIL Vault Claude OAuth credential is malformed" + return 1 + } + + mkdir -p "$(dirname "$CAS_CREDENTIALS")" + if jq -e 'type == "object"' "$CAS_CREDENTIALS" >/dev/null 2>&1; then + base="$CAS_CREDENTIALS" + else + base="$(mktemp)"; printf '{}\n' > "$base" + fi + tmp="$(mktemp "${CAS_CREDENTIALS}.XXXXXX")" + if ! 
cas_merge_oauth "$base" "$oauth" > "$tmp"; then + rm -f "$tmp"; [[ "$base" == "$CAS_CREDENTIALS" ]] || rm -f "$base" + cas_log "FAIL could not merge Vault Claude OAuth credential" + return 1 + fi + chmod 0600 "$tmp" + mv "$tmp" "$CAS_CREDENTIALS" + [[ "$base" == "$CAS_CREDENTIALS" ]] || rm -f "$base" + cas_log "RECOVERED restored Claude OAuth state from Vault" +} + +cas_main() { + umask 077 + for bin in jq vault claude timeout flock; do + command -v "$bin" >/dev/null || { cas_log "FAIL missing dependency: $bin"; return 1; } + done + mkdir -p "$CAS_STATE_DIR" + exec 9>"$CAS_STATE_DIR/lock" + flock -n 9 || { cas_log "SKIP another sync is already running"; return 0; } + + cas_prepare_vault || return 1 + if cas_live_auth_ok; then + cas_backup + return + fi + + cas_log "WARN live Claude auth failed; attempting one Vault restore" + cas_restore || return 1 + if cas_live_auth_ok; then + cas_backup + return + fi + cas_log "FAIL Claude auth still invalid after Vault restore; interactive SSO login required" + return 1 +} + +if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then + cas_main "$@" +fi diff --git a/scripts/workstation/claude-hooks/auto-learn.py b/scripts/workstation/claude-hooks/auto-learn.py new file mode 100755 index 00000000..174431f9 --- /dev/null +++ b/scripts/workstation/claude-hooks/auto-learn.py @@ -0,0 +1,184 @@ +#!/usr/bin/env python3 +""" +Stop hook (async): automatic learning extraction via haiku-as-judge. + +After each Claude response, sends the user message + assistant response to +haiku to detect corrections, preferences, decisions, or facts worth storing. +If learning events are detected, stores them via the `homelab memory` CLI — the +only sanctioned memory path on the devvm (no direct HTTP, no local SQLite). + +Runs with async: true — does NOT block the user. +""" + +import io +import json +import logging +import os +import shutil +import subprocess +import sys + +logger = logging.getLogger(__name__) + +JUDGE_PROMPT = """You are a memory extraction judge. 
Analyze this exchange between a user and an AI assistant. + +USER MESSAGE: +{user_message} + +ASSISTANT RESPONSE: +{assistant_response} + +Your job: determine if any of these learning events occurred: +1. USER CORRECTION — user corrected the assistant's mistake or misunderstanding +2. PREFERENCE — user stated a preference, habit, or "I like/prefer/want" statement +3. DECISION — a decision was reached about how to do something +4. FACT — user shared a durable fact about themselves, their team, tools, or environment + +If ANY learning event occurred, return JSON: +{{"events": [{{"type": "correction|preference|decision|fact", "content": "concise fact to remember (one sentence)", "importance": 0.7, "expanded_keywords": "space-separated semantically related search terms for recall (minimum 5 words)", "supersedes": null}}]}} + +If NO learning event occurred, return: +{{"events": []}} + +Rules: +- Only extract DURABLE facts, not transient task details +- Corrections are highest value (0.8-0.9) +- Be conservative — false negatives are better than false positives +- "expanded_keywords" should include synonyms, related concepts, and adjacent topics that would help find this memory later +- "supersedes" should be a search query to find the old outdated memory, or null +- Return ONLY valid JSON, no other text""" + + +def _store_via_homelab_cli(content, category, tags, importance, expanded_keywords): + """Store one memory via the homelab CLI — the only sanctioned memory path on + the devvm (no direct HTTP, no local SQLite). The CLI defaults the API URL and + reads CLAUDE_MEMORY_API_KEY / MEMORY_API_KEY from the environment; if neither + is set (e.g. 
a user without a minted key) it no-ops silently.""" + homelab = shutil.which("homelab") or "/usr/local/bin/homelab" + if not os.path.exists(homelab): + return + if not (os.environ.get("CLAUDE_MEMORY_API_KEY") or os.environ.get("MEMORY_API_KEY")): + return + cmd = [ + homelab, "memory", "store", content, + "--category", category, + "--tags", tags, + "--importance", str(importance), + ] + if expanded_keywords: + # CLI wants comma-separated keywords; the judge emits space-separated terms. + keywords = ",".join(expanded_keywords.replace(",", " ").split()) + if keywords: + cmd += ["--keywords", keywords] + subprocess.run(cmd, capture_output=True, text=True, timeout=15, env=os.environ) + + +def main() -> None: + # Graceful exit if claude CLI is not available + if not shutil.which("claude"): + return + + try: + hook_input = json.load(sys.stdin) + except (json.JSONDecodeError, EOFError): + return + + if isinstance(hook_input, dict) and hook_input.get("stop_hook_active", False): + return + + transcript_path = "" + if isinstance(hook_input, dict): + transcript_path = hook_input.get("transcript_path", "") + + if not transcript_path or not os.path.exists(transcript_path): + return + + user_message = "" + assistant_response = "" + try: + MAX_TAIL_BYTES = 50_000 + with open(transcript_path, "rb") as f: + f.seek(0, io.SEEK_END) + size = f.tell() + f.seek(max(0, size - MAX_TAIL_BYTES)) + tail = f.read().decode("utf-8", errors="replace") + lines = tail.split("\n") + + for line in reversed(lines): + line = line.strip() + if not line: + continue + try: + entry = json.loads(line) + except json.JSONDecodeError: + continue + role = entry.get("role", "") + content = entry.get("content", "") + if isinstance(content, list): + content = " ".join( + b.get("text", "") for b in content + if isinstance(b, dict) and b.get("type") == "text" + ) + content = str(content)[:2000] + if role == "assistant" and not assistant_response: + assistant_response = content + elif role == "user" and not 
user_message: + user_message = content + if user_message and assistant_response: + break + except Exception: + return + + if not user_message or len(user_message.strip()) < 10: + return + + prompt = JUDGE_PROMPT.format( + user_message=user_message, + assistant_response=assistant_response[:1000], + ) + + try: + result = subprocess.run( + ["claude", "-p", prompt, "--model", "haiku"], + capture_output=True, text=True, timeout=30, + env={**os.environ, "CLAUDECODE": ""}, + ) + if result.returncode != 0: + return + response_text = result.stdout.strip() + if response_text.startswith("```"): + lines = response_text.split("\n") + lines = [l for l in lines if not l.strip().startswith("```")] + response_text = "\n".join(lines).strip() + judge_result = json.loads(response_text) + events = judge_result.get("events", []) + if not events: + return + except (subprocess.TimeoutExpired, json.JSONDecodeError, OSError): + return + + category_map = { + "correction": "preferences", + "preference": "preferences", + "decision": "decisions", + "fact": "facts", + } + + for event in events: + content = event.get("content", "") + if not content: + continue + event_type = event.get("type", "fact") + importance = max(0.0, min(1.0, float(event.get("importance", 0.7)))) + category = category_map.get(event_type, "facts") + tags = f"auto-learned,{event_type}" + expanded_keywords = event.get("expanded_keywords", "") + + try: + _store_via_homelab_cli(content, category, tags, importance, expanded_keywords) + except Exception: + pass # Never crash the async hook + + +if __name__ == "__main__": + main() diff --git a/scripts/workstation/claude-hooks/homelab-memory-recall.py b/scripts/workstation/claude-hooks/homelab-memory-recall.py new file mode 100755 index 00000000..7315f116 --- /dev/null +++ b/scripts/workstation/claude-hooks/homelab-memory-recall.py @@ -0,0 +1,70 @@ +#!/usr/bin/env python3 +"""UserPromptSubmit hook: inject relevant memories via `homelab memory recall`. 
+ +Replaces the claude-memory MCP recall path. Instead of instructing the model to +call the memory_recall MCP tool, this hook runs the homelab CLI (a direct client +to the same claude-memory HTTP API) and injects the ACTUAL results as context — +so recall is automatic, needs no model tool-call, and works with the MCP +uninstalled. Best-effort: any failure exits 0 silently (recall just doesn't +happen that turn, exactly like the MCP being unavailable). + +Wizard-only trial of the MCP deprecation (2026-06-20). Reversible: restore the +plugin command in ~/.claude/settings.json (backup: settings.json.bak-pre-homelab-memory). +""" + +import json +import os +import shutil +import subprocess +import sys + + +def main() -> None: + try: + hook_input = json.load(sys.stdin) + except (json.JSONDecodeError, EOFError): + return + + prompt = "" + if isinstance(hook_input, dict): + prompt = hook_input.get("prompt") or hook_input.get("user_prompt") or "" + if not prompt and isinstance(hook_input.get("content"), str): + prompt = hook_input["content"] + prompt = (prompt or "").strip() + + # Same gates as the original recall hook: skip short prompts, code/JSON/XML blobs. 
+ if len(prompt) < 10 or prompt[0] in "`{<": + return + + homelab = shutil.which("homelab") or "/usr/local/bin/homelab" + if not os.path.exists(homelab): + return + if not (os.environ.get("CLAUDE_MEMORY_API_KEY") or os.environ.get("MEMORY_API_KEY")): + return + + try: + res = subprocess.run( + [homelab, "memory", "recall", prompt, "--limit", "5"], + capture_output=True, text=True, timeout=4, env=os.environ, + ) + except (subprocess.TimeoutExpired, OSError): + return + + out = (res.stdout or "").strip() + if res.returncode != 0 or not out: + return + + context = ( + "Relevant stored memories (via `homelab memory recall`) — incorporate " + "naturally if useful; do NOT mention this lookup to the user:\n\n" + out + ) + print(json.dumps({ + "hookSpecificOutput": { + "hookEventName": "UserPromptSubmit", + "additionalContext": context, + } + })) + + +if __name__ == "__main__": + main() diff --git a/scripts/workstation/claude-hooks/post-compact-recovery.sh b/scripts/workstation/claude-hooks/post-compact-recovery.sh new file mode 100755 index 00000000..4687d951 --- /dev/null +++ b/scripts/workstation/claude-hooks/post-compact-recovery.sh @@ -0,0 +1,64 @@ +#!/bin/bash +# UserPromptSubmit hook: Inject recovery context after compaction +# This hook runs on each user prompt, but only injects context once after compaction. + +# Read hook input from stdin +INPUT=$(cat) + +# Extract session ID +SESSION_ID=$(echo "$INPUT" | jq -r '.session_id // .sessionId // "unknown"') + +# Define marker path +MEMORY_HOME="${MEMORY_HOME:-$HOME/.claude/claude-memory}" +MARKER_DIR="${MEMORY_HOME}/state/compaction-markers" +MARKER_FILE="${MARKER_DIR}/${SESSION_ID}.json" + +# Fast path: no marker means no recent compaction, exit immediately +if [ ! -f "$MARKER_FILE" ]; then + exit 0 +fi + +# Read marker contents +MARKER=$(cat "$MARKER_FILE") + +# Validate JSON before processing +if ! echo "$MARKER" | jq -e . 
>/dev/null 2>&1; then + rm -f "$MARKER_FILE" + exit 0 +fi + +# Extract data from marker +COMPACTED_AT=$(echo "$MARKER" | jq -r '.compactedAt // "unknown"') +PERSONALITY=$(echo "$MARKER" | jq -r '.personalityReminder // ""') + +# Build remembered facts summary (first 10 facts only) +FACTS_SUMMARY=$(echo "$MARKER" | jq -r ' + .rememberedFacts[:10] | + map("- [\(.category // "fact")] \(.content)") | + join("\n") +' 2>/dev/null || echo "") + +# Build recovery context (kept under 1000 tokens) +RECOVERY_CONTEXT="[Claude Memory Recovery - Context compacted at ${COMPACTED_AT}] + +${PERSONALITY} + +Key memories from before compaction: +${FACTS_SUMMARY} + +Run 'homelab memory recall' (homelab CLI) if you need more context about past conversations." + +# Output JSON with additional context for injection +cat << EOF +{ + "hookSpecificOutput": { + "hookEventName": "UserPromptSubmit", + "additionalContext": $(echo "$RECOVERY_CONTEXT" | jq -Rs .) + } +} +EOF + +# Delete marker file (one-time injection) +rm -f "$MARKER_FILE" + +exit 0 diff --git a/scripts/workstation/claude-hooks/pre-compact-backup.sh b/scripts/workstation/claude-hooks/pre-compact-backup.sh new file mode 100755 index 00000000..1194b12d --- /dev/null +++ b/scripts/workstation/claude-hooks/pre-compact-backup.sh @@ -0,0 +1,43 @@ +#!/bin/bash +# PreCompact hook: Save key memories before compaction +set -e + +INPUT=$(cat) +SESSION_ID=$(echo "$INPUT" | jq -r '.session_id // .sessionId // "unknown"') + +MEMORY_HOME="${MEMORY_HOME:-$HOME/.claude/claude-memory}" +MARKER_DIR="${MEMORY_HOME}/state/compaction-markers" +MEMORY_DB="${MEMORY_HOME}/memory/memory.db" +MARKER_FILE="${MARKER_DIR}/${SESSION_ID}.json" + +mkdir -p "$MARKER_DIR" + +TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ") + +# Use the API when an API key is set; otherwise fall back to SQLite +REMEMBERED_FACTS="[]" +if [ -n "${MEMORY_API_KEY:-${CLAUDE_MEMORY_API_KEY:-}}" ]; then + API_KEY="${MEMORY_API_KEY:-${CLAUDE_MEMORY_API_KEY:-}}" + API_URL="${MEMORY_API_URL:-${CLAUDE_MEMORY_API_URL:-}}" + if [ -n 
"$API_URL" ]; then + REMEMBERED_FACTS=$(curl -sf -H "Authorization: Bearer ${API_KEY}" \ + "${API_URL}/api/memories?limit=20" 2>/dev/null | \ + jq '[.memories[] | {content, category, importance}]' 2>/dev/null || echo "[]") + fi +elif [ -f "$MEMORY_DB" ]; then + REMEMBERED_FACTS=$(sqlite3 -json "$MEMORY_DB" \ + "SELECT content, category, importance FROM memories ORDER BY importance DESC, created_at DESC LIMIT 20" 2>/dev/null || echo "[]") +fi + +if ! echo "$REMEMBERED_FACTS" | jq empty 2>/dev/null; then + REMEMBERED_FACTS="[]" +fi + +jq -n \ + --arg sid "$SESSION_ID" \ + --arg ts "$TIMESTAMP" \ + --argjson facts "$REMEMBERED_FACTS" \ + '{sessionId: $sid, compactedAt: $ts, rememberedFacts: $facts}' \ + > "$MARKER_FILE" + +exit 0 diff --git a/scripts/workstation/claude-hooks/wire-memory-hooks.py b/scripts/workstation/claude-hooks/wire-memory-hooks.py new file mode 100644 index 00000000..c33b504c --- /dev/null +++ b/scripts/workstation/claude-hooks/wire-memory-hooks.py @@ -0,0 +1,90 @@ +#!/usr/bin/env python3 +"""Wire the homelab-memory hooks into a user's ~/.claude/settings.json. + +Part of the claude-memory MCP -> homelab CLI migration (all-users rollout). +Two passes, idempotent, never touching `env` (the per-user MEMORY_API_KEY) or any +other setting: + (0) PRUNE any hook command still pointing at the retired claude-memory plugin + (`plugins/claude-memory/hooks/`). install_memory() rm -rf's that dir, so + those entries are dangling — and a missing UserPromptSubmit hook exits 2, + a BLOCKING error that erases the prompt and freezes the session (devvm emo + incident 2026-06-22). Must run BEFORE the additive pass: the plugin shares + basenames with the homelab hooks, so without pruning, the "already present" + check below matches the dead plugin path and skips the real install. + (1) ADD each homelab hook group when no existing command references its script. 
+ +Usage: wire-memory-hooks.py +Exit 0 on success (changed or already-present); 1 only on an unreadable settings file. +""" +import json +import os +import sys + +home = sys.argv[1] +settings = os.path.join(home, ".claude", "settings.json") +hooks_dir = os.path.join(home, ".claude", "hooks") + +# (event, script-basename used for the if-absent check, full command, extra fields) +WANT = [ + ("PreCompact", "pre-compact-backup.sh", f"{hooks_dir}/pre-compact-backup.sh", {"timeout": 30}), + ("UserPromptSubmit", "post-compact-recovery.sh", f"{hooks_dir}/post-compact-recovery.sh", {"timeout": 10}), + ("UserPromptSubmit", "homelab-memory-recall.py", f"python3 {hooks_dir}/homelab-memory-recall.py", {"timeout": 8}), + ("Stop", "auto-learn.py", f"python3 {hooks_dir}/auto-learn.py", {"async": True}), +] + +try: + if os.path.exists(settings) and os.path.getsize(settings) > 0: + with open(settings) as fh: + data = json.load(fh) + else: + data = {} +except (json.JSONDecodeError, OSError) as e: + print(f"ERROR: cannot read {settings}: {e}", file=sys.stderr) + sys.exit(1) + +hooks = data.setdefault("hooks", {}) +changed = False + +# (0) Prune dead claude-memory plugin hooks (see module docstring). Must precede +# the additive pass so shared basenames don't mask a needed install. +DEAD_REF = "plugins/claude-memory/hooks/" +for event in list(hooks.keys()): + new_groups = [] + removed_any = False + for g in (hooks.get(event) or []): + original = g.get("hooks") or [] + kept = [h for h in original if DEAD_REF not in (h.get("command", "") or "")] + if len(kept) != len(original): + removed_any = True + if kept: + new_groups.append({**g, "hooks": kept}) + if removed_any: + changed = True + if new_groups: + hooks[event] = new_groups + else: + del hooks[event] + +# (1) Additively wire each homelab hook, if no command already references it. 
+for event, basename, command, extra in WANT: + groups = hooks.setdefault(event, []) + already = any( + basename in (h.get("command", "") or "") + for g in groups + for h in (g.get("hooks", []) or []) + ) + if already: + continue + entry = {"type": "command", "command": command} + entry.update(extra) + groups.append({"hooks": [entry]}) + changed = True + +if changed: + tmp = settings + ".tmp" + with open(tmp, "w") as fh: + json.dump(data, fh, indent=2) + os.replace(tmp, settings) + print(f"wired memory hooks -> {settings}") +else: + print(f"memory hooks already present -> {settings} (no change)") diff --git a/scripts/workstation/claude-skills/README.md b/scripts/workstation/claude-skills/README.md new file mode 100644 index 00000000..816cbcb7 --- /dev/null +++ b/scripts/workstation/claude-skills/README.md @@ -0,0 +1,31 @@ +# claude-skills — vendored agent-skill snapshot + +Point-in-time snapshot of the admin's (`wizard`) Claude Code agent skills, deployed +per-user by `install_skills()` in `../../t3-provision-users.sh` (scoped to the +`SKILL_USERS` allowlist). Each subdirectory is one skill (`SKILL.md` + any bundled +references). The provisioner copies a skill into `~/.agents/skills//` (owned by +the user) and symlinks `~/.claude/skills/ -> ../../.agents/skills/` — the +layout the `skills` CLI's `-g` install produces; Claude Code reads `~/.claude/skills/`. + +## Why vendored (not `npx skills add` at provision time) + +Upstream drifted from this set: on `mattpocock/skills` master, `diagnose` → +`diagnosing-bugs` and `write-a-skill` → `writing-great-skills` were renamed, and +`caveman` + `zoom-out` are no longer published — so `npx skills` cannot reproduce this +exact set. Vendoring is also offline/deterministic and keeps GitHub-clone + +unpinned-CLI dependencies out of the hourly **root** reconcile. 
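The per-user layout this README describes (a copy under `~/.agents/skills/` plus a relative symlink from `~/.claude/skills/`) could be sketched roughly as follows. This is a minimal illustration, not the actual `install_skills()` in `t3-provision-users.sh`: the function name `install_skill` and its arguments are assumptions, and the ownership change to the target user is omitted.

```python
import os
import shutil
from pathlib import Path


def install_skill(src: Path, home: Path) -> Path:
    """Copy one vendored skill directory into `home` and link it for Claude Code.

    Hypothetical helper for illustration only.
    """
    name = src.name
    dest = home / ".agents" / "skills" / name
    # Replace any previous copy so the install matches the snapshot exactly.
    if dest.exists():
        shutil.rmtree(dest)
    shutil.copytree(src, dest)  # creates missing parent directories

    link = home / ".claude" / "skills" / name
    link.parent.mkdir(parents=True, exist_ok=True)
    # Relative target, resolved from ~/.claude/skills/: ../../.agents/skills/<skill>
    target = os.path.join("..", "..", ".agents", "skills", name)
    if link.is_symlink():
        link.unlink()  # re-point an existing link; real directories are left alone
    os.symlink(target, link)
    return link
```

A relative target keeps the link valid if the home directory is moved or mounted elsewhere, which is presumably why the layout uses `../../` rather than an absolute path.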
+ +## Sources + +- `mattpocock/skills` (https://github.com/mattpocock/skills) — all except `find-skills` +- `vercel-labs/skills` (https://github.com/vercel-labs/skills) — `find-skills` + +## Refreshing + +Re-snapshot from a current install and commit the diff: + +```sh +cp -a ~/.agents/skills/. scripts/workstation/claude-skills/ +``` + +Snapshot taken 2026-06-23. diff --git a/scripts/workstation/claude-skills/caveman/SKILL.md b/scripts/workstation/claude-skills/caveman/SKILL.md new file mode 100644 index 00000000..85770a38 --- /dev/null +++ b/scripts/workstation/claude-skills/caveman/SKILL.md @@ -0,0 +1,49 @@ +--- +name: caveman +description: > + Ultra-compressed communication mode. Cuts token usage ~75% by dropping + filler, articles, and pleasantries while keeping full technical accuracy. + Use when user says "caveman mode", "talk like caveman", "use caveman", + "less tokens", "be brief", or invokes /caveman. +--- + +Respond terse like smart caveman. All technical substance stay. Only fluff die. + +## Persistence + +ACTIVE EVERY RESPONSE once triggered. No revert after many turns. No filler drift. Still active if unsure. Off only when user says "stop caveman" or "normal mode". + +## Rules + +Drop: articles (a/an/the), filler (just/really/basically/actually/simply), pleasantries (sure/certainly/of course/happy to), hedging. Fragments OK. Short synonyms (big not extensive, fix not "implement a solution for"). Abbreviate common terms (DB/auth/config/req/res/fn/impl). Strip conjunctions. Use arrows for causality (X -> Y). One word when one word enough. + +Technical terms stay exact. Code blocks unchanged. Errors quoted exact. + +Pattern: `[thing] [action] [reason]. [next step].` + +Not: "Sure! I'd be happy to help you with that. The issue you're experiencing is likely caused by..." +Yes: "Bug in auth middleware. Token expiry check use `<` not `<=`. Fix:" + +### Examples + +**"Why React component re-render?"** + +> Inline obj prop -> new ref -> re-render. `useMemo`. 
+ +**"Explain database connection pooling."** + +> Pool = reuse DB conn. Skip handshake -> fast under load. + +## Auto-Clarity Exception + +Drop caveman temporarily for: security warnings, irreversible action confirmations, multi-step sequences where fragment order risks misread, user asks to clarify or repeats question. Resume caveman after clear part done. + +Example -- destructive op: + +> **Warning:** This will permanently delete all rows in the `users` table and cannot be undone. +> +> ```sql +> DROP TABLE users; +> ``` +> +> Caveman resume. Verify backup exist first. diff --git a/scripts/workstation/claude-skills/diagnose/SKILL.md b/scripts/workstation/claude-skills/diagnose/SKILL.md new file mode 100644 index 00000000..ed55bda2 --- /dev/null +++ b/scripts/workstation/claude-skills/diagnose/SKILL.md @@ -0,0 +1,117 @@ +--- +name: diagnose +description: Disciplined diagnosis loop for hard bugs and performance regressions. Reproduce → minimise → hypothesise → instrument → fix → regression-test. Use when user says "diagnose this" / "debug this", reports a bug, says something is broken/throwing/failing, or describes a performance regression. +--- + +# Diagnose + +A discipline for hard bugs. Skip phases only when explicitly justified. + +When exploring the codebase, use the project's domain glossary to get a clear mental model of the relevant modules, and check ADRs in the area you're touching. + +## Phase 1 — Build a feedback loop + +**This is the skill.** Everything else is mechanical. If you have a fast, deterministic, agent-runnable pass/fail signal for the bug, you will find the cause — bisection, hypothesis-testing, and instrumentation all just consume that signal. If you don't have one, no amount of staring at code will save you. + +Spend disproportionate effort here. **Be aggressive. Be creative. Refuse to give up.** + +### Ways to construct one — try them in roughly this order + +1. 
**Failing test** at whatever seam reaches the bug — unit, integration, e2e. +2. **Curl / HTTP script** against a running dev server. +3. **CLI invocation** with a fixture input, diffing stdout against a known-good snapshot. +4. **Headless browser script** (Playwright / Puppeteer) — drives the UI, asserts on DOM/console/network. +5. **Replay a captured trace.** Save a real network request / payload / event log to disk; replay it through the code path in isolation. +6. **Throwaway harness.** Spin up a minimal subset of the system (one service, mocked deps) that exercises the bug code path with a single function call. +7. **Property / fuzz loop.** If the bug is "sometimes wrong output", run 1000 random inputs and look for the failure mode. +8. **Bisection harness.** If the bug appeared between two known states (commit, dataset, version), automate "boot at state X, check, repeat" so you can `git bisect run` it. +9. **Differential loop.** Run the same input through old-version vs new-version (or two configs) and diff outputs. +10. **HITL bash script.** Last resort. If a human must click, drive _them_ with `scripts/hitl-loop.template.sh` so the loop is still structured. Captured output feeds back to you. + +Build the right feedback loop, and the bug is 90% fixed. + +### Iterate on the loop itself + +Treat the loop as a product. Once you have _a_ loop, ask: + +- Can I make it faster? (Cache setup, skip unrelated init, narrow the test scope.) +- Can I make the signal sharper? (Assert on the specific symptom, not "didn't crash".) +- Can I make it more deterministic? (Pin time, seed RNG, isolate filesystem, freeze network.) + +A 30-second flaky loop is barely better than no loop. A 2-second deterministic loop is a debugging superpower. + +### Non-deterministic bugs + +The goal is not a clean repro but a **higher reproduction rate**. Loop the trigger 100×, parallelise, add stress, narrow timing windows, inject sleeps. 
A 50%-flake bug is debuggable; 1% is not — keep raising the rate until it's debuggable. + +### When you genuinely cannot build a loop + +Stop and say so explicitly. List what you tried. Ask the user for: (a) access to whatever environment reproduces it, (b) a captured artifact (HAR file, log dump, core dump, screen recording with timestamps), or (c) permission to add temporary production instrumentation. Do **not** proceed to hypothesise without a loop. + +Do not proceed to Phase 2 until you have a loop you believe in. + +## Phase 2 — Reproduce + +Run the loop. Watch the bug appear. + +Confirm: + +- [ ] The loop produces the failure mode the **user** described — not a different failure that happens to be nearby. Wrong bug = wrong fix. +- [ ] The failure is reproducible across multiple runs (or, for non-deterministic bugs, reproducible at a high enough rate to debug against). +- [ ] You have captured the exact symptom (error message, wrong output, slow timing) so later phases can verify the fix actually addresses it. + +Do not proceed until you reproduce the bug. + +## Phase 3 — Hypothesise + +Generate **3–5 ranked hypotheses** before testing any of them. Single-hypothesis generation anchors on the first plausible idea. + +Each hypothesis must be **falsifiable**: state the prediction it makes. + +> Format: "If is the cause, then will make the bug disappear / will make it worse." + +If you cannot state the prediction, the hypothesis is a vibe — discard or sharpen it. + +**Show the ranked list to the user before testing.** They often have domain knowledge that re-ranks instantly ("we just deployed a change to #3"), or know hypotheses they've already ruled out. Cheap checkpoint, big time saver. Don't block on it — proceed with your ranking if the user is AFK. + +## Phase 4 — Instrument + +Each probe must map to a specific prediction from Phase 3. **Change one variable at a time.** + +Tool preference: + +1. **Debugger / REPL inspection** if the env supports it. 
One breakpoint beats ten logs. +2. **Targeted logs** at the boundaries that distinguish hypotheses. +3. Never "log everything and grep". + +**Tag every debug log** with a unique prefix, e.g. `[DEBUG-a4f2]`. Cleanup at the end becomes a single grep. Untagged logs survive; tagged logs die. + +**Perf branch.** For performance regressions, logs are usually wrong. Instead: establish a baseline measurement (timing harness, `performance.now()`, profiler, query plan), then bisect. Measure first, fix second. + +## Phase 5 — Fix + regression test + +Write the regression test **before the fix** — but only if there is a **correct seam** for it. + +A correct seam is one where the test exercises the **real bug pattern** as it occurs at the call site. If the only available seam is too shallow (single-caller test when the bug needs multiple callers, unit test that can't replicate the chain that triggered the bug), a regression test there gives false confidence. + +**If no correct seam exists, that itself is the finding.** Note it. The codebase architecture is preventing the bug from being locked down. Flag this for the next phase. + +If a correct seam exists: + +1. Turn the minimised repro into a failing test at that seam. +2. Watch it fail. +3. Apply the fix. +4. Watch it pass. +5. Re-run the Phase 1 feedback loop against the original (un-minimised) scenario. 
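The five steps above might look like this in practice. This is a minimal sketch using Python's `unittest`; the `is_expired` function and its off-by-one bug are hypothetical, chosen only to illustrate writing the test before the fix.

```python
import unittest


def is_expired(now: int, expires_at: int) -> bool:
    """Hypothetical function under test.

    The buggy version compared with `>`, so a token was still accepted
    at the exact expiry second. The fix shown here uses `>=`.
    """
    return now >= expires_at


class TokenExpiryRegression(unittest.TestCase):
    def test_token_is_expired_at_exact_expiry_second(self):
        # Minimised repro: this assertion fails against the buggy `>`
        # comparison and passes once the fix is applied.
        self.assertTrue(is_expired(now=100, expires_at=100))

    def test_token_is_valid_before_expiry(self):
        # Guard against over-correcting the comparison.
        self.assertFalse(is_expired(now=99, expires_at=100))
```

Run it with `python -m unittest` once before applying the fix (to watch it fail) and once after (to watch it pass), then re-run the original feedback loop.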
+ +## Phase 6 — Cleanup + post-mortem + +Required before declaring done: + +- [ ] Original repro no longer reproduces (re-run the Phase 1 loop) +- [ ] Regression test passes (or absence of seam is documented) +- [ ] All `[DEBUG-...]` instrumentation removed (`grep` the prefix) +- [ ] Throwaway prototypes deleted (or moved to a clearly-marked debug location) +- [ ] The hypothesis that turned out correct is stated in the commit / PR message — so the next debugger learns + +**Then ask: what would have prevented this bug?** If the answer involves architectural change (no good test seam, tangled callers, hidden coupling) hand off to the `/improve-codebase-architecture` skill with the specifics. Make the recommendation **after** the fix is in, not before — you have more information now than when you started. diff --git a/scripts/workstation/claude-skills/diagnose/scripts/hitl-loop.template.sh b/scripts/workstation/claude-skills/diagnose/scripts/hitl-loop.template.sh new file mode 100644 index 00000000..40afc465 --- /dev/null +++ b/scripts/workstation/claude-skills/diagnose/scripts/hitl-loop.template.sh @@ -0,0 +1,41 @@ +#!/usr/bin/env bash +# Human-in-the-loop reproduction loop. +# Copy this file, edit the steps below, and run it. +# The agent runs the script; the user follows prompts in their terminal. +# +# Usage: +# bash hitl-loop.template.sh +# +# Two helpers: +# step "" → show instruction, wait for Enter +# capture VAR "" → show question, read response into VAR +# +# At the end, captured values are printed as KEY=VALUE for the agent to parse. + +set -euo pipefail + +step() { + printf '\n>>> %s\n' "$1" + read -r -p " [Enter when done] " _ +} + +capture() { + local var="$1" question="$2" answer + printf '\n>>> %s\n' "$question" + read -r -p " > " answer + printf -v "$var" '%s' "$answer" +} + +# --- edit below --------------------------------------------------------- + +step "Open the app at http://localhost:3000 and sign in." 
+ +capture ERRORED "Click the 'Export' button. Did it throw an error? (y/n)" + +capture ERROR_MSG "Paste the error message (or 'none'):" + +# --- edit above --------------------------------------------------------- + +printf '\n--- Captured ---\n' +printf 'ERRORED=%s\n' "$ERRORED" +printf 'ERROR_MSG=%s\n' "$ERROR_MSG" diff --git a/scripts/workstation/claude-skills/find-skills/SKILL.md b/scripts/workstation/claude-skills/find-skills/SKILL.md new file mode 100644 index 00000000..114c6637 --- /dev/null +++ b/scripts/workstation/claude-skills/find-skills/SKILL.md @@ -0,0 +1,142 @@ +--- +name: find-skills +description: Helps users discover and install agent skills when they ask questions like "how do I do X", "find a skill for X", "is there a skill that can...", or express interest in extending capabilities. This skill should be used when the user is looking for functionality that might exist as an installable skill. +--- + +# Find Skills + +This skill helps you discover and install skills from the open agent skills ecosystem. + +## When to Use This Skill + +Use this skill when the user: + +- Asks "how do I do X" where X might be a common task with an existing skill +- Says "find a skill for X" or "is there a skill for X" +- Asks "can you do X" where X is a specialized capability +- Expresses interest in extending agent capabilities +- Wants to search for tools, templates, or workflows +- Mentions they wish they had help with a specific domain (design, testing, deployment, etc.) + +## What is the Skills CLI? + +The Skills CLI (`npx skills`) is the package manager for the open agent skills ecosystem. Skills are modular packages that extend agent capabilities with specialized knowledge, workflows, and tools. 
+ +**Key commands:** + +- `npx skills find [query]` - Search for skills interactively or by keyword +- `npx skills add ` - Install a skill from GitHub or other sources +- `npx skills check` - Check for skill updates +- `npx skills update` - Update all installed skills + +**Browse skills at:** https://skills.sh/ + +## How to Help Users Find Skills + +### Step 1: Understand What They Need + +When a user asks for help with something, identify: + +1. The domain (e.g., React, testing, design, deployment) +2. The specific task (e.g., writing tests, creating animations, reviewing PRs) +3. Whether this is a common enough task that a skill likely exists + +### Step 2: Check the Leaderboard First + +Before running a CLI search, check the [skills.sh leaderboard](https://skills.sh/) to see if a well-known skill already exists for the domain. The leaderboard ranks skills by total installs, surfacing the most popular and battle-tested options. + +For example, top skills for web development include: +- `vercel-labs/agent-skills` — React, Next.js, web design (100K+ installs each) +- `anthropics/skills` — Frontend design, document processing (100K+ installs) + +### Step 3: Search for Skills + +If the leaderboard doesn't cover the user's need, run the find command: + +```bash +npx skills find [query] +``` + +For example: + +- User asks "how do I make my React app faster?" → `npx skills find react performance` +- User asks "can you help me with PR reviews?" → `npx skills find pr review` +- User asks "I need to create a changelog" → `npx skills find changelog` + +### Step 4: Verify Quality Before Recommending + +**Do not recommend a skill based solely on search results.** Always verify: + +1. **Install count** — Prefer skills with 1K+ installs. Be cautious with anything under 100. +2. **Source reputation** — Official sources (`vercel-labs`, `anthropics`, `microsoft`) are more trustworthy than unknown authors. +3. **GitHub stars** — Check the source repository. 
A skill from a repo with <100 stars should be treated with skepticism. + +### Step 5: Present Options to the User + +When you find relevant skills, present them to the user with: + +1. The skill name and what it does +2. The install count and source +3. The install command they can run +4. A link to learn more at skills.sh + +Example response: + +``` +I found a skill that might help! The "react-best-practices" skill provides +React and Next.js performance optimization guidelines from Vercel Engineering. +(185K installs) + +To install it: +npx skills add vercel-labs/agent-skills@react-best-practices + +Learn more: https://skills.sh/vercel-labs/agent-skills/react-best-practices +``` + +### Step 6: Offer to Install + +If the user wants to proceed, you can install the skill for them: + +```bash +npx skills add -g -y +``` + +The `-g` flag installs globally (user-level) and `-y` skips confirmation prompts. + +## Common Skill Categories + +When searching, consider these common categories: + +| Category | Example Queries | +| --------------- | ---------------------------------------- | +| Web Development | react, nextjs, typescript, css, tailwind | +| Testing | testing, jest, playwright, e2e | +| DevOps | deploy, docker, kubernetes, ci-cd | +| Documentation | docs, readme, changelog, api-docs | +| Code Quality | review, lint, refactor, best-practices | +| Design | ui, ux, design-system, accessibility | +| Productivity | workflow, automation, git | + +## Tips for Effective Searches + +1. **Use specific keywords**: "react testing" is better than just "testing" +2. **Try alternative terms**: If "deploy" doesn't work, try "deployment" or "ci-cd" +3. **Check popular sources**: Many skills come from `vercel-labs/agent-skills` or `ComposioHQ/awesome-claude-skills` + +## When No Skills Are Found + +If no relevant skills exist: + +1. Acknowledge that no existing skill was found +2. Offer to help with the task directly using your general capabilities +3. 
Suggest the user could create their own skill with `npx skills init` + +Example: + +``` +I searched for skills related to "xyz" but didn't find any matches. +I can still help you with this task directly! Would you like me to proceed? + +If this is something you do often, you could create your own skill: +npx skills init my-xyz-skill +``` diff --git a/scripts/workstation/claude-skills/grill-me/SKILL.md b/scripts/workstation/claude-skills/grill-me/SKILL.md new file mode 100644 index 00000000..bd04394c --- /dev/null +++ b/scripts/workstation/claude-skills/grill-me/SKILL.md @@ -0,0 +1,10 @@ +--- +name: grill-me +description: Interview the user relentlessly about a plan or design until reaching shared understanding, resolving each branch of the decision tree. Use when user wants to stress-test a plan, get grilled on their design, or mentions "grill me". +--- + +Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer. + +Ask the questions one at a time. + +If a question can be answered by exploring the codebase, explore the codebase instead. diff --git a/scripts/workstation/claude-skills/grill-with-docs/ADR-FORMAT.md b/scripts/workstation/claude-skills/grill-with-docs/ADR-FORMAT.md new file mode 100644 index 00000000..da7e78ec --- /dev/null +++ b/scripts/workstation/claude-skills/grill-with-docs/ADR-FORMAT.md @@ -0,0 +1,47 @@ +# ADR Format + +ADRs live in `docs/adr/` and use sequential numbering: `0001-slug.md`, `0002-slug.md`, etc. + +Create the `docs/adr/` directory lazily — only when the first ADR is needed. + +## Template + +```md +# {Short title of the decision} + +{1-3 sentences: what's the context, what did we decide, and why.} +``` + +That's it. An ADR can be a single paragraph. The value is in recording *that* a decision was made and *why* — not in filling out sections. 
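The sequential `NNNN-slug.md` naming and the lazy creation of `docs/adr/` can be sketched as below. This is an illustrative helper, not part of the skill: the name `next_adr_path` and the four-digit regular expression are assumptions based on the `0001-slug.md` examples.

```python
import re
from pathlib import Path

# Matches names such as 0001-event-sourced-orders.md (assumed pattern).
ADR_NAME = re.compile(r"^(\d{4})-.+\.md$")


def next_adr_path(repo_root: Path, slug: str) -> Path:
    """Return the path for the next ADR, creating docs/adr/ if needed."""
    adr_dir = repo_root / "docs" / "adr"
    # Created lazily: the directory appears only when an ADR is first needed.
    adr_dir.mkdir(parents=True, exist_ok=True)
    numbers = [
        int(match.group(1))
        for path in adr_dir.iterdir()
        if (match := ADR_NAME.match(path.name))
    ]
    # Highest existing number plus one; an empty directory starts at 0001.
    number = max(numbers, default=0) + 1
    return adr_dir / f"{number:04d}-{slug}.md"
```

Taking the highest existing number rather than the file count means gaps left by deleted ADRs never cause a number to be reused.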
+ +## Optional sections + +Only include these when they add genuine value. Most ADRs won't need them. + +- **Status** frontmatter (`proposed | accepted | deprecated | superseded by ADR-NNNN`) — useful when decisions are revisited +- **Considered Options** — only when the rejected alternatives are worth remembering +- **Consequences** — only when non-obvious downstream effects need to be called out + +## Numbering + +Scan `docs/adr/` for the highest existing number and increment by one. + +## When to offer an ADR + +All three of these must be true: + +1. **Hard to reverse** — the cost of changing your mind later is meaningful +2. **Surprising without context** — a future reader will look at the code and wonder "why on earth did they do it this way?" +3. **The result of a real trade-off** — there were genuine alternatives and you picked one for specific reasons + +If a decision is easy to reverse, skip it — you'll just reverse it. If it's not surprising, nobody will wonder why. If there was no real alternative, there's nothing to record beyond "we did the obvious thing." + +### What qualifies + +- **Architectural shape.** "We're using a monorepo." "The write model is event-sourced, the read model is projected into Postgres." +- **Integration patterns between contexts.** "Ordering and Billing communicate via domain events, not synchronous HTTP." +- **Technology choices that carry lock-in.** Database, message bus, auth provider, deployment target. Not every library — just the ones that would take a quarter to swap out. +- **Boundary and scope decisions.** "Customer data is owned by the Customer context; other contexts reference it by ID only." The explicit no-s are as valuable as the yes-s. +- **Deliberate deviations from the obvious path.** "We're using manual SQL instead of an ORM because X." Anything where a reasonable reader would assume the opposite. These stop the next engineer from "fixing" something that was deliberate. 
+- **Constraints not visible in the code.** "We can't use AWS because of compliance requirements." "Response times must be under 200ms because of the partner API contract." +- **Rejected alternatives when the rejection is non-obvious.** If you considered GraphQL and picked REST for subtle reasons, record it — otherwise someone will suggest GraphQL again in six months. diff --git a/scripts/workstation/claude-skills/grill-with-docs/CONTEXT-FORMAT.md b/scripts/workstation/claude-skills/grill-with-docs/CONTEXT-FORMAT.md new file mode 100644 index 00000000..eaf2a185 --- /dev/null +++ b/scripts/workstation/claude-skills/grill-with-docs/CONTEXT-FORMAT.md @@ -0,0 +1,60 @@ +# CONTEXT.md Format + +## Structure + +```md +# {Context Name} + +{One or two sentence description of what this context is and why it exists.} + +## Language + +**Order**: +{A one or two sentence description of the term} +_Avoid_: Purchase, transaction + +**Invoice**: +A request for payment sent to a customer after delivery. +_Avoid_: Bill, payment request + +**Customer**: +A person or organization that places orders. +_Avoid_: Client, buyer, account +``` + +## Rules + +- **Be opinionated.** When multiple words exist for the same concept, pick the best one and list the others under `_Avoid_`. +- **Keep definitions tight.** One or two sentences max. Define what it IS, not what it does. +- **Only include terms specific to this project's context.** General programming concepts (timeouts, error types, utility patterns) don't belong even if the project uses them extensively. Before adding a term, ask: is this a concept unique to this context, or a general programming concept? Only the former belongs. +- **Group terms under subheadings** when natural clusters emerge. If all terms belong to a single cohesive area, a flat list is fine. + +## Single vs multi-context repos + +**Single context (most repos):** One `CONTEXT.md` at the repo root. 
+ +**Multiple contexts:** A `CONTEXT-MAP.md` at the repo root lists the contexts, where they live, and how they relate to each other: + +```md +# Context Map + +## Contexts + +- [Ordering](./src/ordering/CONTEXT.md) — receives and tracks customer orders +- [Billing](./src/billing/CONTEXT.md) — generates invoices and processes payments +- [Fulfillment](./src/fulfillment/CONTEXT.md) — manages warehouse picking and shipping + +## Relationships + +- **Ordering → Fulfillment**: Ordering emits `OrderPlaced` events; Fulfillment consumes them to start picking +- **Fulfillment → Billing**: Fulfillment emits `ShipmentDispatched` events; Billing consumes them to generate invoices +- **Ordering ↔ Billing**: Shared types for `CustomerId` and `Money` +``` + +The skill infers which structure applies: + +- If `CONTEXT-MAP.md` exists, read it to find contexts +- If only a root `CONTEXT.md` exists, single context +- If neither exists, create a root `CONTEXT.md` lazily when the first term is resolved + +When multiple contexts exist, infer which one the current topic relates to. If unclear, ask. diff --git a/scripts/workstation/claude-skills/grill-with-docs/SKILL.md b/scripts/workstation/claude-skills/grill-with-docs/SKILL.md new file mode 100644 index 00000000..5ea0aa91 --- /dev/null +++ b/scripts/workstation/claude-skills/grill-with-docs/SKILL.md @@ -0,0 +1,88 @@ +--- +name: grill-with-docs +description: Grilling session that challenges your plan against the existing domain model, sharpens terminology, and updates documentation (CONTEXT.md, ADRs) inline as decisions crystallise. Use when user wants to stress-test a plan against their project's language and documented decisions. +--- + + + +Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer. 
+ +Ask the questions one at a time, waiting for feedback on each question before continuing. + +If a question can be answered by exploring the codebase, explore the codebase instead. + + + + + +## Domain awareness + +During codebase exploration, also look for existing documentation: + +### File structure + +Most repos have a single context: + +``` +/ +├── CONTEXT.md +├── docs/ +│ └── adr/ +│ ├── 0001-event-sourced-orders.md +│ └── 0002-postgres-for-write-model.md +└── src/ +``` + +If a `CONTEXT-MAP.md` exists at the root, the repo has multiple contexts. The map points to where each one lives: + +``` +/ +├── CONTEXT-MAP.md +├── docs/ +│ └── adr/ ← system-wide decisions +├── src/ +│ ├── ordering/ +│ │ ├── CONTEXT.md +│ │ └── docs/adr/ ← context-specific decisions +│ └── billing/ +│ ├── CONTEXT.md +│ └── docs/adr/ +``` + +Create files lazily — only when you have something to write. If no `CONTEXT.md` exists, create one when the first term is resolved. If no `docs/adr/` exists, create it when the first ADR is needed. + +## During the session + +### Challenge against the glossary + +When the user uses a term that conflicts with the existing language in `CONTEXT.md`, call it out immediately. "Your glossary defines 'cancellation' as X, but you seem to mean Y — which is it?" + +### Sharpen fuzzy language + +When the user uses vague or overloaded terms, propose a precise canonical term. "You're saying 'account' — do you mean the Customer or the User? Those are different things." + +### Discuss concrete scenarios + +When domain relationships are being discussed, stress-test them with specific scenarios. Invent scenarios that probe edge cases and force the user to be precise about the boundaries between concepts. + +### Cross-reference with code + +When the user states how something works, check whether the code agrees. If you find a contradiction, surface it: "Your code cancels entire Orders, but you just said partial cancellation is possible — which is right?" 
+ +### Update CONTEXT.md inline + +When a term is resolved, update `CONTEXT.md` right there. Don't batch these up — capture them as they happen. Use the format in [CONTEXT-FORMAT.md](./CONTEXT-FORMAT.md). + +`CONTEXT.md` should be totally devoid of implementation details. Do not treat `CONTEXT.md` as a spec, a scratch pad, or a repository for implementation decisions. It is a glossary and nothing else. + +### Offer ADRs sparingly + +Only offer to create an ADR when all three are true: + +1. **Hard to reverse** — the cost of changing your mind later is meaningful +2. **Surprising without context** — a future reader will wonder "why did they do it this way?" +3. **The result of a real trade-off** — there were genuine alternatives and you picked one for specific reasons + +If any of the three is missing, skip the ADR. Use the format in [ADR-FORMAT.md](./ADR-FORMAT.md). + + diff --git a/scripts/workstation/claude-skills/handoff/SKILL.md b/scripts/workstation/claude-skills/handoff/SKILL.md new file mode 100644 index 00000000..28bfb3ab --- /dev/null +++ b/scripts/workstation/claude-skills/handoff/SKILL.md @@ -0,0 +1,13 @@ +--- +name: handoff +description: Compact the current conversation into a handoff document for another agent to pick up. +argument-hint: "What will the next session be used for?" +--- + +Write a handoff document summarising the current conversation so a fresh agent can continue the work. Save it to a path produced by `mktemp -t handoff-XXXXXX.md` (read the file before you write to it). + +Suggest the skills to be used, if any, by the next session. + +Do not duplicate content already captured in other artifacts (PRDs, plans, ADRs, issues, commits, diffs). Reference them by path or URL instead. + +If the user passed arguments, treat them as a description of what the next session will focus on and tailor the doc accordingly. 
diff --git a/scripts/workstation/claude-skills/improve-codebase-architecture/DEEPENING.md b/scripts/workstation/claude-skills/improve-codebase-architecture/DEEPENING.md new file mode 100644 index 00000000..ecaf5d7d --- /dev/null +++ b/scripts/workstation/claude-skills/improve-codebase-architecture/DEEPENING.md @@ -0,0 +1,37 @@ +# Deepening + +How to deepen a cluster of shallow modules safely, given its dependencies. Assumes the vocabulary in [LANGUAGE.md](LANGUAGE.md) — **module**, **interface**, **seam**, **adapter**. + +## Dependency categories + +When assessing a candidate for deepening, classify its dependencies. The category determines how the deepened module is tested across its seam. + +### 1. In-process + +Pure computation, in-memory state, no I/O. Always deepenable — merge the modules and test through the new interface directly. No adapter needed. + +### 2. Local-substitutable + +Dependencies that have local test stand-ins (PGLite for Postgres, in-memory filesystem). Deepenable if the stand-in exists. The deepened module is tested with the stand-in running in the test suite. The seam is internal; no port at the module's external interface. + +### 3. Remote but owned (Ports & Adapters) + +Your own services across a network boundary (microservices, internal APIs). Define a **port** (interface) at the seam. The deep module owns the logic; the transport is injected as an **adapter**. Tests use an in-memory adapter. Production uses an HTTP/gRPC/queue adapter. + +Recommendation shape: *"Define a port at the seam, implement an HTTP adapter for production and an in-memory adapter for testing, so the logic sits in one deep module even though it's deployed across a network."* + +### 4. True external (Mock) + +Third-party services (Stripe, Twilio, etc.) you don't control. The deepened module takes the external dependency as an injected port; tests provide a mock adapter. + +## Seam discipline + +- **One adapter means a hypothetical seam. 
Two adapters means a real one.** Don't introduce a port unless at least two adapters are justified (typically production + test). A single-adapter seam is just indirection. +- **Internal seams vs external seams.** A deep module can have internal seams (private to its implementation, used by its own tests) as well as the external seam at its interface. Don't expose internal seams through the interface just because tests use them. + +## Testing strategy: replace, don't layer + +- Old unit tests on shallow modules become waste once tests at the deepened module's interface exist — delete them. +- Write new tests at the deepened module's interface. The **interface is the test surface**. +- Tests assert on observable outcomes through the interface, not internal state. +- Tests should survive internal refactors — they describe behaviour, not implementation. If a test has to change when the implementation changes, it's testing past the interface. diff --git a/scripts/workstation/claude-skills/improve-codebase-architecture/HTML-REPORT.md b/scripts/workstation/claude-skills/improve-codebase-architecture/HTML-REPORT.md new file mode 100644 index 00000000..8adc368f --- /dev/null +++ b/scripts/workstation/claude-skills/improve-codebase-architecture/HTML-REPORT.md @@ -0,0 +1,123 @@ +# HTML Report Format + +The architectural review is rendered as a single self-contained HTML file in the OS temp directory. Tailwind and Mermaid both come from CDNs. Mermaid handles graph-shaped diagrams reliably; hand-built divs and inline SVG handle the more editorial visuals (mass diagrams, cross-sections). Mix the two — don't lean on Mermaid for everything, it'll start to look generic. + +## Scaffold + +```html + + + + + Architecture review — {{repo name}} + + + + + +
+
...
+
...
+
...
+
+ + +``` + +## Header + +Repo name, date, and a compact legend: solid box = module, dashed line = seam, red arrow = leakage, thick dark box = deep module. No introduction paragraph — straight into the candidates. + +## Candidate card + +The diagrams carry the weight. Prose is sparse, plain, and uses the glossary terms ([LANGUAGE.md](LANGUAGE.md)) without ceremony. + +Each candidate is one `
`: + +- **Title** — short, names the deepening (e.g. "Collapse the Order intake pipeline"). +- **Badge row** — recommendation strength (`Strong` = emerald, `Worth exploring` = amber, `Speculative` = slate), plus a tag for the dependency category (`in-process`, `local-substitutable`, `ports & adapters`, `mock`). +- **Files** — monospaced list, `font-mono text-sm`. +- **Before / After diagram** — the centrepiece. Two columns, side by side. See patterns below. +- **Problem** — one sentence. What hurts. +- **Solution** — one sentence. What changes. +- **Wins** — bullets, ≤6 words each. e.g. "Tests hit one interface", "Pricing logic stops leaking", "Delete 4 shallow wrappers". +- **ADR callout** (if applicable) — one line in an amber-tinted box. + +No paragraphs of explanation. If the diagram needs a paragraph to be understood, redraw the diagram. + +## Diagram patterns + +Pick the pattern that fits the candidate. Mix them. Don't make every diagram look the same — variety is part of the point. + +### Mermaid graph (the workhorse for dependencies / call flow) + +Use a Mermaid `flowchart` or `graph` when the point is "X calls Y calls Z, and look at the mess." Wrap it in a Tailwind-styled card so it doesn't feel parachuted in. Style with classDef to colour leakage edges red and the deep module dark. Sequence diagrams work well for "before: 6 round-trips; after: 1." + +```html +
+
+    flowchart LR
+      A[OrderHandler] --> B[OrderValidator]
+      B --> C[OrderRepo]
+      C -.leak.-> D[PricingClient]
+      classDef leak stroke:#dc2626,stroke-width:2px;
+      class C,D leak
+  
+
+``` + +### Hand-built boxes-and-arrows (when Mermaid's layout fights you) + +Modules as `
`s with borders and labels. Arrows as inline SVG `` or `` elements positioned absolutely over a relative container. Reach for this when you want the "after" diagram to feel like one thick-bordered deep module with greyed-out internals — Mermaid won't render that with the right weight. + +### Cross-section (good for layered shallowness) + +Stack horizontal bands (`h-12 border-l-4`) to show layers a call passes through. Before: 6 thin layers each doing nothing. After: 1 thick band labelled with the consolidated responsibility. + +### Mass diagram (good for "interface as wide as implementation") + +Two rectangles per module — one for interface surface area, one for implementation. Before: interface rectangle is nearly as tall as the implementation rectangle (shallow). After: interface rectangle is short, implementation rectangle is tall (deep). + +### Call-graph collapse + +Before: a tree of function calls rendered as nested boxes. After: the same tree collapsed into one box, with the now-internal calls shown faded inside it. + +## Style guidance + +- Lean editorial, not corporate-dashboard. Generous whitespace. Serif optional for headings (`font-serif` works well with stone/slate). +- Colour sparingly: one accent (emerald or indigo) plus red for leakage and amber for warnings. +- Keep diagrams ~320px tall so before/after sits comfortably side by side without scrolling. +- Use `text-xs uppercase tracking-wider` for module labels inside diagrams — they should read as schematic, not as UI. +- The only scripts are the Tailwind CDN and the Mermaid ESM import. The report is otherwise static — no app code, no interactivity beyond Mermaid's own rendering. + +## Top recommendation section + +One larger card. Candidate name, one sentence on why, anchor link to its card. That's it. + +## Tone + +Plain English, concise — but the architectural nouns and verbs come straight from [LANGUAGE.md](LANGUAGE.md). Concision is not an excuse to drift. 
+ +**Use exactly:** module, interface, implementation, depth, deep, shallow, seam, adapter, leverage, locality. + +**Never substitute:** component, service, unit (for module) · API, signature (for interface) · boundary (for seam) · layer, wrapper (for module, when you mean module). + +**Phrasings that fit the style:** + +- "Order intake module is shallow — interface nearly matches the implementation." +- "Pricing leaks across the seam." +- "Deepen: one interface, one place to test." +- "Two adapters justify the seam: HTTP in prod, in-memory in tests." + +**Wins bullets** name the gain in glossary terms: *"locality: bugs concentrate in one module"*, *"leverage: one interface, N call sites"*, *"interface shrinks; implementation absorbs the wrappers"*. Don't write *"easier to maintain"* or *"cleaner code"* — those terms aren't in the glossary and don't earn their place. + +No hedging, no throat-clearing, no "it's worth noting that…". If a sentence could be a bullet, make it a bullet. If a bullet could be cut, cut it. If a term isn't in [LANGUAGE.md](LANGUAGE.md), reach for one that is before inventing a new one. diff --git a/scripts/workstation/claude-skills/improve-codebase-architecture/INTERFACE-DESIGN.md b/scripts/workstation/claude-skills/improve-codebase-architecture/INTERFACE-DESIGN.md new file mode 100644 index 00000000..3197723a --- /dev/null +++ b/scripts/workstation/claude-skills/improve-codebase-architecture/INTERFACE-DESIGN.md @@ -0,0 +1,44 @@ +# Interface Design + +When the user wants to explore alternative interfaces for a chosen deepening candidate, use this parallel sub-agent pattern. Based on "Design It Twice" (Ousterhout) — your first idea is unlikely to be the best. + +Uses the vocabulary in [LANGUAGE.md](LANGUAGE.md) — **module**, **interface**, **seam**, **adapter**, **leverage**. + +## Process + +### 1. 
Frame the problem space + +Before spawning sub-agents, write a user-facing explanation of the problem space for the chosen candidate: + +- The constraints any new interface would need to satisfy +- The dependencies it would rely on, and which category they fall into (see [DEEPENING.md](DEEPENING.md)) +- A rough illustrative code sketch to ground the constraints — not a proposal, just a way to make the constraints concrete + +Show this to the user, then immediately proceed to Step 2. The user reads and thinks while the sub-agents work in parallel. + +### 2. Spawn sub-agents + +Spawn 3+ sub-agents in parallel using the Agent tool. Each must produce a **radically different** interface for the deepened module. + +Prompt each sub-agent with a separate technical brief (file paths, coupling details, dependency category from [DEEPENING.md](DEEPENING.md), what sits behind the seam). The brief is independent of the user-facing problem-space explanation in Step 1. Give each agent a different design constraint: + +- Agent 1: "Minimise the interface — aim for 1–3 entry points max. Maximise leverage per entry point." +- Agent 2: "Maximise flexibility — support many use cases and extension." +- Agent 3: "Optimise for the most common caller — make the default case trivial." +- Agent 4 (if applicable): "Design around ports & adapters for cross-seam dependencies." + +Include both [LANGUAGE.md](LANGUAGE.md) vocabulary and CONTEXT.md vocabulary in the brief so each sub-agent names things consistently with the architecture language and the project's domain language. + +Each sub-agent outputs: + +1. Interface (types, methods, params — plus invariants, ordering, error modes) +2. Usage example showing how callers use it +3. What the implementation hides behind the seam +4. Dependency strategy and adapters (see [DEEPENING.md](DEEPENING.md)) +5. Trade-offs — where leverage is high, where it's thin + +### 3.
Present and compare + +Present designs sequentially so the user can absorb each one, then compare them in prose. Contrast by **depth** (leverage at the interface), **locality** (where change concentrates), and **seam placement**. + +After comparing, give your own recommendation: which design you think is strongest and why. If elements from different designs would combine well, propose a hybrid. Be opinionated — the user wants a strong read, not a menu. diff --git a/scripts/workstation/claude-skills/improve-codebase-architecture/LANGUAGE.md b/scripts/workstation/claude-skills/improve-codebase-architecture/LANGUAGE.md new file mode 100644 index 00000000..530c2763 --- /dev/null +++ b/scripts/workstation/claude-skills/improve-codebase-architecture/LANGUAGE.md @@ -0,0 +1,53 @@ +# Language + +Shared vocabulary for every suggestion this skill makes. Use these terms exactly — don't substitute "component," "service," "API," or "boundary." Consistent language is the whole point. + +## Terms + +**Module** +Anything with an interface and an implementation. Deliberately scale-agnostic — applies equally to a function, class, package, or tier-spanning slice. +_Avoid_: unit, component, service. + +**Interface** +Everything a caller must know to use the module correctly. Includes the type signature, but also invariants, ordering constraints, error modes, required configuration, and performance characteristics. +_Avoid_: API, signature (too narrow — those refer only to the type-level surface). + +**Implementation** +What's inside a module — its body of code. Distinct from **Adapter**: a thing can be a small adapter with a large implementation (a Postgres repo) or a large adapter with a small implementation (an in-memory fake). Reach for "adapter" when the seam is the topic; "implementation" otherwise. + +**Depth** +Leverage at the interface — the amount of behaviour a caller (or test) can exercise per unit of interface they have to learn. 
A module is **deep** when a large amount of behaviour sits behind a small interface. A module is **shallow** when the interface is nearly as complex as the implementation. + +**Seam** _(from Michael Feathers)_ +A place where you can alter behaviour without editing in that place. The *location* at which a module's interface lives. Choosing where to put the seam is its own design decision, distinct from what goes behind it. +_Avoid_: boundary (overloaded with DDD's bounded context). + +**Adapter** +A concrete thing that satisfies an interface at a seam. Describes *role* (what slot it fills), not substance (what's inside). + +**Leverage** +What callers get from depth. More capability per unit of interface they have to learn. One implementation pays back across N call sites and M tests. + +**Locality** +What maintainers get from depth. Change, bugs, knowledge, and verification concentrate at one place rather than spreading across callers. Fix once, fixed everywhere. + +## Principles + +- **Depth is a property of the interface, not the implementation.** A deep module can be internally composed of small, mockable, swappable parts — they just aren't part of the interface. A module can have **internal seams** (private to its implementation, used by its own tests) as well as the **external seam** at its interface. +- **The deletion test.** Imagine deleting the module. If complexity vanishes, the module wasn't hiding anything (it was a pass-through). If complexity reappears across N callers, the module was earning its keep. +- **The interface is the test surface.** Callers and tests cross the same seam. If you want to test *past* the interface, the module is probably the wrong shape. +- **One adapter means a hypothetical seam. Two adapters means a real one.** Don't introduce a seam unless something actually varies across it. + +## Relationships + +- A **Module** has exactly one **Interface** (the surface it presents to callers and tests). 
+- **Depth** is a property of a **Module**, measured against its **Interface**. +- A **Seam** is where a **Module**'s **Interface** lives. +- An **Adapter** sits at a **Seam** and satisfies the **Interface**. +- **Depth** produces **Leverage** for callers and **Locality** for maintainers. + +## Rejected framings + +- **Depth as ratio of implementation-lines to interface-lines** (Ousterhout): rewards padding the implementation. We use depth-as-leverage instead. +- **"Interface" as the TypeScript `interface` keyword or a class's public methods**: too narrow — interface here includes every fact a caller must know. +- **"Boundary"**: overloaded with DDD's bounded context. Say **seam** or **interface**. diff --git a/scripts/workstation/claude-skills/improve-codebase-architecture/SKILL.md b/scripts/workstation/claude-skills/improve-codebase-architecture/SKILL.md new file mode 100644 index 00000000..c12b263b --- /dev/null +++ b/scripts/workstation/claude-skills/improve-codebase-architecture/SKILL.md @@ -0,0 +1,81 @@ +--- +name: improve-codebase-architecture +description: Find deepening opportunities in a codebase, informed by the domain language in CONTEXT.md and the decisions in docs/adr/. Use when the user wants to improve architecture, find refactoring opportunities, consolidate tightly-coupled modules, or make a codebase more testable and AI-navigable. +--- + +# Improve Codebase Architecture + +Surface architectural friction and propose **deepening opportunities** — refactors that turn shallow modules into deep ones. The aim is testability and AI-navigability. + +## Glossary + +Use these terms exactly in every suggestion. Consistent language is the point — don't drift into "component," "service," "API," or "boundary." Full definitions in [LANGUAGE.md](LANGUAGE.md). + +- **Module** — anything with an interface and an implementation (function, class, package, slice). 
+- **Interface** — everything a caller must know to use the module: types, invariants, error modes, ordering, config. Not just the type signature. +- **Implementation** — the code inside. +- **Depth** — leverage at the interface: a lot of behaviour behind a small interface. **Deep** = high leverage. **Shallow** = interface nearly as complex as the implementation. +- **Seam** — where an interface lives; a place behaviour can be altered without editing in place. (Use this, not "boundary.") +- **Adapter** — a concrete thing satisfying an interface at a seam. +- **Leverage** — what callers get from depth. +- **Locality** — what maintainers get from depth: change, bugs, knowledge concentrated in one place. + +Key principles (see [LANGUAGE.md](LANGUAGE.md) for the full list): + +- **Deletion test**: imagine deleting the module. If complexity vanishes, it was a pass-through. If complexity reappears across N callers, it was earning its keep. +- **The interface is the test surface.** +- **One adapter = hypothetical seam. Two adapters = real seam.** + +This skill is _informed_ by the project's domain model. The domain language gives names to good seams; ADRs record decisions the skill should not re-litigate. + +## Process + +### 1. Explore + +Read the project's domain glossary and any ADRs in the area you're touching first. + +Then use the Agent tool with `subagent_type=Explore` to walk the codebase. Don't follow rigid heuristics — explore organically and note where you experience friction: + +- Where does understanding one concept require bouncing between many small modules? +- Where are modules **shallow** — interface nearly as complex as the implementation? +- Where have pure functions been extracted just for testability, but the real bugs hide in how they're called (no **locality**)? +- Where do tightly-coupled modules leak across their seams? +- Which parts of the codebase are untested, or hard to test through their current interface? 
+ +Apply the **deletion test** to anything you suspect is shallow: would deleting it concentrate complexity, or just move it? A "yes, concentrates" is the signal you want. + +### 2. Present candidates as an HTML report + +Write a self-contained HTML file to the OS temp directory so nothing lands in the repo. Resolve the temp dir from `$TMPDIR`, falling back to `/tmp` (or `%TEMP%` on Windows), and write to `/architecture-review-.html` so each run gets a fresh file. Open it for the user — `xdg-open ` on Linux, `open ` on macOS, `start ` on Windows — and tell them the absolute path. + +The report uses **Tailwind via CDN** for layout and styling, and **Mermaid via CDN** for diagrams where a graph/flow/sequence reliably communicates the structure. Mix Mermaid with hand-crafted CSS/SVG visuals — use Mermaid when relationships are graph-shaped (call graphs, dependencies, sequences), and hand-built divs/SVG when you want something more editorial (mass diagrams, cross-sections, collapse animations). Each candidate gets a **before/after visualisation**. Be visual. + +For each candidate, the same template as before, but rendered as a card: + +- **Files** — which files/modules are involved +- **Problem** — why the current architecture is causing friction +- **Solution** — plain English description of what would change +- **Benefits** — explained in terms of locality and leverage, and how tests would improve +- **Before / After diagram** — side-by-side, custom-drawn, illustrating the shallowness and the deepening +- **Recommendation strength** — one of `Strong`, `Worth exploring`, `Speculative`, rendered as a badge + +End the report with a **Top recommendation** section: which candidate you'd tackle first and why. + +**Use CONTEXT.md vocabulary for the domain, and [LANGUAGE.md](LANGUAGE.md) vocabulary for the architecture.** If `CONTEXT.md` defines "Order," talk about "the Order intake module" — not "the FooBarHandler," and not "the Order service." 
+ +**ADR conflicts**: if a candidate contradicts an existing ADR, only surface it when the friction is real enough to warrant revisiting the ADR. Mark it clearly in the card (e.g. a warning callout: _"contradicts ADR-0007 — but worth reopening because…"_). Don't list every theoretical refactor an ADR forbids. + +See [HTML-REPORT.md](HTML-REPORT.md) for the full HTML scaffold, diagram patterns, and styling guidance. + +Do NOT propose interfaces yet. After the file is written, ask the user: "Which of these would you like to explore?" + +### 3. Grilling loop + +Once the user picks a candidate, drop into a grilling conversation. Walk the design tree with them — constraints, dependencies, the shape of the deepened module, what sits behind the seam, what tests survive. + +Side effects happen inline as decisions crystallize: + +- **Naming a deepened module after a concept not in `CONTEXT.md`?** Add the term to `CONTEXT.md` — same discipline as `/grill-with-docs` (see [CONTEXT-FORMAT.md](../grill-with-docs/CONTEXT-FORMAT.md)). Create the file lazily if it doesn't exist. +- **Sharpening a fuzzy term during the conversation?** Update `CONTEXT.md` right there. +- **User rejects the candidate with a load-bearing reason?** Offer an ADR, framed as: _"Want me to record this as an ADR so future architecture reviews don't re-suggest it?"_ Only offer when the reason would actually be needed by a future explorer to avoid re-suggesting the same thing — skip ephemeral reasons ("not worth it right now") and self-evident ones. See [ADR-FORMAT.md](../grill-with-docs/ADR-FORMAT.md). +- **Want to explore alternative interfaces for the deepened module?** See [INTERFACE-DESIGN.md](INTERFACE-DESIGN.md). 
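The glossary terms above (module, interface, seam, adapter) can be made concrete with a small sketch. This is an illustration under stated assumptions, not part of the skill: `PricingPort`, `InMemoryPricing`, and `OrderIntake` are hypothetical names, and only the test adapter is shown. A production adapter (for example one that calls a pricing API over HTTP) would be the second adapter that makes the seam real rather than hypothetical.

```python
from typing import Protocol


class PricingPort(Protocol):
    """Hypothetical port at the seam: all the deep module knows about pricing."""

    def unit_price(self, sku: str) -> int: ...


class InMemoryPricing:
    """Hypothetical test adapter: prices come from a plain dict."""

    def __init__(self, prices: dict[str, int]) -> None:
        self._prices = prices

    def unit_price(self, sku: str) -> int:
        return self._prices[sku]


class OrderIntake:
    """Hypothetical deep module: one entry point, validation and totalling behind it."""

    def __init__(self, pricing: PricingPort) -> None:
        self._pricing = pricing

    def place(self, lines: dict[str, int]) -> int:
        """Return the order total in minor units, or raise ValueError."""
        if not lines:
            raise ValueError("order has no lines")
        if any(qty <= 0 for qty in lines.values()):
            raise ValueError("quantities must be positive")
        return sum(self._pricing.unit_price(sku) * qty for sku, qty in lines.items())
```

Tests and callers cross the same seam here: both construct `OrderIntake` with an adapter and call `place`, so a refactor of the validation or totalling logic should not force a test change.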
diff --git a/scripts/workstation/claude-skills/prototype/LOGIC.md b/scripts/workstation/claude-skills/prototype/LOGIC.md new file mode 100644 index 00000000..526ecb18 --- /dev/null +++ b/scripts/workstation/claude-skills/prototype/LOGIC.md @@ -0,0 +1,79 @@ +# Logic Prototype + +A tiny interactive terminal app that lets the user drive a state model by hand. Use this when the question is about **business logic, state transitions, or data shape** — the kind of thing that looks reasonable on paper but only feels wrong once you push it through real cases. + +## When this is the right shape + +- "I'm not sure if this state machine handles the edge case where X then Y." +- "Does this data model actually let me represent the case where..." +- "I want to feel out what the API should look like before writing it." +- Anything where the user wants to **press buttons and watch state change**. + +If the question is "what should this look like" — wrong branch. Use [UI.md](UI.md). + +## Process + +### 1. State the question + +Before writing code, write down what state model and what question you're prototyping. One paragraph, in the prototype's README or a comment at the top of the file. A logic prototype that answers the wrong question is pure waste — make the question explicit so it can be checked later, whether the user is watching now or returning to it AFK. + +### 2. Pick the language + +Use whatever the host project uses. If the project has no obvious runtime (e.g. a docs repo), ask. + +Match the project's existing conventions for tooling — don't add a new package manager or runtime just for the prototype. + +### 3. Isolate the logic in a portable module + +Put the actual logic — the bit that's answering the question — behind a small, pure interface that could be lifted out and dropped into the real codebase later. The TUI around it is throwaway; the logic module shouldn't be. + +The right shape depends on the question: + +- **A pure reducer** — `(state, action) => state`. 
Good when actions are discrete events and state is a single value. +- **A state machine** — explicit states and transitions. Good when "which actions are even legal right now" is part of the question. +- **A small set of pure functions** over a plain data type. Good when there's no implicit current state — just transformations. +- **A class or module with a clear method surface** when the logic genuinely owns ongoing internal state. + +Pick whichever shape best fits the question being asked, *not* whichever is easiest to wire to a TUI. Keep it pure: no I/O, no terminal code, no `console.log` for control flow. The TUI imports it and calls into it; nothing flows the other direction. + +This is what makes the prototype useful past its own lifetime. When the question's been answered, the validated reducer / machine / function set can be lifted into the real module — the TUI shell gets deleted. + +### 4. Build the smallest TUI that exposes the state + +Build it as a **lightweight TUI** — on every tick, clear the screen (`console.clear()` / `print("\033[2J\033[H")` / equivalent) and re-render the whole frame. The user should always see one stable view, not an ever-growing scrollback. + +Each frame has two parts, in this order: + +1. **Current state**, pretty-printed and diff-friendly (one field per line, or formatted JSON). Use **bold** for field names or section headers and **dim** for less important context (timestamps, IDs, derived values). Native ANSI escape codes are fine — `\x1b[1m` bold, `\x1b[2m` dim, `\x1b[0m` reset. No need to pull in a styling library unless one is already in the project. +2. **Keyboard shortcuts**, listed at the bottom: `[a] add user [d] delete user [t] tick clock [q] quit`. Bold the key, dim the description, or vice-versa — whatever reads cleanly. + +Behaviour: + +1. **Initialise state** — a single in-memory object/struct. Render the first frame on start. +2. 
**Read one keystroke (or one line)** at a time, dispatch to a handler that mutates state. +3. **Re-render** the full frame after every action — don't append, replace. +4. **Loop until quit.** + +The whole frame should fit on one screen. + +### 5. Make it runnable in one command + +Add a script to the project's existing task runner (`package.json` scripts, `Makefile`, `justfile`, `pyproject.toml`). The user should run `pnpm run ` or equivalent — never need to remember a path. + +If the host project has no task runner, just put the command at the top of the prototype's README. + +### 6. Hand it over + +Give the user the run command. They'll drive it themselves; the interesting moments are when they say "wait, that shouldn't be possible" or "huh, I assumed X would be different" — those are the bugs in the _idea_, which is the whole point. If they want new actions added, add them. Prototypes evolve. + +### 7. Capture the answer + +When the prototype has done its job, the answer to the question is the only thing worth keeping. If the user is around, ask what it taught them. If not, leave a `NOTES.md` next to the prototype so the answer can be filled in (or filled in by you, if you've watched the session) before the prototype gets deleted. + +## Anti-patterns + +- **Don't add tests.** A prototype that needs tests is no longer a prototype. +- **Don't wire it to the real database.** Use an in-memory store unless the question is specifically about persistence. +- **Don't generalise.** No "what if we wanted to support X later." The prototype answers one question. +- **Don't blur the logic and the TUI together.** If the reducer / state machine references `console.log`, prompts, or terminal escape codes, it's no longer portable. Keep the TUI as a thin shell over a pure module. +- **Don't ship the TUI shell into production.** The shell is optimised for being driven by hand from a terminal. The logic module behind it is the bit worth keeping. 
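The split described above (a pure logic module with a thin TUI shell around it) might look roughly like this in Python. It is a minimal sketch, assuming a toy state of users plus a clock; `State`, `reduce`, `KEYS`, `render_frame`, and `main` are illustrative names, not something the skill prescribes.

```python
from dataclasses import dataclass, replace

BOLD, DIM, RESET = "\x1b[1m", "\x1b[2m", "\x1b[0m"
CLEAR = "\x1b[2J\x1b[H"  # clear the screen and move the cursor home


@dataclass(frozen=True)
class State:
    users: tuple[str, ...] = ()
    clock: int = 0


def reduce(state: State, action: str) -> State:
    """Pure reducer: (state, action) -> state. No I/O, no terminal code."""
    if action == "add_user":
        return replace(state, users=state.users + (f"user-{len(state.users) + 1}",))
    if action == "delete_user":
        return replace(state, users=state.users[:-1])
    if action == "tick":
        return replace(state, clock=state.clock + 1)
    return state  # unknown actions leave the state untouched


# Keyboard shortcut -> action name (hypothetical bindings).
KEYS = {"a": "add_user", "d": "delete_user", "t": "tick"}


def render_frame(state: State) -> str:
    """Build one full frame: current state first, shortcuts at the bottom."""
    shortcuts = "  ".join(
        f"{BOLD}[{key}]{RESET} {DIM}{name}{RESET}" for key, name in KEYS.items()
    )
    lines = [
        f"{BOLD}users{RESET}: {', '.join(state.users) or '(none)'}",
        f"{BOLD}clock{RESET}: {DIM}{state.clock}{RESET}",
        "",
        f"{shortcuts}  {BOLD}[q]{RESET} {DIM}quit{RESET}",
    ]
    return CLEAR + "\n".join(lines)


def main() -> None:
    """Throwaway shell: read one line, dispatch, replace the whole frame."""
    state = State()
    while True:
        print(render_frame(state))
        key = input("> ").strip().lower()
        if key == "q":
            break
        state = reduce(state, KEYS.get(key, "noop"))
```

Calling `main()` starts the loop. Only `State` and `reduce` are meant to survive the prototype; `render_frame` and `main` are the shell that gets deleted once the question is answered.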
diff --git a/scripts/workstation/claude-skills/prototype/SKILL.md b/scripts/workstation/claude-skills/prototype/SKILL.md new file mode 100644 index 00000000..64f3e611 --- /dev/null +++ b/scripts/workstation/claude-skills/prototype/SKILL.md @@ -0,0 +1,30 @@ +--- +name: prototype +description: Build a throwaway prototype to flesh out a design before committing to it. Routes between two branches — a runnable terminal app for state/business-logic questions, or several radically different UI variations toggleable from one route. Use when the user wants to prototype, sanity-check a data model or state machine, mock up a UI, explore design options, or says "prototype this", "let me play with it", "try a few designs". +--- + +# Prototype + +A prototype is **throwaway code that answers a question**. The question decides the shape. + +## Pick a branch + +Identify which question is being answered — from the user's prompt, the surrounding code, or by asking if the user is around: + +- **"Does this logic / state model feel right?"** → [LOGIC.md](LOGIC.md). Build a tiny interactive terminal app that pushes the state machine through cases that are hard to reason about on paper. +- **"What should this look like?"** → [UI.md](UI.md). Generate several radically different UI variations on a single route, switchable via a URL search param and a floating bottom bar. + +The two branches produce very different artifacts — getting this wrong wastes the whole prototype. If the question is genuinely ambiguous and the user isn't reachable, default to whichever branch better matches the surrounding code (a backend module → logic; a page or component → UI) and state the assumption at the top of the prototype. + +## Rules that apply to both + +1. 
**Throwaway from day one, and clearly marked as such.** Locate the prototype code close to where it will actually be used (next to the module or page it's prototyping for) so context is obvious — but name it so a casual reader can see it's a prototype, not production. For throwaway UI routes, obey whatever routing convention the project already uses; don't invent a new top-level structure. +2. **One command to run.** Whatever the project's existing task runner supports — `pnpm `, `python `, `bun `, etc. The user must be able to start it without thinking. +3. **No persistence by default.** State lives in memory. Persistence is the thing the prototype is _checking_, not something it should depend on. If the question explicitly involves a database, hit a scratch DB or a local file with a clear "PROTOTYPE — wipe me" name. +4. **Skip the polish.** No tests, no error handling beyond what makes the prototype _runnable_, no abstractions. The point is to learn something fast and then delete it. +5. **Surface the state.** After every action (logic) or on every variant switch (UI), print or render the full relevant state so the user can see what changed. +6. **Delete or absorb when done.** When the prototype has answered its question, either delete it or fold the validated decision into the real code — don't leave it rotting in the repo. + +## When done + +The _answer_ is the only thing worth keeping from a prototype. Capture it somewhere durable (commit message, ADR, issue, or a `NOTES.md` next to the prototype) along with the question it was answering. If the user is around, that capture is a quick conversation; if not, leave the placeholder so they (or you, on the next pass) can fill in the verdict before deleting the prototype. 
diff --git a/scripts/workstation/claude-skills/prototype/UI.md b/scripts/workstation/claude-skills/prototype/UI.md new file mode 100644 index 00000000..f3b6e640 --- /dev/null +++ b/scripts/workstation/claude-skills/prototype/UI.md @@ -0,0 +1,112 @@ +# UI Prototype + +Generate **several radically different UI variations** on a single route, switchable from a floating bottom bar. The user flips between variants in the browser, picks one (or steals bits from each), then throws the rest away. + +If the question is about logic/state rather than what something looks like — wrong branch. Use [LOGIC.md](LOGIC.md). + +## When this is the right shape + +- "What should this page look like?" +- "I want to see a few options for this dashboard before committing." +- "Try a different layout for the settings screen." +- Any time the user would otherwise spend a day picking between three vague mockups in their head. + +## Two sub-shapes — strongly prefer sub-shape A + +A UI prototype is much easier to judge when it's **butting up against the rest of the app** — real header, real sidebar, real data, real density. A throwaway route on its own is a vacuum: every variant looks fine in isolation. Default to sub-shape A whenever there's a plausible existing page to host the variants. Only reach for sub-shape B if the prototype genuinely has no nearby home. + +### Sub-shape A — adjustment to an existing page (preferred) + +The route already exists. Variants are rendered **on the same route**, gated by a `?variant=` URL search param. The existing data fetching, params, and auth all stay — only the rendering swaps. This is the default; pick it unless there's a specific reason not to. + +If the prototype is for something that doesn't yet have a page but *would naturally live inside one* (a new section of the dashboard, a new card on the settings screen, a new step in an existing flow) — that's still sub-shape A. Mount the variants inside the host page. 
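The `?variant=` gating described for sub-shape A can be sketched as two small pure helpers. This is an assumption-laden illustration: `VARIANTS`, `resolveVariant`, and `withVariant` are hypothetical names, and a real prototype would normally go through its framework's router rather than raw query strings. The sketch falls back to the first variant on a missing or unknown key, and leaves the host page's other query params alone.

```ts
// Hypothetical variant keys; a real prototype would list its own.
const VARIANTS = ["A", "B", "C"] as const;
type VariantKey = (typeof VARIANTS)[number];

// Reads ?variant= from a query string, falling back to the first variant
// when the param is missing or not a known key.
function resolveVariant(search: string): VariantKey {
  const raw = new URLSearchParams(search).get("variant");
  for (const v of VARIANTS) {
    if (v === raw) return v;
  }
  return VARIANTS[0];
}

// Returns a query string with ?variant= set to `key`, keeping every other
// param the host page already relies on.
function withVariant(search: string, key: VariantKey): string {
  const params = new URLSearchParams(search);
  params.set("variant", key);
  return "?" + params.toString();
}

resolveVariant("?variant=B");      // "B"
resolveVariant("?variant=zzz");    // "A" (unknown key falls back)
withVariant("?tab=billing", "C");  // "?tab=billing&variant=C"
```

Keeping the resolution in one place means a bad or stale `?variant=` value shows the default variant instead of an empty subtree.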
+ +### Sub-shape B — a new page (last resort) + +Only use this when the thing being prototyped genuinely has no existing page to live inside — e.g. an entirely new top-level surface, or a flow that can't be embedded anywhere sensible. + +Create a **throwaway route** following whatever routing convention the project already uses — don't invent a new top-level structure. Name it so it's obviously a prototype (e.g. include the word `prototype` in the path or filename). Same `?variant=` pattern. + +Before committing to sub-shape B, sanity-check: is there really no existing page this could be embedded in? An empty route hides design problems that a populated one would expose. + +In both sub-shapes the floating bottom bar is identical. + +## Process + +### 1. State the question and pick N + +Default to **3 variants**. More than 5 stops being radically different and starts being noise — cap there. + +Write down the plan in one line, in the prototype's location or a top-of-file comment: + +> "Three variants of the settings page, switchable via `?variant=`, on the existing `/settings` route." + +This works whether the user is here to push back or not. + +### 2. Generate radically different variants + +Draft each variant. Hold each one to: + +- The page's purpose and the data it has access to. +- The project's component library / styling system (TailwindCSS, shadcn, MUI, plain CSS, whatever). +- A clear exported component name, e.g. `VariantA`, `VariantB`, `VariantC`. + +Variants must be **structurally different** — different layout, different information hierarchy, different primary affordance, not just different colours. Three slightly-tweaked card grids isn't a UI prototype, it's wallpaper. If two drafts come out too similar, redo one with explicit "do not use a card grid" guidance. + +### 3. Wire them together + +Create a single switcher component on the route: + +```tsx +// pseudo-code — adapt to the project's framework +const variant = searchParams.get('variant') ?? 
'A'; +return ( + <> + {variant === 'A' && <VariantA />} + {variant === 'B' && <VariantB />} + {variant === 'C' && <VariantC />} + {/* floating switcher bar from step 4 renders here */} + </> +); +``` + +For sub-shape A (existing page): keep all the existing data fetching above the switcher; only the rendered subtree changes per variant. + +For sub-shape B (new page): the throwaway route under `/prototype/` mounts the same switcher. + +### 4. Build the floating switcher + +A small fixed-position bar at the bottom-centre of the screen with three pieces: + +- **Left arrow** — cycles to the previous variant (wraps around). +- **Variant label** — shows the current variant key and, if the variant exports a name, that name too, e.g. `B — Sidebar layout`. +- **Right arrow** — cycles forward (wraps around). + +Behaviour: + +- Clicking an arrow updates the URL search param (use the framework's router — `router.replace` on Next, `navigate` on React Router, etc.) so the variant is shareable and reload-stable. +- Keyboard: `←` and `→` arrow keys also cycle. Don't intercept arrow keys when an ``, `