diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index da5f0f51..9c873a07 100755 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -24,9 +24,9 @@ Violations cause state drift, which causes future applies to break or silently revert changes. ## Instructions -- **"remember X"**: Use `memory-tool store "content" --category facts --tags "tag1,tag2"` (via exec) for persistent cross-session memory. Also update this file + `AGENTS.md` (if shared knowledge), commit with `[ci skip]`. To recall: `memory-tool recall "query"`. To list: `memory-tool list`. To delete: `memory-tool delete `. The native `memory_search` and `memory_get` tools are also available for searching indexed memory files. For **storing** new memories, always use the `memory-tool` CLI via exec. -- **Apply**: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred — handles state decrypt/encrypt) or `terragrunt` directly. `scripts/tg` adds `-auto-approve` for `--non-interactive` applies. -- **New services need CI/CD** and **monitoring** (Prometheus/Uptime Kuma) +- **"remember X"**: store to the remote claude-memory store via the **`homelab memory` CLI**: `homelab memory store "content" --category facts --tags "tag1,tag2"` (also `recall "query"` / `update ` / `list` / `delete `). For shared knowledge, also update the relevant CLAUDE.md / `AGENTS.md`. (Supersedes the old `memory-tool` CLI **and** the claude-memory MCP — both retired 2026-06-21; the homelab CLI hits the same remote HTTP API. Recall also runs automatically each turn via a UserPromptSubmit hook.) +- **Apply**: Authenticate via `vault login -method=oidc`, then use `scripts/tg` (preferred — handles state decrypt/encrypt) or `terragrunt` directly. 
`scripts/tg` adds `-auto-approve` for `--non-interactive` applies, and `-lock-timeout` (default `5m`, override via `TG_LOCK_TIMEOUT`) on every state-locking verb (`plan`/`apply`/`destroy`/`refresh`) so a contended state lock **waits** instead of failing instantly with `Error acquiring the state lock`. +- **New services need CI/CD** and **monitoring** (Prometheus/Uptime Kuma). CI = a GHA workflow on the repo's GitHub mirror (build + tests off-infra, ADR-0002); Woodpecker gets a deploy-only pipeline — never an in-cluster build. - **New service**: Use `setup-project` skill for full workflow - **Ingress**: `ingress_factory` module. **Auth** (`auth` string enum, default `"required"` — fail-closed). Pick by asking "what gates the app?": - `auth = "required"` — Authentik forward-auth gates every request. Use when the backend has **no built-in user auth** and Authentik is the only thing standing between strangers and the app (prowlarr, qbittorrent, netbox, phpipam, k8s-dashboard, any admin UI shipped without its own login). @@ -38,7 +38,7 @@ Violations cause state drift, which causes future applies to break or silently r - **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`. Smoke-test target: `echo.viktorbarzin.me` (auth=public, header-reflecting backend). - **Anubis PoW challenge** (`modules/kubernetes/anubis_instance/`): per-site reverse proxy that issues a 30-day JWT cookie after a tiny PoW solve. Use for **public, content-bearing sites without app-level auth** (blog, docs, wikis, static landing pages). Pattern: declare `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://..svc.cluster.local" }`, then in `ingress_factory` set `service_name = module.anubis.service_name`, `port = module.anubis.service_port`, `anti_ai_scraping = false`. 
Shared ed25519 key in Vault `secret/viktor` -> `anubis_ed25519_key`; cookie scoped to `viktorbarzin.me` so one solve covers all Anubis-fronted subdomains. **DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints** — clients without JS can't solve PoW. **Replicas default to 1** because Anubis stores in-flight challenges in process memory; a challenge issued by pod A and solved against pod B errors with `store: key not found` (HTTP 500). Bumping replicas requires wiring a shared Redis store (TODO). For path-level carve-outs (e.g. wrongmove has `/` behind Anubis but `/api` direct, blog has `/net-diag.sh` direct), declare a second `ingress_factory` with `ingress_path = ["/"]` pointing at the bare backend service. Active on: blog (except `/net-diag.sh`), www, kms, travel, f1, cc, json, pb (privatebin), home (homepage), wrongmove (UI only). See `.claude/reference/patterns.md` "Anti-AI Scraping" for full layering. - **Docker images**: Always build for `linux/amd64`. SHA-tag rule is being phased out — see `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`. New model: CI pushes `:latest` (optionally also `:<8-char-sha>` for traceability), Keel polls and triggers rollouts. Cache-staleness concern from the old rule is resolved at the nginx layer (URL-split — manifests pass through, blobs cached). Until Phase 1 of the migration completes (per the plan), follow the SHA-tag rule for new services to match existing pattern. -- **Private registry**: `forgejo.viktorbarzin.me/viktor/` (Forgejo packages, OAuth-style PAT auth). Use `image: forgejo.viktorbarzin.me/viktor/:` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the Secret to all namespaces. Containerd `hosts.toml` on every node redirects to in-cluster Traefik LB `10.0.20.203` (with `skip_verify = true`, since the node dials Traefik by IP but the cert is for `forgejo.viktorbarzin.me`) to avoid hairpin NAT. 
That redirect covers **kubelet pulls** only — in-cluster pods (notably Woodpecker buildkit build pods pushing images) resolve `forgejo.viktorbarzin.me` via a CoreDNS `rewrite name exact ... traefik.traefik.svc.cluster.local` (Corefile in `stacks/technitium/modules/technitium/main.tf`), since they do NOT use the node containerd mirror; without it, buildkit pushes intermittently timed out on the public-IP hairpin (added 2026-06-04, beads code-yh33). **Was `.200` until 2026-06-01** — Traefik's 2026-05-30 move to its dedicated `.203` left this redirect pointing at the now-dead `.200:443`, silently breaking every *fresh* forgejo pull (cached images kept running, so it stayed hidden until a new image tag was pulled). Redirect source lives in `modules/create-template-vm/k8s-node-containerd-setup.sh` (new nodes) and `scripts/setup-forgejo-containerd-mirror.sh` (existing nodes). Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest`; integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07. +- **Image registry**: **Owned images now live on `ghcr.io/viktorbarzin/`** (ADR-0002, built by GHA — see the CI/CD Architecture section). The **Forgejo container registry is FROZEN + emptied** (break-glass only — `docs/runbooks/forgejo-registry-breakglass.md`); nothing pushes to it. 
The rest of this bullet documents the **still-live forgejo-pull DNS/mirror machinery** (it remains in place for the break-glass path + because `registry-credentials` is still Kyverno-synced; the hairpin lessons apply to any internal-registry pull). Historical usage was `image: forgejo.viktorbarzin.me/viktor/:` + `imagePullSecrets: [{name: registry-credentials}]`. **Kubelet pulls** are kept off the hairpin **at the resolver, with zero node-side DNS config**: pfSense Unbound carries a domain override forwarding the whole `viktorbarzin.me` zone to Technitium (added 2026-06-10, `docs/runbooks/pfsense-unbound.md`), whose split-horizon zone CNAMEs every ingress host (auto-synced hourly by `technitium-ingress-dns-sync`) to the zone apex whose A record tracks the **live** Traefik LB IP (canary: `viktorbarzin-apex-probe`, alerts ViktorBarzinApexDrift). Nodes are stock — link DNS `10.0.20.1 94.140.14.14` via `qm set --nameserver`, no `/etc/hosts` pins, no resolved drop-ins (two same-day interim approaches on 2026-06-10 were removed the same day). The containerd `hosts.toml` mirror (`[host."https://10.0.20.203"]`, `skip_verify = true`) still exists but is **vestigial** — it can NOT keep pulls internal on its own: Traefik routes by Host/SNI and 404s the mirror's bare-IP requests, and the registry's Bearer auth realm is the absolute `https://forgejo.viktorbarzin.me/v2/token` URL fetched outside the mirror — without internal DNS every fresh pull degrades to public DNS → hairpin → intermittent `dial tcp 176.12.22.76:443: i/o timeout` ImagePullBackOff (tuya-bridge 7.5h outage 2026-06-10, tripit 2026-06-09; see `docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md`). 
**In-cluster pods are ordinary internal clients too** (since 2026-06-10 evening) — CoreDNS's dedicated `viktorbarzin.me:53` block (Corefile in `stacks/technitium/modules/technitium/main.tf`) forwards to the Technitium ClusterIP `10.96.0.53`, so pods get the same split-horizon answers as everyone else; forgejo stays pinned to Traefik's **ClusterIP** in that block (TF-interpolated from the live Service) so CI pushes survive a Technitium outage. This relies on a k8s-1.34 behavior verified 2026-06-10: **pods CAN reach the ETP=Local Traefik LB IP** (kube-proxy short-circuits in-cluster traffic to LB IPs via the cluster path) — re-verify after major k8s upgrades; canary = the uptime-kuma `[External]` fleet going red. (The block briefly forwarded to `8.8.8.8/1.1.1.1` earlier that day, which kept pods on the WAN IP and the broken TP-Link NAT loopback — 27 non-proxied `[External]` monitors dark; beads code-yh33.) **Was `.200` until 2026-06-01** — Traefik's 2026-05-30 move to its dedicated `.203` left the mirror pointing at the now-dead `.200:443`, silently breaking every *fresh* forgejo pull; a future LB renumber is now handled by DNS (apex record + drift probe) — only the vestigial hosts.toml literal would go stale. Mirror source lives in `modules/create-template-vm/k8s-node-containerd-setup.sh` (new nodes) and `scripts/setup-forgejo-containerd-mirror.sh` (existing nodes; also cleans up the legacy 2026-06-10 node-DNS customization). Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. 
Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest` + any buildkit `*cache*` tag — **REVERTED to DRY_RUN 2026-06-10 after its first live run orphaned OCI index children** (multi-arch/attestation children are separate *untagged* sha256 versions that sort outside the newest-10 window while their parent index is kept; broke `kms-website:latest`+`:dfc83fb`, caught by the integrity probe, healed by re-tagging latest→a794d1a + deleting the corrupt version; see `docs/post-mortems/2026-06-10-forgejo-retention-orphaned-indexes.md`). Do NOT re-enable deletes until the keep-set resolves kept indexes' child digests (or skips untagged versions, or moves to Forgejo's native container-aware cleanup rules). The registry PVC remains at its 50Gi autoresize ceiling on the HDD (we did NOT move it to SSD, see beads code-oflt), so a container-aware retention is still needed. Integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07. - **LinuxServer.io containers**: `DOCKER_MODS` runs apt-get on every start — bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts. - **Node memory changes**: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi. 
- **Node OS disk tuning** (in `stacks/infra/main.tf`): kubelet `imageGCHighThresholdPercent=70` (was 85), `imageGCLowThresholdPercent=60` (was 80), ext4 `commit=60` in fstab (was default 5s), journald `SystemMaxUse=200M` + `MaxRetentionSec=3day`. @@ -47,7 +47,7 @@ Violations cause state drift, which causes future applies to break or silently r ## Terraform State — Two-Tier Backend - **Tier 0 (bootstrap)**: Local state, SOPS-encrypted in git. Stacks: `infra`, `platform`, `cnpg`, `vault`, `dbaas`, `external-secrets`. These must exist before PG is reachable. -- **Tier 1 (everything else)**: PostgreSQL backend (`pg`) on CNPG cluster at `pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`. Native `pg_advisory_lock` for concurrent safety. Each stack gets its own PG schema. +- **Tier 1 (everything else)**: PostgreSQL backend (`pg`) on CNPG cluster at `pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`. Native `pg_advisory_lock` for concurrent safety. Each stack gets its own PG schema. **Lock contention is non-fatal**: `scripts/tg` passes `-lock-timeout` (default `5m`) so a contended lock waits rather than hard-failing — this was the #1 cause of infra CI failures (a Woodpecker-killed run's unreaped PG lock, a concurrent local apply, or the daily drift `plan`; Tier-1 stacks have no Vault advisory-lock skip to fall back on, unlike Tier-0). - **Auth**: `scripts/tg` auto-fetches PG credentials from Vault (`database/static-creds/pg-terraform-state`). Humans use `vault login -method=oidc`, agents use K8s auth (role: `terraform-state`, namespace: `claude-agent`). - **Tier 0 workflow** (unchanged): `git pull` → `scripts/tg plan` → `scripts/tg apply` → `git push`. State sync via SOPS is transparent. - **Tier 1 workflow**: `vault login -method=oidc` → `scripts/tg plan` → `scripts/tg apply`. No git commit needed — PG is authoritative. 
@@ -63,7 +63,7 @@ Violations cause state drift, which causes future applies to break or silently r - **`secret/viktor`** — go-to path for ALL personal secrets (135 keys). Contains every API key, token, password, SSH key, and config from the old terraform.tfvars. Check here first: `vault kv get -field=KEY secret/viktor`. - **Auth**: `vault login -method=oidc` (Authentik SSO) → `~/.vault-token` → read by Vault TF provider. - **Vault stack self-reads**: `data "vault_kv_secret_v2" "vault"` reads its own OIDC creds from `secret/vault`. -- **ESO (External Secrets Operator)**: `stacks/external-secrets/` — 43 ExternalSecrets + 9 DB-creds ExternalSecrets. API version `v1beta1`. Two ClusterSecretStores: `vault-kv` and `vault-database`. +- **ESO (External Secrets Operator)**: `stacks/external-secrets/` — chart **2.6.0 / app v2.6.0** (migrated 0.12.1→2.6.0 on 2026-06-22, one minor at a time; helm_release has `atomic=true`). **~104 ExternalSecrets across 73 files**, all on **API version `v1`** (migrated v1beta1→v1 on 2026-06-22 — there is NO v1beta1→v1 conversion webhook, so all CRs were rewritten to v1 on chart 0.16.2 before 0.17 removed v1beta1; see `docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md`). Two ClusterSecretStores: `vault-kv` and `vault-database`. (2 pre-existing dead ESs — instagram-poster, payslip-ingest — fail "cannot find secret data" on missing Vault keys, unrelated.) - **Plan-time pattern**: Former plan-time stacks use `data "kubernetes_secret"` to read ESO-created K8s Secrets at plan time (no Vault dependency). First-apply gotcha: must `terragrunt apply -target=kubernetes_manifest.external_secret` first, then full apply. `count` on resources using secret values fails — remove conditional counts. - **14 hybrid stacks** still keep `data "vault_kv_secret_v2"` for plan-time needs (job commands, Helm templatefile, module inputs). Platform has 48 plan-time refs — no migration possible without restructuring modules. 
- **Database rotation**: Vault DB engine rotates passwords every 7 days (604800s). MySQL: speedtest, wrongmove, codimd, nextcloud, shlink, grafana, phpipam. PostgreSQL: health, linkwarden, affine, woodpecker, claude_memory, crowdsec, technitium. Excluded: authentik (PgBouncer), root users. **Apps that read a rotated secret only at startup** (env var / initContainer, not a hot-reloaded mount) MUST carry a Reloader annotation (`secret.reloader.stakater.com/reload: `) or they keep the stale password and silently fail DB auth on each rotation until manually restarted — matrix's Synapse `inject-db-password` initContainer hit exactly this (found via Loki 2026-06-05, ~12.9k auth-fail lines/hr); matrix has since migrated to tuwunel (RocksDB, no Postgres) on 2026-06-08 and is no longer in the rotation list above. Technitium uses a password-sync CronJob (every 6h) to push rotated password to the Technitium app config via API, disable SQLite + MySQL logging, check PG plugin is loaded, configure PG query logging (90-day retention), and disable SQLite on secondary/tertiary instances. @@ -78,63 +78,123 @@ Violations cause state drift, which causes future applies to break or silently r ## Resource Management Patterns - **CPU**: All CPU limits removed cluster-wide (CFS throttling). Only set CPU requests based on actual usage. - **Memory**: Set explicit `requests=limits` based on VPA upperBound. Target: upperBound x 1.2 for stable services, x 1.3 for GPU/volatile workloads. -- **VPA (Goldilocks)**: Must be `Initial` mode (not `Auto`) — Auto conflicts with Terraform's declarative resource management. +- **Right-sizing**: VPA/Goldilocks was **REMOVED 2026-06-12** (etcd-load-reduction — 349 VPAs all ran `updateMode=Off`, costing ~800 etcd objects + continuous recommender writes + a pod-creation admission webhook for dashboard-only value). 
Right-size **on demand with `krr`** (Robusta, Dockerized from the devvm — no cluster install, no admission webhook, no eviction risk; reads Prometheus). Set container resources explicitly in TF from krr output. - **LimitRange**: Tier-based defaults silently apply to pods with `resources: {}`. Always set explicit resources on containers needing more than defaults. Tier 3-edge and 4-aux now use Burstable QoS (request < limit) to reduce scheduler pressure. - **Democratic-CSI sidecars**: Must set explicit resources (32-80Mi) in Helm values — 17 sidecars default to 256Mi each via LimitRange. `csiProxy` is a TOP-LEVEL chart key, not nested under controller/node. - **ResourceQuota blocks rolling updates**: When quota is tight, scale to 0 then back to 1 instead of RollingUpdate. Or use Recreate strategy. - **Kyverno ndots drift**: Kyverno injects dns_config on all pods. Every `kubernetes_deployment`, `kubernetes_stateful_set`, and `kubernetes_cron_job_v1` MUST include `lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1 }` (use `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config` for CronJobs). The `# KYVERNO_LIFECYCLE_V1` marker is the canonical discoverability tag — grep for it to locate every site. A shared Terraform module was considered but `ignore_changes` only accepts static attribute paths (not module outputs, locals, or expressions), so the snippet convention is the only viable path. Full rationale and copy-paste snippets in `AGENTS.md` → "Kyverno Drift Suppression". - **NVIDIA GPU operator resources**: dcgm-exporter and cuda-validator resources configurable via `dcgmExporter.resources` and `validator.resources` in nvidia values.yaml. - **Pin database versions**: Disable Diun (image update monitoring) for MySQL, PostgreSQL, Redis. -- **Quarterly right-sizing**: Check Goldilocks dashboard. Compare VPA upperBound to current request. Also check for under-provisioned (VPA upper > request x 0.8). 
+- **Quarterly right-sizing**: Run `krr` (Dockerized, against Prometheus) for recommendations; compare to current requests and adjust in TF. (Goldilocks dashboard removed 2026-06-12.) -## CI/CD Architecture — GHA Builds + Woodpecker Deploy +## CI/CD Architecture — GHA Builds → ghcr + Woodpecker Deploy -**Owned-app deploy model (build triggers the rollout — 2026-06-02):** For -self-hosted apps **we build** (Forgejo `viktor/` + Dockerfile + -`.woodpecker.yml`), the build pipeline ALSO drives the rollout — atomic + -deterministic, no wait for Keel's poll. Pattern (`build-and-push` tags `latest` -+ `${CI_COMMIT_SHA:0:8}`, then a `deploy` step): `kubectl set image -deployment/ =:${CI_COMMIT_SHA:0:8} -n ` + -`kubectl rollout status ... --timeout=300s`. The `woodpecker-agent` SA is -`cluster-admin`, so the `bitnami/kubectl` step needs no kubeconfig/RBAC (uses -its in-cluster SA). **Keel stays enrolled in parallel** as a redundant net -(finds the deployed SHA already running → no-op). Requires the Deployment to -have `ignore_changes` on `…container[0].image` (KEEL_IGNORE_IMAGE) so CI -`set image` doesn't fight `terragrunt apply`. CronJobs in owned apps use -`:latest` + `imagePullPolicy: Always` (fresh pod each run) instead of a deploy -step. **Never** `set image`/`rollout restart` operator-managed StatefulSets -(memory id=740). Reference impls: `tuya_bridge/.woodpecker.yml`, -`job-hunter`, `f1-stream` (viktor/f1-stream, extracted from this monorepo -2026-06-05). This reverses decision #12 of -`docs/plans/2026-05-16-auto-upgrade-apps-design.md` for owned (not upstream) -images. +**Doctrine (ADR-0002, fleet-wide as of 2026-06-13): ALL image builds + CI +compute run OFF-infra.** Every owned image is built/linted/tested on GitHub +Actions (public repos: free; private: 2000 free min/mo) and pushed to +`ghcr.io/viktorbarzin/`. 
**No in-cluster image builds or CI test runs +exist anywhere** — the in-cluster Woodpecker buildkit and the fallback-build +pattern were removed (clean cut). Woodpecker is **deploy-only** (plus infra +applies + maintenance crons). Canonical CI/CD reference: +`docs/architecture/ci-cd.md`; decision: `docs/adr/0002-all-image-builds-off-infra-gha-ghcr.md`. +**Watch what you trigger**: after a push that fires a build chain, follow it to +completion (GHA run → Woodpecker deploy → `rollout status`) and fix failures; +verify via live state, not the checkmark. -**Flow (GHA-migrated apps)**: `git push → GHA build+push DockerHub (8-char SHA) → POST Woodpecker API → kubectl set image` +**The fleet pattern (every owned app):** Forgejo `viktor/` (canonical) +push-mirrors (`sync_on_commit`) → GitHub `ViktorBarzin/` → GHA +`.github/workflows/build.yml` (committed on Forgejo, mirrors over): `on: push: +branches:[master]` ONLY (feature branches mirror but build/deploy nothing — the +safety valve). The `build` job: lint/test → `svu` cuts the next `vX.Y.Z` tag to +CANONICAL Forgejo (GHA secret `FORGEJO_GIT_TOKEN` = write:repository PAT) + bakes +`VERSION` → `buildx` `linux/amd64` `provenance:false` (single-manifest, dodges +the orphaned-index-children class) → push `ghcr.io/viktorbarzin/:` + +`:latest` → `delete-package-versions` keep-10. The `deploy` job POSTs +`ci.viktorbarzin.me/api/repos//pipelines` (the GitHub-mirror's Woodpecker +registration, github-forge; GHA secret `WOODPECKER_TOKEN`) with `IMAGE_TAG` + +`IMAGE_NAME` → `.woodpecker/deploy.yml` (event:**manual** ONLY, so the raw +Forgejo→GitHub mirror pushes don't fire a tag-less deploy) runs `kubectl set +image deployment/ …` in-cluster (woodpecker-agent SA = cluster-admin, no +kubeconfig). Deployment image is `ignore_changes`/KEEL_IGNORE_IMAGE so the SHA +sticks vs `terragrunt apply`; CronJobs track `:latest` + `imagePullPolicy: +Always`. **Keel stays enrolled** as a redundant net (sees the SHA already +running → no-op). 
**Never** `set image`/`rollout restart` operator-managed +StatefulSets (memory id=740). Onboarding tool: `scripts/offinfra-onboard` + +`scripts/offinfra-templates/`; mirror + workflow commits via the Forgejo API over +the internal Traefik LB (`curl --resolve forgejo.viktorbarzin.me:443:10.0.20.203`). +Reference impls: tripit (the original pilot), f1-stream, job-hunter, tuya_bridge. -**Migrated to GHA** (9): Website, k8s-portal, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints -**Woodpecker-native owned-app build** (Forgejo registry, build->deploy in one `.woodpecker.yml`): tuya_bridge, job-hunter, f1-stream (extracted to viktor/f1-stream 2026-06-05; Woodpecker repo id 166; the old github source is archived + its GHA repo-id-10 deactivated) -**Woodpecker-only**: travel_blog (1.4GB content too large for GHA), infra pipelines (terragrunt apply, certbot, build-cli — need cluster access) +**Migrated apps (issues #13–#27):** f1-stream, job-hunter, tuya_bridge, +beadboard, nextcloud-todos, claude-agent-service, **claude-memory-mcp** (GHA → +ghcr, NOT DockerHub), kms-website, Freedify, instagram-poster, payslip-ingest, +broker-sync (image `wealthfolio-sync`), fire-planner, recruiter-responder, +x402-gateway — plus tripit. Earlier public-repo apps already on GHA (Website, +apple-health-data, audiblez-web, plotting-book, insta2spotify, +audiobook-search) now also land on ghcr. +- **PUBLIC ghcr packages:** beadboard, nextcloud-todos, claude-agent-service, + claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, + chrome-service-novnc, android-emulator. +- **PRIVATE ghcr:** f1-stream, job-hunter, instagram-poster, payslip-ingest, + wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, + infra-ci, k8s-portal. 
Pulled via the Kyverno-synced `ghcr-credentials` allowlist + (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`; NOT cluster-wide; cred + = Vault `secret/viktor/ghcr_pull_token`, a dedicated classic PAT scoped to + `read:packages` (UI-minted 2026-06-15; no longer the admin `github_pat` + alias). GitHub has no token-mint API, so rotation is manual: re-mint → + `vault kv patch secret/viktor ghcr_pull_token=…` → targeted apply + `module.kyverno.kubernetes_secret.ghcr_credentials` (reads Vault, dodges the + git-crypt tls-secret-sync landmine), Kyverno re-syncs the allowlist). -**Per-project files**: -- `.github/workflows/build-and-deploy.yml` — GHA: checkout, build, push DockerHub, POST Woodpecker API -- `.woodpecker/deploy.yml` — Woodpecker: `kubectl set image` + Slack notify (event: `[manual, push]`) -- `.woodpecker/build-fallback.yml` — Old full build pipeline preserved (event: `deployment` — never auto-fires) +**Infra-owned images (issues #29/#30)** build on GHA workflows IN the infra +repo's own `.github/workflows/` (added to the GitHub lineage via PR; the +github↔forgejo divergence was deliberately NOT reconciled): +`build-chrome-service-novnc.yml` + `build-android-emulator.yml` → public ghcr; +`build-cli.yml` → DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli`; +`build-infra-ci.yml` → `ghcr.io/viktorbarzin/infra-ci`; `build-k8s-portal.yml` → +PRIVATE `ghcr.io/viktorbarzin/k8s-portal` (Keel-deployed; the LAST in-cluster +Woodpecker build, migrated 2026-06-13 — completes "no local builds"). **infra-ci** +is the image the `.woodpecker/default.yml` apply step + `drift-detection.yml` run +in (proven by pipelines 165/166). chatterbox-tts is already built by tripit's GHA → ghcr. +The Woodpecker `build-ci-image.yml` + `build-cli.yml` pipelines were REMOVED; +infra-ci break-glass is a manual `.woodpecker/breakglass-infra-ci.yml` (ghcr +pull-and-save to the registry VM). 
-**Woodpecker API**: Uses **numeric repo IDs** (`/api/repos/2/pipelines`), NOT owner/name paths (those return HTML). -Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handler=6, audiblez-web=9, plotting-book=43, claude-memory-mcp=78, infra-onboarding=79, council-complaints=TBD (f1-stream's old GHA-era github repo id 10 is deactivated; it's now a Woodpecker-native Forgejo build at repo id 166) +**Forgejo container registry: FROZEN + emptied** (issue #32 wiped all `viktor/*` +container packages). Break-glass-only now; nothing pushes. `forgejo-cleanup` +stays DRY_RUN. Pull-through caches on `10.0.20.10` are unchanged. Runbook: +`docs/runbooks/forgejo-registry-breakglass.md`. + +**Woodpecker now runs only:** per-app `deploy.yml` (manual, `kubectl set +image`), `default.yml` (terragrunt apply), `renew-tls.yml` (certbot), +maintenance crons (drift-detection, provision-user, registry-config-sync, +pve-nfs-exports-sync, issue-automation, postmortem-todos), and the +manual `breakglass-infra-ci.yml`. **No build/test pipeline on any repo — do not +(re)introduce one.** (`.woodpecker/k8s-portal.yml`, the last in-cluster image +build, was removed 2026-06-13 — k8s-portal now builds on GHA → ghcr, see +Infra-owned images above.) + +**Decommissioned (issue #31):** travel_blog (stack destroyed + dir removed), 6 +dead builders' pipelines (terminal-lobby, webhook-handler, hmrc-sync, +trading-bot, travel-agent, trip-planner), and all `build-fallback.yml` files +(only Website had one). + +**Woodpecker API**: numeric repo IDs (`/api/repos//pipelines`), NOT +owner/name (those return HTML). The deploy registration for each app is the +**GitHub mirror** repo (github-forge). Infra: Forgejo forge = repo 82, legacy +GitHub forge = repo 1. 
**Woodpecker YAML gotchas**: - Commands with `${VAR}:${VAR}` must be **quoted** — unquoted `:` triggers YAML map parsing when vars are empty - Use `bitnami/kubectl:latest` (not pinned versions — entrypoint compatibility issues) - Global secrets must have `manual` in their events list for API-triggered pipelines -**GitHub repo secrets** (set on all repos): `DOCKERHUB_USERNAME`, `DOCKERHUB_TOKEN`, `WOODPECKER_TOKEN` - -**Infra pipelines unchanged**: `default.yml` (terragrunt apply), `renew-tls.yml` (certbot cron), `build-cli.yml` (dual registry push), `k8s-portal.yml` (path-filtered build), `provision-user.yml` — all stay on Woodpecker. +**GitHub repo secrets** (per repo): `WOODPECKER_TOKEN` (POST deploy pipeline), +`FORGEJO_GIT_TOKEN` (write:repository PAT for the svu tag push). ghcr push uses +the workflow's built-in `GITHUB_TOKEN` (`packages: write`). ## Database Host -**`postgresql_host`** in `config.tfvars` is `pg-cluster-rw.dbaas.svc.cluster.local` (the CNPG primary). The legacy `postgresql.dbaas` service has no endpoints — never use it. This variable is shared by ~12 stacks. +**`postgresql_host`** in `config.tfvars` is `pg-cluster-rw.dbaas.svc.cluster.local` (the CNPG primary). The legacy `postgresql.dbaas` service is a live compatibility alias (selector `cnpg.io/instanceRole=primary`, so it also reaches the primary — authentik's PgBouncer still points at it) — but use `pg-cluster-rw` for anything new. This variable is shared by ~12 stacks. **CNPG tuning** (in `stacks/dbaas/modules/dbaas/main.tf`): `shared_buffers=512MB`, `work_mem=16MB`, `wal_compression=on`, `effective_cache_size=1536MB`, pod memory 2Gi. @@ -142,8 +202,8 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle - **Critical path services scaled to 3**: Traefik, Authentik, CrowdSec LAPI, PgBouncer, Cloudflared. - **PDBs**: minAvailable=2 on Traefik and Authentik. - **Fallback proxies**: basicAuth when Authentik is down, fail-open when poison-fountain is down. 
-- **CrowdSec bouncer**: graceful degradation mode (fail-open on error). -- **Rate limiting**: Return 429 (not 503). Per-service tuning: Immich/Nextcloud need higher limits. +- **CrowdSec enforcement is out-of-band** (no Traefik plugin/middleware — the dead Yaegi `crowdsec-bouncer-traefik-plugin` was removed on Traefik 3.7.5): banned IPs are dropped **in-kernel via nftables** by the `cs-firewall-bouncer` DaemonSet on **direct** hosts (drops in BOTH the `input` and `forward` hooks — Traefik is ETP=Local so client traffic is DNAT'd to the pod via `forward`; pulls ALL decisions incl. the ~31k CAPI blocklist), and **blocked at the Cloudflare edge** for **proxied** hosts (one `crowdsec_ban` Rules List + a zone WAF block rule, fed by the `crowdsec-cf-sync` CronJob in `rybbit` ns every 2 min — excludes CAPI). Zero per-request latency; **fails open** (LAPI down → no new bans, existing drops persist, legit traffic never blocked). Whitelist covers RFC1918 + tailnet + internal CIDRs. Full as-built: `docs/architecture/security.md`. +- **Rate limiting**: Return 429 (not 503). Per-service tuning via dedicated middleware + `skip_default_rate_limit` (default 10/s burst 50): Immich 1000/20000, ActualBudget 50/300 (app boot = ~70 parallel revalidations). - **Retry middleware**: 2 attempts, 100ms — in default ingress chain. - **Entrypoint transport timeouts** (`websecure` `respondingTimeouts`): `writeTimeout=0` (unlimited download duration), `readTimeout=3600s` (uploads ≤1h), `idleTimeout=600s`. These are **HARD total-duration caps**, not nginx-style per-read idle timeouts — a finite `writeTimeout` truncates *any* large download at that wall-clock mark (a prior `writeTimeout=60s` silently cut Immich videos at 60s). **Do NOT re-tighten `writeTimeout`**; keep `readTimeout` finite (slow-loris backstop) but ≥ longest expected upload. Full rationale: `docs/architecture/networking.md` → "Entrypoint Transport Timeouts". - **HTTP/3 (QUIC)**: Enabled on Traefik. 
Works for **direct (non-proxied) apps** via the dedicated LB IP below (ETP=Local). Proxied apps get QUIC at the Cloudflare edge. @@ -156,16 +216,19 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle |---------|--------------------------| | Nextcloud | MaxRequestWorkers=150, needs 8Gi limit (Apache transient memory spikes, see commit eb94144), very generous startup probe | | Immich | ML on SSD (CUDA), disable ModSecurity (breaks streaming), frequent upgrades. **`immich-machine-learning` MUST run with `MACHINE_LEARNING_MODEL_TTL > 0`** (set to `600` in `stacks/immich/main.tf`, env on the `immich-machine-learning` deployment). At `0`, no model ever unloads and onnxruntime's CUDA arena (OCR's dynamic input shapes inflate it to ~10 GB) is held forever on the **time-sliced T4 it shares with llama-swap/frigate/immich-server** — which has no VRAM isolation, so immich-ml starved llama-swap (qwen3-8b) and silently broke recruiter-responder triage for ~5 h on 2026-06-02 (post-mortem `docs/post-mortems/2026-06-02-immich-ml-ttl-gpu-oom-recruiter.md`). TTL>0 lets idle models (OCR, face — AND CLIP) free VRAM. The TTL is a single GLOBAL knob (no per-model pin), so CLIP would also unload after 600s idle; the `clip-keepalive` CronJob (`*/5 * * * *`, same stack) pings the CLIP textual encoder so smart-search stays warm without pinning the ad-hoc models. **Smart search has a SECOND warmth layer in Postgres** (don't conflate it with the ML model): the ~665MB vchord `clip_index` must stay resident in PG `shared_buffers`, else an ANN probe that lands on an evicted list pays a ~1.8s cold storage read vs ~4ms warm. The `postStart` hook prewarms it ONCE at pod start and `pg_prewarm.autoprewarm` only re-warms at *startup*, so the index decays out of cache over days under job buffer-pressure (observed ~33% resident after 9d uptime → slow context search, easily misattributed to the ML model). 
The `clip-index-prewarm` CronJob (`*/5`, same stack) re-runs `pg_prewarm('clip_index')` to pin it hot; `immich-search-probe` (`*/5`) measures live latency + residency → Pushgateway gauges (`immich_smart_search_db_seconds`, `immich_clip_index_cached_pct`) → alerts `ImmichSmartSearchSlow`/`ImmichClipIndexColdCache`/`ImmichSearchProbeStale` + cluster-health check #46 (`check_immich_search`). immich PG role is a superuser so the CronJobs can run `pg_prewarm`/`pg_buffercache`. **Video transcoding is GPU-accelerated**: `immich-server` is pinned to GPU node1 (nodeSelector `nvidia.com/gpu.present` + NoSchedule toleration + `gpu-workload` priority) with a time-sliced `nvidia.com/gpu=1` slice — the stock immich-server image's ffmpeg already ships h264/hevc_nvenc + NVDEC. Activated via `ffmpeg.accel=nvenc` + `accelDecode=true` in the **DB** system-config (`system_metadata` table, key `system-config`, JSONB — NOT Terraform; app config is DB-managed here like oauth/smtp). Direct DB edits need a pod **recreate** to reload (config is cached at boot; only API-driven changes broadcast a reload). **Streaming bitrate is capped** to keep 4K playback smooth on the contended HDD and over remote uplinks: `ffmpeg.maxBitrate=20000k` + `preset=medium` + `transcode=bitrate` (set 2026-06-01 — was uncapped `maxBitrate=0` + `ultrafast` + `targetResolution=original`, which produced 77–264 Mbps 4K transcodes that stuttered for every client, local and remote, since even a single stream needs ~10–13.5 MB/s off the shared `sdc` spindle). 4K resolution is preserved (`targetResolution=original`); originals are NEVER modified — only the `encoded-video/` streaming copy. 
To re-apply transcode settings to EXISTING videos (config changes only affect new/missing ones): delete the offenders' `asset_file` rows `WHERE type='encoded_video'` (derived/regenerable — never touches originals) then run videoConversion `force=false` (admin Jobs API → "Missing"); it regenerates them to the deterministic `.mp4` path at concurrency 1 (gentle on sdc). See `docs/runbooks/immich-transcode-bitrate.md`. If Immich is ever reinstalled fresh (not restored), re-set these keys (accel, accelDecode, **maxBitrate=20000k, preset=medium, transcode=bitrate**). Thumbnails/previews live on SSD NFS (sdb) — do NOT move to block storage (HDD sdc = slower + the contended IO domain). **Background-job concurrency is capped to protect sdc** (DB-managed system-config, `system_metadata` key `system-config`, JSONB `job.*.concurrency`; re-set on fresh install): `thumbnailGeneration=2`, `metadataExtraction=2`, `library=2` — these jobs read ORIGINALS off the HDD library. Left uncapped (were 8/4/4) a library-wide job (e.g. Duplicate Detection on 2026-06-01) fans the ML/thumbnail backfill out into a read storm that saturates sdc and starves etcd → apiserver down. `sidecar`/`smartSearch`/`faceDetection` stay at Immich defaults (small `.xmp` / SSD previews). Apply via Job Settings UI or the `system-config` API; **direct DB edits need an `immich-server` pod recreate to reload** (config cached at boot). See `docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md`. | -| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob | +| CrowdSec | Pin version, disable Metabase when not needed (CPU hog), LAPI scaled to 3, **DB on PostgreSQL** (migrated from MySQL), flush config: max_items=10000/max_age=7d/agents_autodelete=30d, DECISION_DURATION=168h in blocklist CronJob. 
**Enforcement is out-of-band, NOT a Traefik plugin** (the Yaegi `crowdsec-bouncer-traefik-plugin` was dead on Traefik 3.7.5 and removed): `cs-firewall-bouncer` DaemonSet drops in-kernel via nftables on direct hosts (bouncer key `firewall`, v0.0.34 binary fetched at runtime, hostNetwork+NET_ADMIN, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`); `crowdsec-cf-sync` CronJob blocks at the CF edge for proxied hosts (bouncer key `kvsync`, `stacks/rybbit/crowdsec_edge.tf`). Both fail open. See `docs/architecture/security.md` | | Frigate | GPU stall detection in liveness probe (inference speed check), high CPU | -| Authentik | 3 replicas, PgBouncer in front of PostgreSQL, strip auth headers before forwarding | +| Authentik | 3 server replicas + 2-replica embedded outpost (PG-backed sessions), PgBouncer in front of PostgreSQL, strip auth headers before forwarding. **`authentik.*` Helm values are INERT** (existingSecret skips chart env rendering) — tune via `server.env`/`worker.env` in `modules/authentik/values.yaml`. Single-screen login (password embedded in identification stage); all first-party OIDC apps use implicit consent (2026-06-10). `/static` ingress carve-out serves assets with immutable Cache-Control. | | Kyverno | failurePolicy=Ignore to prevent blocking cluster, pin chart version | | MySQL Standalone | Raw `kubernetes_stateful_set_v1` pinned to `mysql:8.4.8` exactly (migrated from InnoDB Cluster 2026-04-16; **pinned to 8.4.8 on 2026-05-18** after Keel-driven `mysql:8.4` → 8.4.9 bump stalled the DD upgrade and required a full PVC-wipe + dump-restore — see `docs/runbooks/restore-mysql.md` and beads code-eme8/code-k40p). `skip-log-bin`, `innodb_flush_log_at_trx_commit=2`, `innodb_doublewrite=ON`. ConfigMap `mysql-standalone-cnf`. PVC `data-mysql-standalone-0` (5Gi initial → 30Gi via autoresizer, `proxmox-lvm-encrypted`). Service `mysql.dbaas` unchanged. Anti-affinity excludes k8s-node1. Bitnami charts deprecated (Broadcom Aug 2025) — use official images. 
| | phpIPAM | IPAM — no active scanning. `pfsense-import` CronJob (hourly) pulls Kea leases + ARP via SSH. `dns-sync` CronJob (15min) bidirectional sync with Technitium. Kea DDNS on pfSense handles all 3 subnets. API app `claude` (ssl_token). | ## Monitoring & Alerting -- Alert cascade inhibitions: if node is down, suppress pod alerts on that node. -- Exclude completed CronJob pods from "pod not ready" alerts. +- **Alert-on-change routing** (alert-noise-reduction 2026-06-12, `route` block in `prometheus_chart_values.tpl`): warning/info notify ONCE then stay quiet while firing (`repeat_interval: 8760h` ≈ off); criticals re-ping every 6h (was 1h); `send_resolved` on. Standing state is reviewed via the daily digest, not re-pings. +- **Daily alert digest**: CronJob `alert-digest` (monitoring ns, `alert_digest.tf` + `alert_digest.py`) posts the full current board grouped by severity + resolved-in-24h to `#alerts` at 08:00 Europe/London. Stock `python:3.12-alpine`, pure-stdlib (no pip/apk at runtime — avoids the status-page-pusher disk anti-pattern, id=559); reads Alertmanager v2 + Prometheus; reuses the Alertmanager Slack webhook via the `alert-digest` Secret. Safety net for alert-on-change. +- **Cascade inhibitions** (`inhibit_rules`): `NodeDown` AND `NodeConditionBad`/`NodeDiskPressure` suppress downstream pod-churn alerts (PodCrashLooping/PodImagePullBackOff/PodsStuckContainerCreating/ScrapeTargetDown/*ReplicasMismatch); `T3ProbeLegDown` suppresses `T3ProbeDropBurst` for the same `leg`; plus existing NFS/Traefik/Authentik/Power/Tuya/iDRAC cascades. No `equal` on the node rules (pod alerts carry no `node` label → cluster-wide, like NodeDown). +- **ScrapeTargetDown scrapes only Ready endpoints** (relabel `keep __meta_kubernetes_endpoint_ready=true` on both `kubernetes-service-endpoints` jobs) — completed CronJob pods lingering as NotReady EndpointSlice addresses no longer fire phantom "down" alerts (tts/tripit/beads, id=4895). 
Replaces the old "exclude completed CronJob pods" guidance; a Ready pod with a broken metrics endpoint still fires. +- Alertmanager is now scraped (`extraScrapeConfigs` job `alertmanager`) → `alertmanager_notifications_total`/`_alerts`/`_notifications_failed_total` available; it had no `prometheus.io/scrape` annotation so notification volume was previously unmeasurable. - Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions. - **External monitoring**: `[External] ` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param). - Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence, ImmichSmartSearchSlow (context-search latency / clip_index cache eviction). @@ -177,7 +240,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle Plan in `docs/architecture/security.md` + response playbook in `docs/runbooks/security-incident.md`. Beads epic: `code-8ywc`. 
- **Identity allowlist for security rules**: ONLY `me@viktorbarzin.me`. NOT `viktor@viktorbarzin.me`, NOT `emo@viktorbarzin.me` (those don't exist). emo's identity scheme is unknown — ask before assuming. -- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. +- **Source-IP allowlist (K2, K9, V7, S1)**: `10.0.20.0/22`, `192.168.1.0/24` (Proxmox + Sofia LAN), K8s pod CIDR, K8s service CIDR, Headscale tailnet. **Policy: no public-IP access** — Vault, kube-apiserver, PVE sshd must transit LAN or Headscale. **One documented exception (2026-06-11): break-glass SSH** — PVE sshd on a WAN-exposed `:52222`, key-only, dedicated break-glass key only (`Match LocalPort`), rate-limited + fail2ban; intentionally cluster-independent so it survives an outage. As-built `docs/runbooks/breakglass-ssh.md`. (Replaced the 2026-05-30 port-knock design — circular Vault dep caused a lockout.) - **Response model**: (I) Slack-only daily skim. All security alerts via Loki ruler → Alertmanager → `#security` Slack receiver. Single channel with severity labels inside (critical/warning/info). No paging. - **Kyverno policies (wave 1)**: `deny-privileged-containers`, `deny-host-namespaces`, `restrict-sys-admin`, `require-trusted-registries` flip Audit→Enforce with the 31-namespace exclude list (memory id=1970). `failurePolicy: Ignore` preserved. Cosign `verify-images` deferred. - **NetworkPolicy default-deny egress (wave 1)**: observe-then-enforce (γ approach) — Calico flow logs cluster-wide + GlobalNetworkPolicy log-only on tier 3+4, build empirical allowlist after 1 week, phased per-namespace enforce starting `recruiter-responder`. Tier 0/1/2 deferred. @@ -292,6 +355,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" { - `/usr/local/bin/daily-backup` — Daily 05:00. 
Mounts LVM thin snapshots ro → rsyncs FILES to `/mnt/backup/pvc-data////` with `--link-dest` versioning (4 weeks). Auto SQLite backup (magic number check, `?mode=ro`). Also backs up pfSense (config.xml + tar), PVE config. Prunes snapshots >7d. **Skip-list (2026-06-01)**: `nextcloud/nextcloud-data-proxmox` (orphaned pre-encryption PV). - `/usr/local/bin/offsite-sync-backup` — Daily 06:00 (After=daily-backup). Step 1: sda → Synology `pve-backup/` (incremental via manifest; monthly full `rsync --delete` days 1–7). Step 2: NFS direct → Synology — **immich-only on BOTH `nfs/` and `nfs-ssd/` (2026-06-01)**; ollama/llamacpp on the SSD no longer ship offsite. - `/usr/local/bin/lvm-pvc-snapshot` — Daily 03:00. Thin snapshots of all PVCs except dbaas+monitoring. 7-day retention. Instant restore: `lvm-pvc-snapshot restore `. +- `/usr/local/bin/vzdump-vms` — Daily 01:00. Live `vzdump --mode snapshot` of hand-managed VMs (the ones NOT in Terraform) → `/mnt/backup/vzdump/`, keep 3 per VMID. `VZDUMP_VMIDS` default `102` (devvm) — **the only VM imaged today** (its per-user home dirs + local-only git repos, incl. the no-remote monorepo root, are otherwise irreplaceable). devvm has the guest agent (`agent: 1`) so dumps are fs-consistent. Deliberately NOT in the incremental offsite manifest (would balloon Synology); the monthly offsite full pass (days 1-7) mirrors `/mnt/backup/vzdump/`. Pushgateway job `vzdump-backup`. Added 2026-06-09 (closed the silent "VMs never imaged" DR gap). Restore: `qmrestore /mnt/backup/vzdump/vzdump-qemu--.vma.zst `. - `nfs-change-tracker.service` — Continuous inotifywait on `/srv/nfs` + `/srv/nfs-ssd`. Logs changed file paths to `/mnt/backup/.nfs-changes.log`. Consumed by offsite-sync-backup for incremental rsync (completes in seconds instead of 30+ minutes). 
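The `--link-dest` versioning that `daily-backup` relies on can be illustrated with a minimal Python sketch. This is not the script's actual implementation (that uses rsync). The `snapshot` helper and its byte-for-byte comparison are assumptions made for illustration (rsync's default quick check compares size and modification time instead). The idea shown: files unchanged since the previous snapshot become hard links into it, so every dated snapshot looks complete while only changed files consume new space.

```python
import filecmp
import os
import shutil
from pathlib import Path
from typing import Optional


def snapshot(src: Path, dest: Path, prev: Optional[Path] = None) -> None:
    """Copy the tree under src into a fresh dest directory.

    Hypothetical helper: files identical to their counterpart in prev are
    hard-linked instead of copied, mimicking the effect of --link-dest.
    """
    for root, _dirs, files in os.walk(src):
        rel = Path(root).relative_to(src)
        (dest / rel).mkdir(parents=True, exist_ok=True)
        for name in files:
            source_file = Path(root) / name
            target_file = dest / rel / name
            previous_file = prev / rel / name if prev is not None else None
            unchanged = (
                previous_file is not None
                and previous_file.is_file()
                # Assumption: full content comparison, not rsync's size+mtime check.
                and filecmp.cmp(source_file, previous_file, shallow=False)
            )
            if unchanged:
                # Same inode as the previous snapshot: no extra data blocks used.
                os.link(previous_file, target_file)
            else:
                # New or modified file: store a real copy in this snapshot.
                shutil.copy2(source_file, target_file)
```

Calling `snapshot(src, day2, prev=day1)` after `snapshot(src, day1)` leaves both directories browsable as full copies. Hard links only work within one filesystem, which is the same constraint rsync's `--link-dest` has.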
 **Synology layout** (`192.168.1.13:/volume1/Backup/Viki/`):
@@ -316,7 +380,7 @@ resource "kubernetes_persistent_volume_claim" "data_encrypted" {
 - **CrowdSec Helm upgrade times out**: `terragrunt apply` on platform stack causes CrowdSec Helm release to get stuck in `pending-upgrade`. Workaround: `helm rollback crowdsec -n crowdsec`. Root cause: likely ResourceQuota CPU at 302% preventing pods from passing readiness probes. Needs investigation.
 - **OpenClaw config is writable**: OpenClaw writes to `openclaw.json` at runtime (doctor --fix, plugin auto-enable). Never use subPath ConfigMap mounts for it — use an init container to copy into a writable volume. Needs 2Gi memory + `NODE_OPTIONS=--max-old-space-size=1536`. **`mcp.servers` baked into the ConfigMap-loaded openclaw.json gets stripped by `doctor --fix`** — register MCP servers via `openclaw mcp set ` in the container startup command instead (CLI-written entries persist across doctor runs). Current servers wired this way: `ha`, `context7`, `playwright` (sidecar at `localhost:3000/mcp`).
 - **OpenClaw memory-core indexes `/workspace/memory/`, not `/home/node/.openclaw/memory/`**: `/home/node/.openclaw/memory/main.sqlite` is the index store, NOT a content source. Files written under `/home/node/.openclaw/memory/projects//*.md` will NOT be indexed. To populate memory-core, write Markdown under `/workspace/memory/projects//` and run `openclaw memory index --force`. This is what the daily `memory-sync` CronJob in `stacks/openclaw/` does for claude-memory → OpenClaw sync.
-- **Goldilocks VPA sets limits**: When increasing memory requests, always set explicit `limits` too — Goldilocks may have added a limit that blocks the change.
+- **(Obsolete 2026-06-12) Goldilocks VPA**: VPA/Goldilocks was uninstalled (etcd-load-reduction); the old "Goldilocks may have added a limit that blocks the change" gotcha no longer applies. Use `krr` for right-sizing.
 ## User Preferences
 - **Calendar**: Nextcloud at `nextcloud.viktorbarzin.me`
diff --git a/.claude/home-assistant-sofia.py b/.claude/home-assistant-sofia.py
index b0ccdca7..d8121f6c 100644
--- a/.claude/home-assistant-sofia.py
+++ b/.claude/home-assistant-sofia.py
@@ -7,6 +7,7 @@ Control and query Home Assistant entities on ha-sofia.viktorbarzin.me.
 import argparse
 import json
 import os
+import subprocess
 import sys
 from urllib.parse import urljoin
@@ -17,13 +18,29 @@ except ImportError:
     print(" pip install requests")
     sys.exit(1)
-# Configuration from environment variables (ha-sofia specific)
-HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/")
-HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN")
-if not HA_URL or not HA_TOKEN:
-    print("ERROR: HOME_ASSISTANT_SOFIA_URL and HOME_ASSISTANT_SOFIA_TOKEN environment variables must be set.")
-    print("These should be set when activating the Claude venv (~/.venvs/claude)")
+def _token_from_homelab():
+    """Resolve the token via the homelab CLI when the env var isn't set, so the
+    script works from any directory / unprovisioned session (see ADR-0012)."""
+    try:
+        out = subprocess.run(
+            ["homelab", "ha", "token", "--instance", "sofia"],
+            capture_output=True, text=True, timeout=30)
+        if out.returncode == 0 and out.stdout.strip():
+            return out.stdout.strip()
+    except Exception:
+        pass
+    return None
+
+
+# Configuration: prefer env vars (set by the Claude venv); otherwise fall back to
+# defaults + the homelab CLI so the script is not cwd/env dependent (ADR-0012).
+HA_URL = os.environ.get("HOME_ASSISTANT_SOFIA_URL", "").rstrip("/") or "https://ha-sofia.viktorbarzin.me"
+HA_TOKEN = os.environ.get("HOME_ASSISTANT_SOFIA_TOKEN") or _token_from_homelab()
+
+if not HA_TOKEN:
+    print("ERROR: no ha-sofia API token available.")
+    print("Set HOME_ASSISTANT_SOFIA_TOKEN, or ensure `homelab ha token` works (kubeconfig reachable).")
     sys.exit(1)
 HEADERS = {
diff --git a/.claude/reference/authentik-state.md b/.claude/reference/authentik-state.md
index 1adb9176..2ff86141 100644
--- a/.claude/reference/authentik-state.md
+++ b/.claude/reference/authentik-state.md
@@ -5,17 +5,26 @@ ## Applications (11)
 | Application | Provider Type | Auth Flow |
 |-------------|--------------|-----------|
-| Cloudflare Access | OAuth2/OIDC | explicit consent |
+| Cloudflare Access | OAuth2/OIDC | implicit consent |
 | Domain wide catch all | Proxy (forward auth) | implicit consent |
-| Forgejo | OAuth2/OIDC | explicit consent |
+| Forgejo | OAuth2/OIDC | implicit consent |
 | Grafana | OAuth2/OIDC | implicit consent |
-| Headscale | OAuth2/OIDC | explicit consent |
-| Immich | OAuth2/OIDC | explicit consent |
+| Headscale | OAuth2/OIDC | implicit consent |
+| Immich | OAuth2/OIDC | implicit consent |
 | Kubernetes | OAuth2/OIDC (public) | implicit consent |
 | Kubernetes Dashboard | OAuth2/OIDC (confidential) | implicit consent |
-| linkwarden | OAuth2/OIDC | explicit consent |
+| linkwarden | OAuth2/OIDC | implicit consent |
+| Vault | OAuth2/OIDC | implicit consent |
 | wrongmove | OAuth2/OIDC | implicit consent |
+> **2026-06-10 — every provider now uses implicit consent.** Cloudflare
+> Access (pk 9), Forgejo (20), Immich (1), Headscale (13), linkwarden (8)
+> and Vault (53) were switched from
+> `default-provider-authorization-explicit-consent` via the API (these
+> providers are UI-managed, not in TF). All are first-party apps; the
+> expiring consent screen (re-shown every 4 weeks per app) only slowed
+> first-time signin.
+ > **Kubernetes Dashboard** (TF-managed in `stacks/k8s-dashboard/authentik.tf`): > confidential client `k8s-dashboard`, built for seamless dashboard SSO via > oauth2-proxy. **Currently IDLE** — the apiserver rejects all OIDC tokens (see @@ -60,8 +69,27 @@ - All sources use `invitation-enrollment` as enrollment flow (new users require invitation) ## Authorization Flows -- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen -- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects +- **Explicit consent** (`default-provider-authorization-explicit-consent`): Shows consent screen — no provider uses it since 2026-06-10 +- **Implicit consent** (`default-provider-authorization-implicit-consent`): Auto-redirects — used by ALL providers + +## Authentication Flow (single-screen login, 2026-06-10) + +`default-authentication-flow` bindings: identification (order 10) → +mfa-validation (order 30) → user-login (order 100). The identification +stage (`default-authentication-identification`, pk +`32aca5ab-106e-43f4-a4cc-4513d80e57f3`) has `password_stage` set to +`default-authentication-password`, so username + password render on ONE +screen (one round trip instead of two). The previously separate +password-stage binding at order 20 (pk `0fc677db-a23f-4ee7-8648-da342e14573b`) +was DELETED via the API — authentik requires removing it when the +identification stage embeds the password field. `password_stage` is pinned in +Terraform (`authentik_stage_identification.default_identification` in +`stacks/authentik/authentik_provider.tf`); all other stage fields stay +UI-managed via `ignore_changes`. Social-login buttons remain on the same +screen and bypass the password field, so Google/GitHub/Facebook users are +unaffected. If a future authentik upgrade/blueprint re-adds the order-20 +binding, users would briefly see a second password prompt — delete the +binding again. 
## Invitation Enrollment Flow Slug: `invitation-enrollment` | PK: `7d667321-2b02-4e16-8161-148078a8dac1` @@ -138,7 +166,8 @@ Pinned via Terraform in `stacks/authentik/`: | Knob | Value | Surface | Effect | |------|-------|---------|--------| -| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. | +| `UserLoginStage.session_duration` on `default-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_login` in `authentik_provider.tf` | Authenticated users stay logged in 4 weeks across browser restarts. No sliding refresh — resets on each login. Used by password login (`default-authentication-flow`) AND passkey login (`webauthn` flow — both terminate on this stage). | +| `UserLoginStage.session_duration` on `default-source-authentication-login` | `weeks=4` | `authentik_stage_user_login.default_source_login` in `authentik_provider.tf` (imported 2026-06-20, id `4c6977d2-…`) | **Social logins** (Google/GitHub/Facebook, via `default-source-authentication-flow`). Was the provider default `seconds=0`, which fell back to `UNAUTHENTICATED_AGE=hours=2` — so social logins expired every **2h** while password/passkey lasted 4 weeks. Pinned `weeks=4` on 2026-06-20 to make all login paths consistent. (Surfaced when the 2026-06-18 passkey wipe forced fallback to Google login → "re-login multiple times daily".) | | `ProxyProvider.access_token_validity` on `Provider for Domain wide catch all` | `weeks=4` | `authentik_provider_proxy.catchall.access_token_validity` in `authentik_provider.tf` | Cookie `Max-Age` on `authentik_proxy_*` and `expires` on rows in `authentik_providers_proxy_proxysession`. Bumped 2026-05-10 from `hours=168`. 
**Bumping requires `kubectl rollout restart deploy/ak-outpost-authentik-embedded-outpost`** — the gorilla session store binds the value once at outpost startup; the 5-min provider refresh logs `"reusing existing session store"` and skips rebuild. | | `AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE` (server + worker) | `hours=2` | `server.env` + `worker.env` in `modules/authentik/values.yaml` | Anonymous Django sessions (bots, healthcheckers, partial flows) are reaped within 2h instead of the 1d default. | @@ -149,7 +178,19 @@ Notes: - The standalone embedded-outpost deployment needs `AUTHENTIK_POSTGRESQL__{HOST,PORT,USER,PASSWORD,NAME}` env vars to reach the dbaas cluster — codified via `kubernetes_json_patches.deployment` envFrom the shared `goauthentik` Secret. The `app.kubernetes.io/component=server` pod label is also injected via JSON patch (matches the `component:server` half of the Service selector that the controller adds for embedded outposts). - `ProxyProvider.remember_me_offset` stays UI-managed via `ignore_changes`. - The Authentik provider's resource schema does **not** expose the `Outpost.managed` field. We rely on TF's "write only fields it knows about" semantic: the server-set `goauthentik.io/outposts/embedded` value is preserved across applies because Terraform never writes `managed`. Don't change the resource provider schema expectations without verifying this assumption holds. -- The `unauthenticated_age` env var is injected via `server.env` / `worker.env` (not `authentik.sessions.unauthenticated_age`) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. 
The same applies to the existing `authentik.cache.*`, `authentik.web.*`, `authentik.worker.*` blocks (currently inert; live values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced). + +## WebAuthn / Passkeys (2026-06-20) + +- **Passkey devices live in the DB, NOT Terraform** (`WebAuthnDevice` model). They are user-owned; no TF resource or blueprint manages them. Re-enroll via the user settings UI (Authentik → Settings → MFA Devices → register a security key / passkey). +- **2026-06-18 wipe (root cause of the "WebAuthn broke" incident):** all 6 of Viktor's passkeys were deleted (`WebAuthnDevice.objects.count()` → 0) at 19:27 by an **ad-hoc tripit passkey E2E test** run from the devvm (`python-httpx/0.28.1`, as `akadmin`). The test cleanup did `GET /core/users/?search={demo}` (a **fuzzy** search) then `DELETE /api/v3/authenticators/admin/webauthn/{pk}/` for each device of `users[0]` — but `users[0]` resolved to the **real** account, not the intended demo user. **Lesson:** any future passkey-test cleanup MUST exact-match the demo user (`username == demo`), never `users[0]` of a fuzzy `?search=`. It was a one-off ad-hoc script (no committed/scheduled copy), so nothing auto-re-deletes — re-enrollment is safe. +- **Passkey login path itself is intact:** the identification stage's `passwordless_flow` → `webauthn` flow (UI-managed, in `ignore_changes`); the break was purely the missing device records. +- **Provider-schema gotcha:** the pinned authentik TF provider's `authentik_stage_identification` resource exposes **no** `webauthn_stage` or `enable_remember_me` attribute (they exist on the app *model*, not in the provider schema). Do NOT add them to `ignore_changes` — `tg plan` errors `Unsupported attribute`. They are purely UI/app-managed. (Commit `4e882989` removed them for exactly this reason; re-adding breaks every apply.) 
+- ALL tuned env vars are injected via `server.env` / `worker.env` (not the `authentik.*` values block) because we set `authentik.existingSecret.secretName: goauthentik`, which makes the chart skip rendering its own `AUTHENTIK_*` Secret. The `authentik.*` value block is therefore inert in this stack — anything new under `authentik.*` must use the `*.env` arrays instead. Live base values come from the orphaned, helm-keep-policy `goauthentik` Secret created by chart 2025.10.3 before `existingSecret` was introduced. **2026-06-10:** the previously-inert tuning (`AUTHENTIK_WEB__WORKERS=3`, `AUTHENTIK_WEB__THREADS=4`, `AUTHENTIK_CACHE__TIMEOUT_FLOWS=1800`, `AUTHENTIK_CACHE__TIMEOUT_POLICIES=900`, `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60`, `AUTHENTIK_POSTGRESQL__CONN_HEALTH_CHECKS=true`, worker `AUTHENTIK_WORKER__THREADS=4`) was moved into the env arrays and is now actually live — before that, pods silently ran defaults (2 gunicorn workers, 300s caches, no persistent DB conns). +- **Outpost (2026-06-10):** `log_level=info` (was `trace` — per-request overhead on the forward-auth hot path) and `kubernetes_replicas=2` (was 1 — single-pod hot path; safe since proxy sessions live in Postgres). Both in `authentik_outpost.embedded` config. +- **Image tag is PINNED in values (`global.image.tag`), 2026-06-10:** Keel moves the authentik image between chart releases, while helm derives the tag from the chart appVersion — an unpinned helm apply silently DOWNGRADES live pods (caused the 2026-06-10 boot storm + shared-PG failover; see `docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md`). Before touching this chart, check the live image tag and refresh the pin. +- **Liveness budget (2026-06-10):** `server.livenessProbe` = 6×10s, 5s timeout (chart default 3×10s/3s kill-loops pods that queue on the DB migration advisory lock during rolling restarts). 
+- **PgBouncer (2026-06-10):** `idle_transaction_timeout=300` reaps ghost `idle in transaction` sessions (a killed pod mid-migration otherwise holds the migration advisory lock forever, serializing all boots); the deployment carries a config-checksum annotation so ini changes roll the pods. Do NOT set `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE` — session-mode PgBouncer pins persistent conns 1:1 (pool saturation). +- **Static assets (2026-06-10):** a second `ingress_factory` (`module.ingress-static`, path `/static` on the authentik host) attaches the `authentik-static-cache-headers` middleware → `Cache-Control: public, max-age=31536000, immutable`. Authentik itself serves no max-age; assets are version-fingerprinted so immutable is safe. Mainly helps split-horizon internal users (no Cloudflare edge cache on the direct path). ## Upgrade Validation Checklist @@ -161,8 +202,9 @@ Run after **any** of these: The fragile surfaces are the `kubernetes_json_patches` and the `Outpost.managed` field — both rely on assumptions that can silently break across upgrades. The checklist exercises the same path the alerts watch, so it doubles as a smoke test for the alerts. ```bash -# 1. Service routes to the outpost pod (NOT the server pods). -# Empty endpoints => auth-proxy fallback fires; expected: ONE pod IP, ports 9000/9300/9443. +# 1. Service routes to the outpost pods (NOT the server pods). +# Empty endpoints => auth-proxy fallback fires; expected: TWO pod IPs +# (kubernetes_replicas=2 since 2026-06-10), ports 9000/9300/9443. kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost # 2. Service selector still excludes the server pods. 
Expected: includes
diff --git a/.claude/reference/proxmox-inventory.md b/.claude/reference/proxmox-inventory.md
index f2f53758..5e308aee 100644
--- a/.claude/reference/proxmox-inventory.md
+++ b/.claude/reference/proxmox-inventory.md
@@ -92,19 +92,21 @@ Channel 3: A4 [32G] ──── A8 [32G] ──── A12[ 8G ] = 72 GB
 | VMID | Name | Status | CPUs | RAM | Network | Disk | Notes |
 |------|------|--------|------|-----|---------|------|-------|
 | 101 | pfsense | running | 8 | 4GB | vmbr0, vmbr1:vlan10, vmbr1:vlan20 | 32G | Gateway/firewall |
-| 102 | devvm | running | 16 | 24GB | vmbr1:vlan10 | 100G | Development VM + t3code Workstation host. 8G swapfile (swappiness=10). Capacity budget: ~4-5G RAM/active user, max ~3-4 concurrent active Claude sessions. NOT Terraform-managed. |
+| 102 | devvm | running | 16 | 24GB | vmbr1:vlan10 | 100G | Development VM + t3code Workstation host. 14G swap (8G /swapfile + 6G /swapfile2, grown 2026-06-10; swappiness=10). Capacity budget: ~4-5G RAM/active user, max ~3-4 concurrent active Claude sessions. NOT Terraform-managed. Disk controller: `virtio-scsi-single` + `scsi0 iothread=1,aio=threads` staged 2026-06-11 after the QEMU I/O stall (was `scsihw: lsi`, the only VM on the legacy path — see `docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md`); applies at next cold stop→start. |
 | 103 | home-assistant | running | 8 | 8GB | vmbr0 | 64G | HA Sofia, net0(vlan10) disabled, SSH: vbarzin@192.168.1.8 |
 | 105 | pbs | stopped | 16 | 8GB | vmbr1:vlan10 | 32G | Proxmox Backup (unused) |
-| 200 | k8s-master | running | 8 | 16GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) |
-| 201 | k8s-node1 | running | 16 | 32GB | vmbr1:vlan20 | 256G | GPU node, Tesla T4 |
-| 202 | k8s-node2 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
-| 203 | k8s-node3 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
-| 204 | k8s-node4 | running | 8 | 24GB | vmbr1:vlan20 | 256G | Worker |
+| 200 | k8s-master | running | 8 | 32GB | vmbr1:vlan20 | 64G | Control plane (10.0.20.100) |
+| 201 | k8s-node1 | running | 16 | 48GB | vmbr1:vlan20 | 256G | GPU node, Tesla T4 |
+| 202 | k8s-node2 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker |
+| 203 | k8s-node3 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker |
+| 204 | k8s-node4 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker |
+| 205 | k8s-node5 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker (10.0.20.105, joined 2026-05-26) |
+| 206 | k8s-node6 | running | 8 | 32GB | vmbr1:vlan20 | 256G | Worker (10.0.20.106, joined 2026-05-26) |
 | 220 | docker-registry | running | 4 | 4GB | vmbr1:vlan20 | 64G | MAC DE:AD:BE:EF:22:22 (10.0.20.10) |
 | 300 | Windows10 | running | 16 | 8GB | vmbr0 | 100G | Windows VM |
 | ~~9000~~ | ~~truenas~~ | **stopped/decommissioned** | — | — | — | — | NFS migrated to Proxmox host (192.168.1.127) at `/srv/nfs` and `/srv/nfs-ssd` |
-**Total VM RAM allocated**: 196 GB of 272 GB (72%) — 76 GB free for future VMs (devvm corrected 8GB→24GB 2026-06-08)
+**Total VM RAM allocated**: ~288 GB nominal across running VMs vs 272 GB physical — OVERCOMMITTED (ballooning enabled on K8s workers, host swap in use; see memory id=535/2543). K8s rows live-verified via `kubectl get nodes` capacity 2026-06-11 (master 32G, node1 48G, node2-6 32G; the old 16/32/24GB figures predated the 2026-04-02 resize and node5/6).
 ## VM Templates
 | VMID | Name | Purpose |
diff --git a/.claude/reference/service-catalog.md b/.claude/reference/service-catalog.md
index 633b227f..cd7b5274 100644
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
@@ -32,7 +32,7 @@
 |---------|-------------|-------|
 | k8s-dashboard | Kubernetes dashboard at `k8s.viktorbarzin.me`. **Forward-auth + auto-injected SA token** (apiserver OIDC blocked, see design §12). nginx token-injector (`dashboard_injector.tf`) maps `X-authentik-username` → the user's `dashboard-` SA token (ns admin + read-only on namespace-list/nodes only via `dashboard-nav-readonly` — no cross-tenant reads, `rbac/.../dashboard-sa.tf`; admins → cluster-admin SA) and sets `Authorization: Bearer` → no token-paste, dashboard auto-authenticates per user. Forward-auth admits `kubernetes-*` groups for this host (`stacks/authentik/admin-services-restriction.tf`). oauth2-proxy + `k8s-dashboard` OIDC app built but idle. | k8s-dashboard |
 | reverse-proxy | Generic reverse proxy | reverse-proxy |
-| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. `auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie).
**Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. **t3 binary tracks `nightly`** via `t3-autoupdate` (daily systemd timer; health-check + auto-rollback on a bad build; restarts only idle instances) — so new models (e.g. Opus 4.8) land as t3 ships them. Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. | t3code | +| t3code | Multi-user coding-agent GUI at t3.viktorbarzin.me. 
`auth=required` (Authentik) → DevVM `t3-dispatch` service (`10.0.10.10:3780`, unprivileged user) maps `X-authentik-username` → that user's own `t3-serve@` instance (file perms enforced by uid; wizard→:3773, emo→:3774; unmapped→403) and **auto-injects the t3 session on first visit** (mints via the root `t3-mint` wrapper, scoped sudoers → `/api/auth/bootstrap` `t3_session` cookie). **Source of truth = `infra/scripts/workstation/roster.yaml`** (os_user → authentik_user/k8s_user/tier/namespaces); `roster_engine.py` (pytest-covered) derives desired state and `t3-provision-users` (hourly systemd timer) applies it — constrained accounts, additive per-tier groups, `t3-serve@` instances, and **regenerating** `/etc/ttyd-user-map` + `dispatch.json` (those two are now GENERATED — do not hand-edit). New non-admins inherit wizard's Claude config (machine-wide managed `claudeMd` in `/etc/claude-code/managed-settings.json` + per-user `~/.claude/{skills,rules}` symlinks seeded by `/etc/skel`) and get a **writable git-crypt-LOCKED** infra clone at `~/code` (code plaintext, secret files ciphertext). Tiers: admin / power-user (cluster-wide read-only) / namespace-owner. **Add a user:** one entry in `roster.yaml` → reconcile. Per-user OIDC kubeconfig, the `oidc-power-user-readonly` ClusterRole, and the Authentik `T3 Users` edge gate are applied (the gate is live — only `T3 Users` members reach t3); the emo cutover to his own locked clone is the remaining gated step. DevVM artifacts versioned in `infra/scripts/` (`t3-serve@.service`, `t3-provision-users` + `workstation/{roster.yaml,roster_engine.py,setup-devvm.sh,managed-settings.json,skel/}`, `t3-dispatch/`, `t3-mint`, `sudoers-t3-autopair`, `t3-autoupdate.*`, `t3-safe-restart.sh`, `t3-migrate-idle.*`); TF (`stacks/t3code`) owns only the ingress + Endpoints→:3780. 
**t3 AUTO-TRACKS the `nightly` npm dist-tag** (Viktor 2026-06-16, reversing the post-2026-06-09 pin; churn risk accepted) — `t3-autoupdate` is a daily GATED tracker that follows `t3@nightly` but gates every bump so a bad build self-heals: downgrade-guard → pre-bump `VACUUM INTO` backup → health-check that SEEDS a copy of a real POPULATED `state.sqlite` to exercise the forward migration + the real mint→exchange→`t3_session` pairing handshake → canary-restart idle instances ONE AT A TIME with per-instance dispatch pairing verify → auto-rollback to last-good + self-freeze on failure (active-agent instances deferred, never killed; last-good in `/var/lib/t3-autoupdate/last-good`). **Deferred instances are drained overnight by `t3-migrate-idle.timer`** (every 20 min 01:00–05:40): it restarts a still-stale `t3-serve@` onto the current binary only when that user's `state.sqlite` shows no in-flight turn (`active_turn_id`) + ≥15 min quiet (`T3_MIGRATE_QUIET_SECONDS`), via the shared `t3-safe-restart.sh` (the same backup→restart→verify→recover helper the canary uses) — fixing the chronic skew where a user busy at every 04:00 window never migrated and saw "Client and server versions differ". The 2026-06-09 outage was the SAME nightly channel WITHOUT these gates. Freeze/revert now: `sudo touch /etc/t3-autoupdate.freeze` (or set `T3_PIN=` to hard-pin); preview a build with `T3_DRY_RUN=1`. Channel via `T3_TRACK` in `t3-autoupdate.sh` + `setup-devvm.sh` (keep in sync). Full ops + manual rollback: `docs/runbooks/t3-version-bump.md`. `t3-dispatch` is **version-agnostic** (2026-06-09): `autoPair` tries `/api/auth/browser-session` (0.0.25) then falls back to `/api/auth/bootstrap` (0.0.24), so 0.0.24↔0.0.25 needs no dispatch change. `~/.t3` is backed up daily by `t3-backup-state` (online `VACUUM INTO`; previously unbacked — it's the only copy). Native app/app.t3.codes unsupported (cross-origin) — deferred until published. Design: `docs/plans/2026-06-01-t3-auto-provision-*`. 
**Drop attribution (2026-06-10):** `t3-probe` Deployment (same ns) holds differential legs — `cloudflare` (full public path via DoH-pinned DNS), `internal` (Traefik LB only), `t3serve` (devvm:3773 direct) — against dispatch's unauthenticated `/probe` carve-out (walloff-guarded); Prometheus job `t3-probe`, alerts `T3ProbeLegDown`/`T3ProbeDropBurst`, runbook `docs/runbooks/t3-drop-attribution.md`. `t3-serve@` units carry memory containment (`MemoryHigh=12G/MemoryMax=16G/MemorySwapMax=0/OOMPolicy=continue`) so a runaway agent OOMs alone instead of freezing devvm. **Connection logs (2026-06-11):** `t3-dispatch` logs every `/ws` open/close with `dur_ms` + `cause` (`downstream_closed`=client/CF/Traefik hung up → last-mile; `upstream_closed`=t3-serve closed; `graceful`); devvm journald now ships to Loki via `scripts/devvm-promtail.*` (`{job="devvm-journal"}` + `{job="sshd-devvm"}`), joining Traefik `/ws`-duration + cloudflared close events already in Loki for full per-drop attribution without a repro. **Empirical (2026-06-11):** direct-to-t3-serve held one WS 40 min (0 drops) while a real tunnel session cycled 5×/90s → drop originates above t3-serve on the public path, NOT in t3-serve itself; `t3 auth pairing create`+`/api/auth/browser-session` works, and dispatch **auto-pair was re-verified healthy on the live pin 2026-06-16** (cookieless `X-authentik-username` → 302 + `t3_session`) — the earlier transient 401 note no longer reproduces, and the new dispatch pairing logs + `T3PairingBroken`/`T3PairFallbackHigh` Loki alerts now watch pairing continuously. 
| t3code | ## Active Use | Service | Description | Stack | @@ -41,13 +41,15 @@ | shadowsocks | Proxy | shadowsocks | | webhook_handler | Webhook processing | webhook_handler | | tuya-bridge | Smart home bridge | tuya-bridge | +| android-emulator | Shared Android 16 test emulator (adb 10.0.20.200:5555, noVNC android-emulator.viktorbarzin.lan) | android-emulator | +| anisette | Self-hosted Apple anisette-data server (Dadoum/anisette-v3-server, digest-pinned) for sideloading the TripIt iOS Shell via SideStore; internal-only http://anisette.viktorbarzin.lan, auth=none, LAN-only, stateless | anisette | | dawarich | Location history | dawarich | | owntracks | Location tracking | owntracks | | nextcloud | File sync/share | nextcloud | | calibre | E-book management (may be merged into ebooks stack) | calibre | | onlyoffice | Document editing | onlyoffice | -| f1-stream | F1 streaming (uses chrome-service for hmembeds verifier); source in own repo `viktor/f1-stream` (Forgejo, extracted 2026-06-05), Woodpecker-native build->deploy (repo id 166) | f1-stream | -| chrome-service | Headed Chromium WebSocket pool (`ws://chrome-service.chrome-service.svc:3000/`) for sibling services driving anti-bot embeds | chrome-service | +| f1-stream | F1 streaming (uses chrome-service for hmembeds verifier); canonical source in own repo `viktor/f1-stream` (Forgejo, extracted 2026-06-05); GHA-built → `ghcr.io/viktorbarzin/f1-stream` (private), Woodpecker deploy-only (ADR-0002) | f1-stream | +| chrome-service | Headed Chromium over CDP (`http://chrome-service.chrome-service.svc:9222`, `connect_over_cdp`; legacy `:3000/` WS pool removed 2026-06-04) for sibling services driving anti-bot pages — snapshot-harvester CronJob + tripit fare scrape | chrome-service | | rybbit | Analytics | rybbit | | isponsorblocktv | SponsorBlock for TV | isponsorblocktv | | actualbudget | Budgeting (factory pattern) | actualbudget | @@ -55,7 +57,7 @@ | trading-bot | Event-driven trading with sentiment analysis | 
trading-bot | | claude-memory | Persistent memory MCP server | claude-memory | | paperless-mcp | Paperless-ngx document search MCP (barryw/PaperlessMCP). Traefik bearer auth via Aetherinox api-token-middleware. `auth=none` at ingress; gateway-level bearer enforced by `paperless-mcp/bearer-auth` Middleware CRD. Tokens + paperless API token in Vault `secret/paperless-mcp`. | paperless-mcp | -| council-complaints | Islington civic reporting pilot | council-complaints | +| paperless-ai | AI layer over Paperless-ngx (clusterzx/paperless-ai): semantic/RAG document search (Chat) + auto-tagging. Local embeddings (sentence-transformers MiniLM) + ChromaDB on the PVC — search is GPU-free. LLM (chat answers + tagging) via in-cluster llama-swap `qwen3-8b` (`SYSTEM_PROMPT=/no_think` to keep Qwen3 output parseable). `auth=required` (Authentik) at `paperless-ai.viktorbarzin.me`. Reads Paperless over the internal svc as a dedicated `paperless-ai` superuser. **Runtime config + app-admin live in the PVC `.env`/SQLite (written once via the app's setup flow), NOT TF env — its dotenv loader does not override `process.env`, so container env shadows the `.env`.** Vault `secret/paperless-ai` (paperless_api_token, api_key, custom_api_key, app_admin_*). | paperless-ai | ## Optional | Service | Description | Stack | @@ -93,7 +95,7 @@ | n8n | Workflow automation | n8n | | real-estate-crawler | Property crawler | real-estate-crawler | | tor-proxy | Tor proxy | tor-proxy | -| forgejo | Git forge | forgejo | +| forgejo | Git forge. 
Open native self-signup (Turnstile captcha + email confirm) + Authentik & GitHub OAuth sign-in; see `docs/runbooks/forgejo-open-signups.md` | forgejo | | freshrss | RSS reader | freshrss | | navidrome | Music streaming | navidrome | | networking-toolbox | Network tools | networking-toolbox | @@ -116,7 +118,7 @@ | status-page | Status page | status-page | | plotting-book | Book plotting/world-building app | plotting-book | | tripit | Self-hosted TripIt-clone travel-itinerary PWA (FastAPI + SvelteKit SPA, same-origin). CNPG (`tripit` db, Vault static role `pg-tripit`) + RWX NFS trip-doc vault (`/srv/nfs/tripit-documents`) + RWO `proxmox-lvm-encrypted` personal-document vault `tripit-personal-documents` (passports/IDs — AES-256-GCM app-layer envelope, master key `DOCUMENT_ENCRYPTION_KEY` in `secret/tripit`). `auth=required` (Authentik forward-auth, reads `X-authentik-email`); second `auth=none` ingress on `/api/calendar` for HMAC-token-gated `.ics` feed. Email-ingest CronJob `tripit-ingest-plans` (`*/15`) is the SOLE inbound path — forward a booking to plans@viktorbarzin.me (catch-all → spam@), polled read-only and routed ONLY to a registered user / verified linked address (no default-owner fallback; strangers ignored), parsed by local LLM (`qwen3vl-4b`), and the sender is emailed the outcome (Added to trip / Couldn't import). Plus `tripit-poll-flights`, `tripit-run-reminders`, `tripit-transport-nudge`, `tripit-weather-brief`. (The old Gmail-scrape `tripit-ingest-mail` CronJob was removed 2026-06-05.) App secrets in Vault `secret/tripit`. | tripit | -| stem95su | STEM educational platform for **95. СУ „Проф. Иван Шишманов"** (Sofia school) at stem95su.viktorbarzin.me. Public **open** static site (`auth=none` — CrowdSec + ai-bot-block, no login). 
Stock `nginx:1.28-alpine` serving content **straight off PVE host NFS** `/srv/nfs/stem-site` (RWX `nfs_volume`, mounted read-only) — **NOT** image-baked, so the externally-authored (Gemini-exported) HTML/media updates with no rebuild; auto-backed-up offsite by `nfs-mirror`. **Content source = Google Drive folder "claude"** (id `1cmOI2jRyBJdnrVPgbr4kx2cx_4DY6pm_`, shared Valentina→vbarzin@gmail.com). **Deploy is ON-DEMAND, no scheduled job** (deliberate — short-term content, avoid rotting artifacts): mirror Drive→NFS via a throwaway `rclone/rclone` container using the existing `google_workspace` OAuth creds in Vault `secret/viktor` (`google_workspace_mcp_token_json`) → rsync to `/srv/nfs/stem-site` (empty-source guard). Just ask Claude to "sync stem95su from Drive" (recipe in claude-memory). Nextcloud "PVE NFS Pool"/rsync still works as a manual fallback. Dashboard `stem_board.html` served at `/` via a small nginx ConfigMap (`index`). No DB, no in-cluster secrets. Reference impl for the NFS-backed static-site pattern (see patterns.md). | stem95su | +| stem95su | STEM educational platform for **95. СУ „Проф. Иван Шишманов"** (Sofia school) at stem95su.viktorbarzin.me. Public **open** static site (`auth=none` — CrowdSec + ai-bot-block, no login). Stock `nginx:1.28-alpine` serving content **straight off PVE host NFS** `/srv/nfs/stem-site` (RWX `nfs_volume`, mounted read-only) — **NOT** image-baked, so the externally-authored (Gemini-exported) HTML/media updates with no rebuild; auto-backed-up offsite by `nfs-mirror`. **Content source = Google Drive folder "claude"** (id `1cmOI2jRyBJdnrVPgbr4kx2cx_4DY6pm_`, shared Valentina→vbarzin@gmail.com). 
**Deploy = scheduled mirror** (since 2026-06-09, reversed the earlier on-demand-only call once content went active): CronJob `stem95su-gdrive-sync` (`*/10`, `stacks/stem95su/gdrive-sync.tf`) mounts the content PVC RW and `rclone sync`s the Drive folder onto it (`docker.io/rclone/rclone:1.74.3`, `scope=drive.readonly` — Drive is READ-ONLY; empty-source guard + `--max-delete 25` so a partial listing can't wipe the site). rclone creds (OAuth refresh-token) in Vault `secret/stem95su` (`rclone_conf`) → ESO secret `stem95su-rclone`. **Requires the GCP OAuth app (project home-lab-1700868541205) published to "Production"** or the refresh token expires ~weekly (re-mint + `vault kv put secret/stem95su rclone_conf=…` after publishing); a dead token surfaces as a failed Job. Manual on-demand sync still possible (throwaway rclone container from devvm; recipe in claude-memory). Nextcloud "PVE NFS Pool"/rsync is a manual fallback. Dashboard `stem_board.html` served at `/` via a small nginx ConfigMap (`index`). No DB, no in-cluster secrets. Reference impl for the NFS-backed static-site pattern (see patterns.md). | stem95su | | trek | **TRIAL (2026-06-05)** — self-hosted group-trip planner (upstream [TREK](https://github.com/mauriceboe/TREK), `mauriceboe/trek:3.0.22`, AGPL-3.0). Solo evaluation behind Authentik forward-auth (`auth=required`) before deciding build-vs-adopt; covers collaborative trip planning + accommodation records + activities + per-person budget splitting on free OpenStreetMap (no paid maps key). SQLite + uploads on `proxmox-lvm-encrypted` (`trek-data-encrypted` 2Gi, `trek-uploads-encrypted` 5Gi). For the trial only: `ENCRYPTION_KEY` is TREK-auto-generated onto the data PVC and the bootstrap admin (`admin@trek.local`) is printed to pod logs — NO Vault/ESO wiring (graduation TODO: move key to `secret/trek` + ESO, add an app-level SQLite backup CronJob since host file-backup can't read the LUKS PVC, wire TREK↔Authentik OIDC). Pinned image, TF-managed (no CI/Keel). 
Availability-poll companion (Rallly) deferred. Teardown: `tg destroy` in `stacks/trek`. | trek | ## Cloudflare Domains diff --git a/.claude/skills/home-assistant/SKILL.md b/.claude/skills/home-assistant/SKILL.md index fe761f8c..ab07a27f 100644 --- a/.claude/skills/home-assistant/SKILL.md +++ b/.claude/skills/home-assistant/SKILL.md @@ -11,8 +11,8 @@ description: | There are TWO Home Assistant deployments: ha-london (default) and ha-sofia. Always use Home Assistant for smart home control. author: Claude Code -version: 2.0.0 -date: 2026-02-07 +version: 2.1.0 +date: 2026-06-24 --- # Home Assistant Control @@ -44,6 +44,12 @@ There are **two** Home Assistant instances: - Environment variables for each instance: - **ha-london**: `HOME_ASSISTANT_URL` and `HOME_ASSISTANT_TOKEN` - **ha-sofia**: `HOME_ASSISTANT_SOFIA_URL` and `HOME_ASSISTANT_SOFIA_TOKEN` + - If those env vars aren't set (e.g. you're not in the infra repo / Claude venv), don't hand-roll a `kubectl | base64 | jq` token pipeline — use the global **`homelab` CLI** instead (on `$PATH` in any directory): + +## homelab CLI (preferred — works from any directory) +- **Token**: `homelab ha token [--instance sofia|london]` resolves the long-lived API token live from the cluster. Use it directly in curl: `curl -H "Authorization: Bearer $(homelab ha token)" https://ha-sofia.viktorbarzin.me/api/states`. (The `home-assistant-sofia.py` script also auto-falls-back to this when its env var is unset.) +- **Host shell** (ha-sofia): `homelab ha ssh -- ` runs a command on the HA host with deterministic non-interactive ssh (no host-key prompt) — e.g. `homelab ha ssh -- "sudo docker ps"`, `homelab ha ssh -- "cat /config/configuration.yaml"`. Replaces bespoke `ssh -o StrictHostKeyChecking=no …` invocations. +- **Cluster metrics/logs** (not HA-specific): prefer `homelab metrics query ""` / `homelab logs query ""` over hand-rolled `curl …/api/v1/query`, and `homelab claim`/`release` over calling `scripts/presence` directly. 
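The token-plus-curl pattern from the homelab CLI bullets above can be wrapped in a small helper. This is a minimal sketch, assuming `homelab ha token --instance sofia|london` behaves as described there and that both instances answer on the `ha-<instance>.viktorbarzin.me` hostnames used in this skill. The function names `ha_base_url` and `ha_get` are hypothetical and not part of the `homelab` CLI.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (illustrative names, not shipped with the homelab CLI).

# Map an instance name to its public base URL.
# Assumption: hostnames follow the ha-<instance>.viktorbarzin.me pattern.
ha_base_url() {
  case "$1" in
    sofia|london) printf 'https://ha-%s.viktorbarzin.me\n' "$1" ;;
    *) echo "unknown instance: $1" >&2; return 1 ;;
  esac
}

# GET an API path on the chosen instance.
# The token is resolved live via `homelab ha token`, so no env vars are needed.
ha_get() {
  local instance="$1" path="$2" base token
  base="$(ha_base_url "$instance")" || return 1
  token="$(homelab ha token --instance "$instance")" || return 1
  curl -fsS -H "Authorization: Bearer ${token}" "${base}${path}"
}
```

Usage would look like `ha_get london /api/states`. The instance is passed explicitly on every call so the helper never depends on whichever default the CLI picks.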
## API Control @@ -389,14 +395,27 @@ Advanced SSH, File Editor, Studio Code Server, InfluxDB, Mosquitto, Node-RED, Fr ## ha-london Knowledge Map ### Overview -- **HA Version**: 2025.9.1 (Docker container on Raspberry Pi) +- **HA Version**: 2026.5.2 on **Home Assistant OS** (HAOS — managed appliance, NOT a `docker run` container). Latest is 2026.6.4 (update available, deliberately not applied). - **Location**: London, UK -- **Platform**: Raspberry Pi 4, HA OS (not Docker standalone) -- **SSH**: `ssh hassio@192.168.8.103` (requires `sudo` for file access) -- **Config path**: `/config/` (requires `sudo` for file access) +- **Platform**: Raspberry Pi 4, HA OS +- **Access from the Sofia devvm**: london is **remote** — `homelab ha ssh --instance london` generally WON'T connect (ADR-0012). Drive it via the API: `homelab ha token --instance london` + `https://ha-london.viktorbarzin.me/api/...`, and the WebSocket API `wss://ha-london.viktorbarzin.me/api/websocket` for dashboards / config-entries / HACS installs. +- **SSH (only from the London LAN)**: `ssh hassio@192.168.8.103` (requires `sudo` for file access) +- **Config path**: `/config/` - **3 tracked people**: Viktor Barzin, Anca Milea, Gheorghe Milea - **Zone**: London (home) +### Dashboards (redesigned 2026-06-24) +**Glossary** (HA terms — keep distinct): +- **Dashboard** = a sidebar entry (Overview, Air Quality, Map). Sidebar *order* is a per-USER frontend preference, not in any dashboard config. +- **View** = a tab inside a dashboard. View order is global (stored in the dashboard config). +- **Card** = a widget inside a view. + +- **Overview** (`lovelace`, the default): responsive **sections** views, styled with Mushroom + mini-graph-card. + - **Home** tab: *Who's home* · *Comfort & Air* (CO₂/temp/humidity/PM2.5/VOC chips + CO₂ and temp/humidity trend graphs + link to Air Quality) · *Cowboy* (battery/range/last-ride) · *Energy* (5 Kasa plugs + power trend) · *Quick actions* (Netflix/Stremio/Night). 
+ - **More** tab: *Network* (GL-MT6000 router) · *System* (HA version/update, last backup, RPi power) · *Phones*. +- **Air Quality** (`air-quality`): deep-dive (views: Home, Detailed). (`detialed`→`detailed` path typo fixed 2026-06-24.) +- Built via the WS `lovelace/config/save` API (london is remote — no SSH path). + ### Key Systems #### 1. Smart Plugs (TP-Link Kasa) — Energy Monitoring @@ -418,10 +437,15 @@ Named plugs with power/energy tracking: - PM1.0/2.5/4.0/10 particulate sensors - VOC, NOx, ammonia, CO, ethanol, hydrogen, methane, NO2 gas sensors -#### 3. Cowboy E-Bike -- `sensor.bike_state_of_charge`: Battery % -- `sensor.bike_total_distance`: Total km -- `sensor.bike_total_co2_saved`: CO2 saved (grams) +#### 3. Cowboy E-Bike (`elsbrock/cowboy-ha`) +Bike named **"Classic Performance"** → entities are `sensor.classic_performance_*` (26 total). The old `sensor.bike_*` names are GONE (they were the dead `jdejaegh` integration). +- `sensor.classic_performance_remaining_battery`: Battery % (was `sensor.bike_state_of_charge`) +- `sensor.classic_performance_remaining_range`: Range km +- `sensor.classic_performance_mileage`: Total km (was `sensor.bike_total_distance`) +- `sensor.classic_performance_saved_co2`: Lifetime CO2 saved (was `sensor.bike_total_co2_saved`) +- Plus `_distance_today`, `_last_trip_*`, `_battery_health`, `device_tracker.classic_performance`, etc. +- **GOTCHA**: live battery/range/mileage read `unknown` while the bike is parked/asleep — Cowboy only reports live SoC when awake (ridden/charging); trip-history + `distance_today` stay live regardless. +- Auth: account **email+password** (no AWS Cognito — that was the dead `jdejaegh`/`cowboybike` lineage). Setup via UI config flow / REST `config_entries/flow`. Creds in Vaultwarden item **"cowboy bike"** (`homelab vault get "cowboy bike"`). #### 4. 
Uptime Monitoring (UptimeRobot) - `sensor.blog`: blog uptime @@ -440,12 +464,17 @@ Named plugs with power/energy tracking: - Scripts: `script.start_netflix`, `script.start_stremio` - Scene: `scene.night` (turns off Livia + Michelle plugs) -### Custom Components -- **cowboy**: Cowboy e-bike integration (HACS) -- **hildebrandglow_dcc**: UK smart meter DCC energy data (HACS) +### Custom Components (HACS integrations) +- **cowboy** (`elsbrock/cowboy-ha` v1.2.0): Cowboy e-bike — revived 2026-06-24. The old `jdejaegh/home-assistant-cowboy` repo is **dead (404)**; don't chase it. +- **hildebrandglow_dcc**: UK smart meter DCC energy — **DISABLED by user** (config entry `disabled_by: user`), not broken. + +### HACS frontend cards (plugins) +- **Mushroom** (`piitaya/lovelace-mushroom`), **mini-graph-card** (`kalkih/mini-graph-card`), **plotly-graph-card** (`dbuezas/lovelace-plotly-graph-card`) — used by the redesigned Overview. Install over WS `hacs/repository/download`; resources auto-register in storage mode. ### Integrations -ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BLE, Ookla Speedtest, HACS, OpenRouter (multiple free LLMs), Piper (local TTS), Whisper (local STT), Android TV/ADB +ESPHome, TP-Link Kasa, Tapo, UptimeRobot, **Cowboy** (elsbrock), Oral-B BLE, Ookla Speedtest (exposes only an `update` entity, no live speed sensors), HACS, OpenRouter (free LLMs), Piper (TTS), Whisper (STT), Android TV/ADB. +- **Disabled by user (NOT broken)**: `met` + `metoffice` (weather — so `weather.*` entities are ABSENT), `roomba` (Rumi vacuum), `hildebrandglow_dcc` (energy). +- **Failing**: `tplink` **Tapo P100** projector plug — `setup_retry`, 403 KLAP handshake from 192.168.8.108 (plug off / firmware). Left as-is. 
### AI / Voice Assistants - 5 free LLM conversation agents: Google Gemma 3 27B, Meta Llama 3.2 3B, Mistral Devstral 2, OpenAI GPT-OSS-20B, Z.AI GLM 4.5 Air @@ -460,15 +489,8 @@ ESPHome, TP-Link Kasa, Tapo, UptimeRobot, Cowboy, Hildebrand Glow DCC, Oral-B BL - Anca arrival/departure notifications - Night scene: turns off Livia + Michelle -### Docker Setup -```bash -docker run -d --name homeassistant --privileged \ - -e TZ=Europe/London \ - -v /home/pi/docker/homeAssistant:/config \ - -v /run/dbus:/run/dbus:ro \ - --network=host --restart=unless-stopped \ - homeassistant/home-assistant:2025.9 -``` +### Platform (HAOS — ignore any legacy `docker run` snippet) +ha-london runs **Home Assistant OS** (managed appliance), NOT a hand-run Docker container. There is no `docker run homeassistant/home-assistant` to manage. Install HACS components over the WebSocket API (`hacs/repository/download` with the repo's HACS id), then restart via `POST /api/services/homeassistant/restart` — a HAOS restart drops automations for ~1–2 min and resets `sensor.uptime` (use that as the "back up" marker). ### SSH Access ```bash diff --git a/.claude/skills/upgrade-state/SKILL.md b/.claude/skills/upgrade-state/SKILL.md index a2027a50..34fe4731 100644 --- a/.claude/skills/upgrade-state/SKILL.md +++ b/.claude/skills/upgrade-state/SKILL.md @@ -51,7 +51,7 @@ Exit codes: `0` healthy, `1` attention warranted, `2` stalled / broken. 
|---|---|---|---| | **Apps** | Keel polls every watched Deployment's container registry; rolls on new digest | hourly | Prom (`pending_approvals`, `registries_scanned_total`), Keel pod logs | | **OS** | `unattended-upgrades` in-release patching; `kured` reboots when `/var/run/reboot-required` is set | daily 02:00-06:00 London | SSH fan-out to all 5 nodes | -| **K8s** | `k8s-version-check` CronJob detects new kubeadm patch/minor; spawns the Job-chain that drains+upgrades node-by-node | daily 12:00 UTC | Pushgateway (`k8s_upgrade_*`), `kubectl get nodes` | +| **K8s** | `k8s-version-check` CronJob detects new kubeadm patch/minor; spawns the Job-chain that drains+upgrades node-by-node | nightly 23:00 UTC | Pushgateway (`k8s_upgrade_*`), `kubectl get nodes` | The K8s pipeline pushes a small set of gauges to the Prometheus Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`): @@ -61,8 +61,11 @@ Pushgateway (`prometheus-prometheus-pushgateway.monitoring:9091`): - `k8s_upgrade_in_flight` — 0/1 - `k8s_upgrade_started_timestamp` — when the current chain started (0 when idle) -`K8sUpgradeStalled` alert fires when `in_flight=1` and the chain has -been running >90 minutes. The script raises `✗` in the same window. +`K8sUpgradeStalled` fires when `in_flight=1` and the chain has been running +>90 minutes. `K8sUpgradeChainJobFailed` fires when a phase Job terminally +failed — including a **preflight that aborted before `in_flight` was set** +(the gates exit pre-metric). The script raises `✗` for either, and reads the +Jobs directly, so it also catches a Failed preflight that left no metric. ## Status-icon legend @@ -72,7 +75,7 @@ been running >90 minutes. The script raises `✗` in the same window. 
| `→` | Update available, not yet applied (K8s patch/minor) | | `…` | In flight — chain currently running | | `⚠` | Attention: held-with-bumps, recent errors, pending approvals | -| `✗` | Broken: pod down, alert firing, chain stalled | +| `✗` | Broken: pod down, alert firing, chain stalled, or a chain Job failed | ## Drill-down — when a row trips, what to do @@ -177,6 +180,31 @@ kubectl -n monitoring exec deploy/prometheus-server -c prometheus-server -- sh - --header='Content-Type: text/plain'" ``` +### K8s `✗ chain failed` — a phase Job terminally failed + +`K8sUpgradeChainJobFailed` would fire. Most often a **preflight** that aborted +on a gate (a critical alert firing, a node not Ready, a kubeadm-plan mismatch) — +these exit before `in_flight` is set, so `K8sUpgradeStalled` never sees them, and +the deterministic name + 7d TTL blocked re-spawn (the 2026-06-12 5-day wedge). + +```bash +kubectl -n k8s-upgrade get jobs +kubectl -n k8s-upgrade describe job # check the Failed reason +# Preflight abort reasons post to Slack ONLY (not stdout), so Loki won't have +# them. Replay the gate instead — which critical alerts were firing at the +# failure time? (ALERTS{severity="critical"} in Prometheus, query at that ts.) +``` + +Recovery is now mostly automatic: the detection CronJob and `spawn_next` +re-spawn a terminally-Failed Job on the next cycle (retry-on-failure), so a +transient gate clears within ~24h. 
To expedite, delete the Failed Job and
+trigger detection:
+
+```bash
+kubectl -n k8s-upgrade delete job
+kubectl -n k8s-upgrade create job --from=cronjob/k8s-version-check manual-detect-$(date +%s)
+```
+
 ### K8s `✗ detection stale` — last detection >9 days

 ```bash
diff --git a/.github/workflows/build-android-emulator.yml b/.github/workflows/build-android-emulator.yml
new file mode 100644
index 00000000..3e9ffd5d
--- /dev/null
+++ b/.github/workflows/build-android-emulator.yml
@@ -0,0 +1,36 @@
+name: Build android-emulator
+
+# ADR-0002: infra-owned image built off-infra on GHA → ghcr (public).
+# Large image (Android SDK + emulator); on-demand workload (scaled 0). Rebuilds
+# rare → dispatch + path trigger.
+on:
+  push:
+    branches: [master]
+    paths:
+      - 'stacks/android-emulator/docker/**'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+  packages: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - uses: docker/build-push-action@v6
+        with:
+          context: stacks/android-emulator/docker
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/android-emulator:latest
+            ghcr.io/viktorbarzin/android-emulator:${{ github.sha }}
diff --git a/.github/workflows/build-chrome-service-browser.yml b/.github/workflows/build-chrome-service-browser.yml
new file mode 100644
index 00000000..9d2129c8
--- /dev/null
+++ b/.github/workflows/build-chrome-service-browser.yml
@@ -0,0 +1,39 @@
+name: Build chrome-service-browser
+
+# ADR-0002: infra-owned image built off-infra on GHA → ghcr. Playwright base +
+# real Google Chrome (proprietary H.264/AAC codecs) for the chrome-service
+# browser container, so the noVNC view can play H.264 video (Reels). Rebuilds
+# are rare → dispatch + path trigger. NOTE: after the first push, set the ghcr
+# package `chrome-service-browser` to PUBLIC (same as chrome-service-novnc) so
+# the pod pulls it without credentials.
+on:
+  push:
+    branches: [master]
+    paths:
+      - 'stacks/chrome-service/files/chrome/**'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+  packages: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - uses: docker/build-push-action@v6
+        with:
+          context: stacks/chrome-service/files/chrome
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/chrome-service-browser:latest
+            ghcr.io/viktorbarzin/chrome-service-browser:${{ github.sha }}
diff --git a/.github/workflows/build-chrome-service-novnc.yml b/.github/workflows/build-chrome-service-novnc.yml
new file mode 100644
index 00000000..78daa6e5
--- /dev/null
+++ b/.github/workflows/build-chrome-service-novnc.yml
@@ -0,0 +1,36 @@
+name: Build chrome-service-novnc
+
+# ADR-0002: infra-owned image built off-infra on GHA → ghcr (public).
+# Source Dockerfile identical on both git remotes, so the github checkout builds
+# the current image. Rebuilds are rare (stable noVNC proxy) → dispatch + path.
+on:
+  push:
+    branches: [master]
+    paths:
+      - 'stacks/chrome-service/files/novnc/**'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+  packages: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - uses: docker/build-push-action@v6
+        with:
+          context: stacks/chrome-service/files/novnc
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/chrome-service-novnc:latest
+            ghcr.io/viktorbarzin/chrome-service-novnc:${{ github.sha }}
diff --git a/.github/workflows/build-cli.yml b/.github/workflows/build-cli.yml
new file mode 100644
index 00000000..f27856dc
--- /dev/null
+++ b/.github/workflows/build-cli.yml
@@ -0,0 +1,41 @@
+name: Build infra CLI
+
+# ADR-0002: infra CLI built off-infra on GHA. Replaces the Woodpecker
+# build-cli.yml. Pushes to DockerHub (public distribution, kept) + ghcr.
+# Not a cluster workload — a distributed tool image.
+on: + push: + branches: [master] + paths: + - 'cli/**' + workflow_dispatch: {} + +permissions: + contents: read + packages: write + +jobs: + build: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: docker/setup-buildx-action@v3 + - uses: docker/login-action@v3 + with: + registry: ghcr.io + username: ${{ github.actor }} + password: ${{ secrets.GITHUB_TOKEN }} + - uses: docker/login-action@v3 + with: + username: ${{ secrets.DOCKERHUB_USERNAME }} + password: ${{ secrets.DOCKERHUB_TOKEN }} + - uses: docker/build-push-action@v6 + with: + context: cli + platforms: linux/amd64 + provenance: false + push: true + tags: | + viktorbarzin/infra:latest + ghcr.io/viktorbarzin/infra-cli:latest + ghcr.io/viktorbarzin/infra-cli:${{ github.sha }} diff --git a/.github/workflows/build-infra-ci.yml b/.github/workflows/build-infra-ci.yml new file mode 100644 index 00000000..f3a4614f --- /dev/null +++ b/.github/workflows/build-infra-ci.yml @@ -0,0 +1,37 @@ +name: Build infra-ci + +# ADR-0002: the infra CI toolbox image (terraform/terragrunt/sops/kubectl/vault) +# built off-infra on GHA → ghcr (public). BOOTSTRAP-CRITICAL: .woodpecker/default.yml's +# apply step runs in this image. The Woodpecker build-ci-image.yml is kept until a +# ghcr-based apply is proven, then removed. 
+on: + push: + branches: [master] + paths: + - 'ci/Dockerfile' + workflow_dispatch: {} + +permissions: + contents: read + packages: write + +jobs: + build: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: docker/setup-buildx-action@v3 + - uses: docker/login-action@v3 + with: + registry: ghcr.io + username: ${{ github.actor }} + password: ${{ secrets.GITHUB_TOKEN }} + - uses: docker/build-push-action@v6 + with: + context: ci + platforms: linux/amd64 + provenance: false + push: true + tags: | + ghcr.io/viktorbarzin/infra-ci:latest + ghcr.io/viktorbarzin/infra-ci:${{ github.sha }} diff --git a/.github/workflows/build-k8s-portal.yml b/.github/workflows/build-k8s-portal.yml new file mode 100644 index 00000000..f81e13af --- /dev/null +++ b/.github/workflows/build-k8s-portal.yml @@ -0,0 +1,36 @@ +name: Build k8s-portal + +# ADR-0002 / no-local-builds: k8s-portal (infra-owned Go portal) builds off-infra +# on GHA → public ghcr; Keel polls ghcr:latest and rolls the deployment. Replaces +# the in-cluster .woodpecker/k8s-portal.yml build. 
+on: + push: + branches: [master] + paths: + - 'stacks/k8s-portal/modules/k8s-portal/files/**' + workflow_dispatch: {} + +permissions: + contents: read + packages: write + +jobs: + build: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: docker/setup-buildx-action@v3 + - uses: docker/login-action@v3 + with: + registry: ghcr.io + username: ${{ github.actor }} + password: ${{ secrets.GITHUB_TOKEN }} + - uses: docker/build-push-action@v6 + with: + context: stacks/k8s-portal/modules/k8s-portal/files + platforms: linux/amd64 + provenance: false + push: true + tags: | + ghcr.io/viktorbarzin/k8s-portal:latest + ghcr.io/viktorbarzin/k8s-portal:${{ github.sha }} diff --git a/.gitignore b/.gitignore index 3475f32a..b288aed5 100755 --- a/.gitignore +++ b/.gitignore @@ -103,3 +103,16 @@ stacks/terminal/clipboard-upload/clipboard-upload # Plaintext terraform state — NEVER commit (use SOPS-encrypted .tfstate.enc only) terraform.tfstate terraform.tfstate.backup + +# Per-feature git worktrees (worktree-first workflow — execution.md) +.worktrees/ + +# Timestamped terraform state backups (terraform.tfstate..backup) — plaintext Tier-0 +# secrets; created by terraform state ops. The patterns above miss the timestamped form. +terraform.tfstate.*.backup + +# Python test artifacts (pytest bytecode cache) — e.g. 
from +# stacks/k8s-version-upgrade/scripts/test_compat_gate.py +__pycache__/ +*.pyc +.pytest_cache/ diff --git a/.mcp.json b/.mcp.json index 9f39ff76..18bb4d81 100644 --- a/.mcp.json +++ b/.mcp.json @@ -3,10 +3,6 @@ "ha": { "type": "http", "url": "${HA_MCP_URL}" - }, - "paperless": { - "type": "http", - "url": "http://paperless-mcp.paperless-mcp.svc.cluster.local/mcp" } } } diff --git a/.woodpecker/breakglass-infra-ci.yml b/.woodpecker/breakglass-infra-ci.yml new file mode 100644 index 00000000..bbc43d7d --- /dev/null +++ b/.woodpecker/breakglass-infra-ci.yml @@ -0,0 +1,31 @@ +# Break-glass: save the ghcr infra-ci image to a tarball on the registry VM +# (10.0.20.10) so it can be `docker load`-ed onto a node if ghcr is ever +# unreachable during a recovery. infra-ci now builds on GHA → ghcr (ADR-0002), +# which is external + node-cached, so this is a belt-and-braces DR artifact — +# run MANUALLY after an infra-ci rebuild (or periodically). Pulls from ghcr +# (public, no login). Recovery: docs/runbooks/forgejo-registry-breakglass.md. 
+when: + - event: manual + +steps: + - name: breakglass-tarball + image: alpine:3.20 + failure: ignore + environment: + REGISTRY_SSH_KEY: + from_secret: registry_ssh_key + commands: + - apk add --no-cache openssh-client + - mkdir -p ~/.ssh && chmod 700 ~/.ssh + - printf '%s\n' "$REGISTRY_SSH_KEY" > ~/.ssh/id_ed25519 + - chmod 600 ~/.ssh/id_ed25519 + - ssh-keyscan -t ed25519 10.0.20.10 >> ~/.ssh/known_hosts 2>/dev/null + - | + ssh -n -o BatchMode=yes root@10.0.20.10 " + set -e + mkdir -p /opt/registry/data/private/_breakglass + IMAGE=ghcr.io/viktorbarzin/infra-ci:latest + docker pull \$IMAGE + docker save \$IMAGE | gzip > /opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz + ls -lh /opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz + " diff --git a/.woodpecker/build-ci-image.yml b/.woodpecker/build-ci-image.yml deleted file mode 100644 index 796426ac..00000000 --- a/.woodpecker/build-ci-image.yml +++ /dev/null @@ -1,88 +0,0 @@ -# Build the CI tools Docker image used by all infra pipelines. -# Triggers on push that touches ci/Dockerfile, or manual (API/UI) so -# rebuilds after a registry incident don't need a cosmetic Dockerfile edit. - -when: - - event: push - branch: master - path: - include: - - 'ci/Dockerfile' - - event: manual - -steps: - - name: build-and-push - image: woodpeckerci/plugin-docker-buildx - settings: - # Phase 4 of forgejo-registry-consolidation 2026-05-07 — - # registry.viktorbarzin.me dropped, Forgejo is the only target. - repo: - - forgejo.viktorbarzin.me/viktor/infra-ci - dockerfile: ci/Dockerfile - context: ci/ - tags: - - latest - - "${CI_COMMIT_SHA:0:8}" - platforms: linux/amd64 - logins: - - registry: forgejo.viktorbarzin.me - username: - from_secret: forgejo_user - password: - from_secret: forgejo_push_token - - # Post-push integrity check is now redundant with the every-15min - # forgejo-integrity-probe in stacks/monitoring/, which walks - # /v2/_catalog + HEADs every blob across the entire Forgejo registry. 
- # If a corruption pattern emerges that the periodic probe misses, - # restore a verify step similar to the pre-Phase-4 version (see - # commit 49f4956f) but pointed at forgejo.viktorbarzin.me. - - # Break-glass tarball: save the just-pushed infra-ci image to disk on the - # registry VM (10.0.20.10) so we can `docker load` it back into a node - # when Forgejo is unreachable. Pulls from Forgejo (the only registry now). - # Best-effort — failure here doesn't fail the pipeline. - # Recovery procedure: docs/runbooks/forgejo-registry-breakglass.md. - - name: breakglass-tarball - image: alpine:3.20 - failure: ignore - environment: - REGISTRY_SSH_KEY: - from_secret: registry_ssh_key - FORGEJO_USER: - from_secret: forgejo_user - FORGEJO_PASS: - from_secret: forgejo_push_token - commands: - - apk add --no-cache openssh-client - - mkdir -p ~/.ssh && chmod 700 ~/.ssh - - printf '%s\n' "$REGISTRY_SSH_KEY" > ~/.ssh/id_ed25519 - - chmod 600 ~/.ssh/id_ed25519 - - ssh-keyscan -t ed25519 10.0.20.10 >> ~/.ssh/known_hosts 2>/dev/null - - SHA=${CI_COMMIT_SHA:0:8} - - | - ssh -n -o BatchMode=yes root@10.0.20.10 " - set -e - mkdir -p /opt/registry/data/private/_breakglass - IMAGE=forgejo.viktorbarzin.me/viktor/infra-ci:$SHA - echo \$FORGEJO_PASS | docker login forgejo.viktorbarzin.me -u \$FORGEJO_USER --password-stdin - docker pull \$IMAGE - docker save \$IMAGE | gzip > /opt/registry/data/private/_breakglass/infra-ci-$SHA.tar.gz - ln -sfn infra-ci-$SHA.tar.gz /opt/registry/data/private/_breakglass/infra-ci-latest.tar.gz - ls -t /opt/registry/data/private/_breakglass/infra-ci-*.tar.gz \ - | grep -v 'latest' | tail -n +6 | xargs -r rm -v - ls -lh /opt/registry/data/private/_breakglass/ - " - - - name: slack - image: curlimages/curl - commands: - - | - curl -s -X POST -H 'Content-type: application/json' \ - --data "{\"text\":\"CI image built: forgejo.viktorbarzin.me/viktor/infra-ci:${CI_COMMIT_SHA:0:8} (and registry-private mirror)\"}" \ - "$SLACK_WEBHOOK" || true - environment: - 
SLACK_WEBHOOK: - from_secret: slack_webhook - when: - status: [success] diff --git a/.woodpecker/build-cli.yml b/.woodpecker/build-cli.yml deleted file mode 100644 index cf95da7e..00000000 --- a/.woodpecker/build-cli.yml +++ /dev/null @@ -1,42 +0,0 @@ -when: - event: push - -clone: - git: - image: woodpeckerci/plugin-git - settings: - attempts: 5 - backoff: 10s - -steps: - - name: build-image - image: woodpeckerci/plugin-docker-buildx - settings: - username: "viktorbarzin" - password: - from_secret: dockerhub-pat - # Phase 4 of forgejo-registry-consolidation 2026-05-07 — - # registry.viktorbarzin.me:5050 decommissioned. Push to DockerHub - # (the public-facing infra image) AND Forgejo (the cluster pull - # source). Same image, two locations. - repo: - - viktorbarzin/infra - - forgejo.viktorbarzin.me/viktor/infra - logins: - - registry: https://index.docker.io/v1/ - username: viktorbarzin - password: - from_secret: dockerhub-pat - - registry: forgejo.viktorbarzin.me - username: - from_secret: forgejo_user - password: - from_secret: forgejo_push_token - dockerfile: cli/Dockerfile - context: cli - auto_tag: true - # cache_from/cache_to removed: registry cache corruption causes - # "short read: expected 32 bytes" BuildKit errors. Inline cache - # will be re-populated once a clean image is pushed. - # cache_from: "registry.viktorbarzin.me:5050/infra:latest" - # cache_to: "type=inline" diff --git a/.woodpecker/default.yml b/.woodpecker/default.yml index 5661bccd..ef94ccee 100644 --- a/.woodpecker/default.yml +++ b/.woodpecker/default.yml @@ -19,13 +19,34 @@ clone: git: image: woodpeckerci/plugin-git settings: + partial: false depth: 2 attempts: 5 backoff: 10s steps: + # Audit feed for the allow-then-audit contribution model: any master push by + # a NON-admin author is surfaced in Slack (Viktor's own pushes are not). + # Runs before apply and never blocks it. 
Note: [ci skip] commits never reach + # this step (Woodpecker skips the whole pipeline) — hence the rule that + # non-admins must not use [ci skip]. + - name: notify-nonadmin-push + image: curlimages/curl + environment: + SLACK_WEBHOOK: + from_secret: slack_webhook + commands: + - | + case "$CI_COMMIT_AUTHOR" in + viktor|ViktorBarzin|wizard) echo "admin push — no notify"; exit 0 ;; + esac + SUBJECT=$(echo "$CI_COMMIT_MESSAGE" | head -1 | tr -d '"\\') + curl -s -X POST -H 'Content-type: application/json' \ + --data "{\"text\":\"📝 infra master push by *$CI_COMMIT_AUTHOR*: $SUBJECT\n$CI_REPO_URL/commit/$CI_COMMIT_SHA\"}" \ + "$SLACK_WEBHOOK" || true + - name: apply - image: forgejo.viktorbarzin.me/viktor/infra-ci:latest + image: ghcr.io/viktorbarzin/infra-ci:latest pull: true backend_options: kubernetes: @@ -115,6 +136,25 @@ steps: git fetch --deepen=1 origin master 2>/dev/null || true fi + # Diff base: prefer the push's true before-state (CI_PREV_COMMIT_SHA). + # HEAD~1 is WRONG for merge commits — it is the first parent (the + # feature-branch side), so the diff shows the OTHER lineage's files + # and silently skips the stacks this push actually changed + # (bit ci-pipeline-health on 2026-06-12, pipeline 128). + DIFF_BASE="HEAD~1" + if [ -n "${CI_PREV_COMMIT_SHA:-}" ] && [ "$CI_PREV_COMMIT_SHA" != "$CI_COMMIT_SHA" ]; then + git cat-file -e "$CI_PREV_COMMIT_SHA^{commit}" 2>/dev/null || git fetch --depth=50 origin master 2>/dev/null || true + # Restarted pipelines after master moved produce REVERSE diffs + # (CI_PREV ahead of the checked-out HEAD re-applied stale trees and + # reverted a sibling apply on 2026-06-12, pipeline 148). Only use + # CI_PREV when it is an ancestor of HEAD. 
+ if git cat-file -e "$CI_PREV_COMMIT_SHA^{commit}" 2>/dev/null \ + && git merge-base --is-ancestor "$CI_PREV_COMMIT_SHA" HEAD 2>/dev/null; then + DIFF_BASE="$CI_PREV_COMMIT_SHA" + fi + fi + echo "Diff base: $DIFF_BASE" + # If still no parent, apply all platform stacks as a safe fallback if ! git rev-parse HEAD~1 >/dev/null 2>&1; then echo "Cannot determine changed files — applying ALL platform stacks" @@ -122,14 +162,14 @@ steps: > .app_apply else # Check if global files changed (triggers full platform apply) - GLOBAL_CHANGED=$(git diff --name-only HEAD~1 HEAD | grep -E '^(modules/|config\.tfvars|terragrunt\.hcl)' || true) + GLOBAL_CHANGED=$(git diff --name-only "$DIFF_BASE" HEAD | grep -E '^(modules/|config\.tfvars|terragrunt\.hcl)' || true) if [ -n "$GLOBAL_CHANGED" ]; then echo "Global files changed — applying ALL platform stacks" echo "$PLATFORM_STACKS" | tr ' ' '\n' > .platform_apply else # Detect platform stacks that changed - git diff --name-only HEAD~1 HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u > .all_changed + git diff --name-only "$DIFF_BASE" HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u > .all_changed > .platform_apply while read -r stack; do if echo "$PLATFORM_STACKS" | grep -qw "$stack"; then @@ -140,7 +180,7 @@ steps: # Detect app stacks that changed > .app_apply - git diff --name-only HEAD~1 HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u | while read -r stack; do + git diff --name-only "$DIFF_BASE" HEAD | grep '^stacks/' | cut -d/ -f2 | sort -u | while read -r stack; do if echo "$PLATFORM_STACKS" | grep -qw "$stack"; then continue # Skip platform stacks fi diff --git a/.woodpecker/drift-detection.yml b/.woodpecker/drift-detection.yml index 38cc60b9..b2e303ff 100644 --- a/.woodpecker/drift-detection.yml +++ b/.woodpecker/drift-detection.yml @@ -9,12 +9,13 @@ clone: git: image: woodpeckerci/plugin-git settings: + partial: false depth: 1 attempts: 3 steps: - name: detect-drift - image: forgejo.viktorbarzin.me/viktor/infra-ci:latest + image: 
ghcr.io/viktorbarzin/infra-ci:latest pull: true backend_options: kubernetes: diff --git a/.woodpecker/issue-automation.yml b/.woodpecker/issue-automation.yml index ece97dab..2bb46661 100644 --- a/.woodpecker/issue-automation.yml +++ b/.woodpecker/issue-automation.yml @@ -5,6 +5,7 @@ clone: git: image: woodpeckerci/plugin-git settings: + partial: false depth: 2 steps: diff --git a/.woodpecker/k8s-portal.yml b/.woodpecker/k8s-portal.yml deleted file mode 100644 index 39c9ff17..00000000 --- a/.woodpecker/k8s-portal.yml +++ /dev/null @@ -1,49 +0,0 @@ -when: - event: push - branch: master - path: - include: - - "stacks/platform/modules/k8s-portal/files/**" - -clone: - git: - image: woodpeckerci/plugin-git - settings: - attempts: 5 - backoff: 10s - -steps: - - name: build-and-push - image: woodpeckerci/plugin-docker-buildx - settings: - username: "viktorbarzin" - password: - from_secret: dockerhub-pat - repo: viktorbarzin/k8s-portal - dockerfile: stacks/platform/modules/k8s-portal/files/Dockerfile - context: stacks/platform/modules/k8s-portal/files - platforms: - - linux/amd64 - tag: ["${CI_PIPELINE_NUMBER}", "latest"] - cache_from: "viktorbarzin/k8s-portal:latest" - cache_to: "type=inline" - - - name: deploy - image: bitnami/kubectl:latest - commands: - - "kubectl set image deployment/k8s-portal portal=viktorbarzin/k8s-portal:${CI_PIPELINE_NUMBER} -n k8s-portal" - - "kubectl rollout status deployment/k8s-portal -n k8s-portal --timeout=120s" - - "echo 'k8s-portal deployed successfully (build ${CI_PIPELINE_NUMBER})'" - - - name: slack - image: curlimages/curl - commands: - - | - curl -s -X POST -H 'Content-type: application/json' \ - --data "{\"text\":\"K8s Portal: build #${CI_PIPELINE_NUMBER} ${CI_PIPELINE_STATUS}\"}" \ - "$SLACK_WEBHOOK" || true - environment: - SLACK_WEBHOOK: - from_secret: slack_webhook - when: - status: [success, failure] diff --git a/.woodpecker/postmortem-todos.yml b/.woodpecker/postmortem-todos.yml index 729e9a85..68330272 100644 --- 
a/.woodpecker/postmortem-todos.yml +++ b/.woodpecker/postmortem-todos.yml @@ -11,6 +11,7 @@ clone: git: image: woodpeckerci/plugin-git settings: + partial: false depth: 5 steps: diff --git a/.woodpecker/provision-user.yml b/.woodpecker/provision-user.yml index 0f6d5dab..3ba7af7f 100644 --- a/.woodpecker/provision-user.yml +++ b/.woodpecker/provision-user.yml @@ -5,6 +5,7 @@ clone: git: image: woodpeckerci/plugin-git settings: + partial: false attempts: 5 backoff: 10s diff --git a/.woodpecker/pve-nfs-exports-sync.yml b/.woodpecker/pve-nfs-exports-sync.yml index 2c26df45..54aea68a 100644 --- a/.woodpecker/pve-nfs-exports-sync.yml +++ b/.woodpecker/pve-nfs-exports-sync.yml @@ -23,6 +23,7 @@ clone: git: image: woodpeckerci/plugin-git settings: + partial: false depth: 1 attempts: 3 diff --git a/.woodpecker/registry-config-sync.yml b/.woodpecker/registry-config-sync.yml index a4f03185..aad59fbe 100644 --- a/.woodpecker/registry-config-sync.yml +++ b/.woodpecker/registry-config-sync.yml @@ -38,6 +38,7 @@ clone: git: image: woodpeckerci/plugin-git settings: + partial: false depth: 1 attempts: 3 diff --git a/.woodpecker/renew-tls.yml b/.woodpecker/renew-tls.yml index d2d8bf89..cd93fe7c 100644 --- a/.woodpecker/renew-tls.yml +++ b/.woodpecker/renew-tls.yml @@ -6,6 +6,7 @@ clone: git: image: woodpeckerci/plugin-git settings: + partial: false attempts: 5 backoff: 10s diff --git a/AGENTS.md b/AGENTS.md index 009c5c99..7fbc838d 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -9,7 +9,7 @@ - **Ask before `git push`** — always confirm with the user first ## Execution -- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets) +- **Apply a service**: `scripts/tg apply --non-interactive` (auto-decrypts SOPS secrets; passes `-lock-timeout`, default `5m` / `TG_LOCK_TIMEOUT`, so a contended state lock waits instead of failing with `Error acquiring the state lock`) - **Legacy apply**: `cd stacks/ && terragrunt apply --non-interactive` (uses terraform.tfvars) - 
**kubectl**: `kubectl --kubeconfig $(pwd)/config` - **Health check**: `bash scripts/cluster_healthcheck.sh --quiet` @@ -90,6 +90,7 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro - **Public domain**: `viktorbarzin.me` (Cloudflare) | **Internal**: `viktorbarzin.lan` (Technitium DNS) - **Onboarding portal**: `https://k8s-portal.viktorbarzin.me` — self-service kubectl setup + docs - **CI/CD**: Woodpecker CI — PRs run plan, merges to master auto-apply all stacks +- **CI compute is external (ADR-0002, 2026-06-12)**: builds, tests, lint, and release jobs run on GitHub Actions hosted runners via each repo's GitHub mirror — never on cluster nodes. In-cluster pipelines exist only for steps that need cluster access (Woodpecker `kubectl set image` deploys, terragrunt applies, certbot). Never add an in-cluster build or test pipeline to any repo; the fallback-build pattern was deliberately removed. After pushing anything that fires a build chain, watch it end-to-end (GHA run → Woodpecker deploy → rollout) before calling the change done — verify live state, not the checkmark. ## Key Paths - `stacks//main.tf` — service definition @@ -109,7 +110,8 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro - **SQLite on NFS is unreliable** (fsync issues) — always use proxmox-lvm or local disk for databases. - **NFS mount options**: Always `soft,timeo=30,retrans=3` to prevent uninterruptible sleep (D state). - **NFS export directory must exist** on the Proxmox host before Terraform can create the PV. -- **Backup (3-2-1)**: Copy 1 = live PVCs on sdc. Copy 2 = sda `/mnt/backup` (PVC file backups, auto SQLite backups, pfSense, PVE config). Copy 3 = Synology offsite (two-tier: sda→`pve-backup/`, NFS→`nfs/`+`nfs-ssd/` via inotify change tracking). +- **Backup (3-2-1)**: Copy 1 = live PVCs on sdc. Copy 2 = sda `/mnt/backup` (PVC file backups, auto SQLite backups, pfSense, PVE config, **VM images via `vzdump-vms`**). 
Copy 3 = Synology offsite (two-tier: sda→`pve-backup/`, NFS→`nfs/`+`nfs-ssd/` via inotify change tracking). +- **vzdump-vms** (Daily 01:00): live `vzdump --mode snapshot` of hand-managed VMs (NOT in TF) → `/mnt/backup/vzdump/`, keep 3/VMID. `VZDUMP_VMIDS` default `102` (devvm) — the only VM imaged today; before this (2026-06-09) no VM was ever imaged. NOT in the incremental offsite manifest; monthly full pass mirrors it. See `docs/architecture/backup-dr.md`. - **daily-backup** (Daily 05:00): Auto-discovered BACKUP_DIRS (glob), auto SQLite backup (magic number + `?mode=ro`), pfSense, PVE config. No NFS mirror step (NFS syncs directly to Synology via inotify). - **offsite-sync-backup** (Daily 06:00): Step 1: sda→Synology `pve-backup/`. Step 2: NFS→Synology `nfs/`+`nfs-ssd/` via `rsync --files-from` (inotify change log). Monthly full `--delete`. - **nfs-change-tracker.service**: inotifywait on `/srv/nfs` + `/srv/nfs-ssd`, logs to `/mnt/backup/.nfs-changes.log`. Incremental syncs complete in seconds. @@ -225,7 +227,69 @@ Per-workload opt-out: add the label `keel.sh/policy: never` on the Deployment me 4. Viktor reviews → CI applies → Slack notification 5. Portal: `https://k8s-portal.viktorbarzin.me/onboarding` for full guide +### Non-admin workstation users — the AGENT does the git work + +Non-admin devvm users (power-user / namespace-owner tiers) may not know git at +all. Their agent handles every version-control step silently — never ask them +to commit, push, pull, or open a PR, and never surface git jargon at them. +Their infra clone arrives preconfigured: git identity, a `forgejo` remote +authenticated via `~/.git-credentials`, and `master` tracking `forgejo/master` +(auto-freshened hourly and at session launch, fast-forward only). 
+ +Two per-user layouts exist (`code_layout` in +`scripts/workstation/roster.yaml`): `single` (the default) — `~/code` IS the +locked infra clone — and `workspace` — `~/code` is a plain directory of +per-project clones: the infra clone at `~/code/infra`, plus each roster +`repos` entry (e.g. `~/code/tripit`) cloned from Forgejo `viktor/` with +the user's own PAT. The reconcile auto-migrates a single-layout `~/code` when +a user is flipped to `workspace`, and keeps every clone fresh either way. + +The model is **allow-then-audit** (Viktor, 2026-06-10): whitelisted users (emo) +push straight to `master` — no PR gate — and the record of *what changed and +why* is what matters. Force-push is disabled for everyone, so master history +is append-only. + +**Feature-sized work is worktree-first** (org rule, 2026-06-10): develop in an +isolated worktree (`.worktrees/`, branch `/` off +`forgejo/master`) so concurrent agent sessions never collide in the clone, then +land by merging latest master into the branch and pushing it +(`git push forgejo HEAD:master`, or the PR fallback below if not whitelisted) — +the audit-trail rules below apply to the branch's commit messages all the same. +Locked (git-crypt) clones can use plain `git worktree add`. Trivial +single-commit fixes may be committed directly on a clean `master`. Full +lifecycle: `~/.claude/rules/execution.md` §3. + +To land a finished change from such a clone: + +1. Commit on `master`. **The commit message is the audit trail** — this matters + more than the change itself: + - subject: what changed, specific ("ha-sofia: lower fan curve bias to -5") + - body: WHY, in plain words — paraphrase the user's actual request and any + reasoning ("Emil asked for quieter fans in the evening; curve was + overshooting after the 2026-06-08 redesign") +2. `git push forgejo master`. If rejected non-fast-forward: `git pull --rebase + forgejo master` and push again. +3. 
**Never use `[ci skip]`** as a non-admin — it hides the change from the + Slack audit feed; a no-op CI apply on a docs-only commit is harmless. +4. Leave the clone on clean `master` so auto-refresh keeps working. +5. Tell the user in plain language what happened. Stack changes are + auto-applied by CI — verify the live result with the user's read-only + kubectl before saying "it's live". + +If a push to `master` is rejected by branch protection (user not on the +whitelist — e.g. new users before Viktor grants it), fall back to a +`/` branch + PR with the user's own PAT +(`write:repository` suffices — verified 2026-06-10): + +```bash +TOK=$(sed -E 's#https://[^:]+:([^@]+)@.*#\1#' ~/.git-credentials) +curl -X POST -H "Authorization: token $TOK" -H 'Content-Type: application/json' \ + https://forgejo.viktorbarzin.me/api/v1/repos/viktor/infra/pulls \ + -d '{"title":"","head":"<os-user>/<short-topic>","base":"master","body":"<what + why>"}' +``` + ## Common Operations +- **`homelab` CLI** (`/usr/local/bin/homelab`, source `cli/`): unified infra-ops verbs — run `homelab manifest` to discover the surface (each verb tagged read/write). Infra loop: `homelab tf plan|fmt|apply <stack>` (wraps `scripts/tg`; `apply` auto-claims presence + releases on exit, warns out-of-band), `homelab claim|release <kind>:<name>`, `homelab work start|land|clean <topic>` (worktree lifecycle; `land` gates on verification, `--verify-cmd`/`--no-verify`). Kubernetes (v0.2): `homelab k8s status|get|logs|describe|debug|pf|rollout-status <app>` (read; `<app>` defaults to the namespace, target to `deploy/<app>`), `homelab k8s db <app> [--mysql] -- "<SQL>"`, `k8s exec`, `k8s restart`, `k8s rm-pod` (pods/jobs only) — config-mutation kubectl verbs are intentionally absent (Terraform-only). 
Memory (v0.3): `homelab memory recall "<context>"` (semantic search), `memory list|categories|tags|stats|secret`, `memory store|update|delete` — a direct HTTP client to claude-memory that works even when the memory MCP is down. CI/deploy (v0.4): `homelab ci status|watch [commit]` (Woodpecker, repo resolved from cwd), `homelab deploy wait <ns>/<deploy> [--sha]` (image-sha + rollout) — `work land` now auto-watches CI to green. Net/obs (v0.5): `homelab net check <host> [path]` (external-CF vs internal-LB reachability), `dns lookup <name>` (Technitium vs public diff), `metrics query "<promql>"` / `metrics alerts` (Prometheus via LB), `logs query "<logql>" [--since]` (Loki via LB) — endpoint resolution baked in, no port-forward. Usage telemetry (v0.6): every dispatched verb fire-and-forgets a Loki line (`{user,verb}` + exit only, NO args/secrets; opt-out `HOMELAB_TELEMETRY=0`); `homelab usage top [--since][--user]` ranks verb usage across all users — evidence for what to build next, queryable without reading anyone's home. Home Assistant (v0.7): `homelab ha token [--instance sofia|london]` (prints the long-lived API token, resolved live from k8s Secret `openclaw/openclaw-secrets` — use as `curl -H "Authorization: Bearer $(homelab ha token)"`), `homelab ha ssh [--instance sofia|london] -- <cmd>` (run a command on the HA host; deterministic non-interactive ssh, the invoking user's `~/.ssh/id_ed25519`, sofia=`vbarzin@192.168.1.8` default) — entity state/control stays with the `ha` MCP, these cover only what an API-only MCP can't (token + host shell). Full docs: `cli/README.md`. - **Deploy new service**: Use `stacks/<existing-service>/` as template. Create stack, add DNS in tfvars, apply platform then service. - **Fix crashed pods**: Run healthcheck first. Safe to delete evicted/failed pods and CrashLoopBackOff pods with >10 restarts. - **OOMKilled**: Check `kubectl describe limitrange tier-defaults -n <ns>`. Increase `resources.limits.memory` in the stack's main.tf. 
diff --git a/CONTEXT.md b/CONTEXT.md index c9a9d033..2b9bb8b3 100644 --- a/CONTEXT.md +++ b/CONTEXT.md @@ -117,9 +117,17 @@ The bare-metal load-balancer that assigns external IPs to `type=LoadBalancer` Se _Avoid_: calling `.200` "the cluster IP" or assuming all ingress shares one LB IP. **Calico**: -The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred). +The cluster CNI and **NetworkPolicy** engine (also GlobalNetworkPolicy + flow logs; live flow observability via **Goldmane / Whisker**). Egress lockdown follows an **observe-then-enforce** rollout — flow logs build an empirical allowlist, then default-deny egress is enforced per-namespace, tier by tier (wave 1 began at `recruiter-responder`; Tier 0/1/2 deferred). _Avoid_: "firewall" (it's pod-level policy, not a perimeter); conflating a Calico **NetworkPolicy** (enforced in the data path) with a **Kyverno policy** (enforced at admission) — different layers. +**Service identity**: +How a **Service** is named in flow/audit data — its **namespace** is the primary identity (Goldmane stamps it natively, and "one Service ≈ one namespace" holds for ~87 namespaces), refined by an explicit identity label (e.g. `service-identity`) only in the handful of genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). Deliberately NOT a per-Service **ServiceAccount** (deferred — 56% of pods share `default`; revisit only if principal-based enforcement or mTLS is adopted) and NOT a SPIFFE/mesh identity (rejected — attribution-grade audit on a trusted single-tenant cluster doesn't justify a mesh). 
+_Avoid_: equating "service identity" with a workload's **ServiceAccount** (that's the deferred enforcement principal, not the attribution key) or with cryptographic/SPIFFE identity; "Service" here is the domain **Service**, not the K8s `Service` object. + +**Goldmane / Whisker**: +Calico 3.30's OSS flow-observability pair — **Goldmane** aggregates identity-stamped flows (namespace/pod/workload/labels + allow-deny + policy trace) streamed from Felix over gRPC into an in-memory ~60-min ring buffer (no etcd/API writes); **Whisker** is its live web UI. The east-west "who-talks-to-whom" data plane, succeeding raw iptables-`LOG`→journald lines (which carry no identity). Durable history requires emitting Goldmane flows to **Loki**; the in-memory buffer alone is not an audit trail. +_Avoid_: assuming Goldmane persists (it's a ring buffer — lost on restart); expecting a ServiceAccount field in its schema (it carries labels, not SA); confusing it with Cilium **Hubble** (needs the Cilium datapath, unusable on Calico) or **Kiali** (needs an Istio mesh). + ### Storage **proxmox-lvm-encrypted**: @@ -149,7 +157,7 @@ _Avoid_: bare "backup" without saying which copy you mean (a service is "backed **CNPG** / **pg-cluster**: **CNPG** is the CloudNativePG operator; **`pg-cluster`** is the Postgres cluster it manages — the shared Postgres substrate. Backs Tier-1 Terraform state (`pg-cluster-rw.dbaas.svc.cluster.local:5432/terraform_state`) and ~12 application databases, reached through **PgBouncer** (a **critical-path Service**) for connection pooling; app credentials rotate via the `vault-database` ClusterSecretStore. -_Avoid_: "the database" (many DBs share one cluster); the legacy `postgresql.dbaas` Service (no endpoints — dead); conflating the CNPG operator with the `pg-cluster` it manages. 
+_Avoid_: "the database" (many DBs share one cluster); the legacy `postgresql.dbaas` Service for NEW work (it is a live compatibility alias selecting the CNPG primary — authentik's PgBouncer still uses it — but `pg-cluster-rw` is the canonical name); conflating the CNPG operator with the `pg-cluster` it manages. ### Secrets @@ -169,8 +177,24 @@ A user-managed secret committed to a Stack directory as `sealed-*.yaml`. Distinc ### CI/CD **GHA build + Woodpecker deploy**: -The split where Docker images are built+pushed by GitHub Actions and Woodpecker only runs `kubectl set image` on a deploy-only pipeline. Repos that can't fit GHA limits stay on Woodpecker for build too. -_Avoid_: bare "Woodpecker pipeline" — say "build" or "deploy". +The split where every owned image is built+pushed by GitHub Actions and Woodpecker only runs `kubectl set image` on a deploy-only pipeline (ADR-0002). Woodpecker never builds images. +_Avoid_: bare "Woodpecker pipeline" — say "build" or "deploy"; "fallback build" (the in-cluster fallback path was removed by ADR-0002). + +**Canonical repo**: +The Forgejo `viktor/<name>` repo — the only place commits land, workflow files included. Every first-party repo is Forgejo-canonical *except* an explicit set of **GitHub-first repos**. A clone keeps **only** the canonical remote (ADR-0003): the **GitHub mirror** is not a second push target. +_Avoid_: "upstream" (ambiguous); committing anywhere else; keeping both remotes on a clone and hand-pushing to each (the dual-push habit that caused the 2026-06 divergence — ADR-0003). + +**GitHub mirror**: +The GitHub repo a **Canonical repo** push-mirrors to, one-way (Forgejo's `push_mirrors`, `sync_on_commit`), so GitHub Actions can build from it; anything committed on the mirror is silently overwritten by the next sync — and enabling the mirror **force-overwrites** the GitHub side, so a diverged GitHub-only commit must be merged back into Forgejo *before* the mirror is turned on or it is lost. 
+_Avoid_: treating it as a second writable remote; bare "the GitHub repo" without saying mirror. + +**GitHub-first repo**: +The deliberate exception to the **Canonical repo** rule — a repo whose canonical home is GitHub, so it sits outside the mirror policy. Two kinds: third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`), and a first-party repo intentionally kept public on GitHub (`health`). Single GitHub remote, never dual-pushed. +_Avoid_: adding a Forgejo remote "for consistency"; treating one as a **Canonical repo**. + +**Forgejo registry**: +Forgejo's built-in container registry — since ADR-0002 a frozen archive holding one last-known-good tag per **Service**, not a build target; owned images live on ghcr.io. +_Avoid_: "private registry" (collides with the registry VM's pull-through caches); pushing new images to it. **Keel**: The **poll-driven** rollout orchestrator — watches registries for new image tags and rolls the matching Deployments automatically. The actor behind "auto-upgrade" for upstream images, and a redundant net for owned apps (already rolled on push by **Woodpecker deploy**). @@ -192,6 +216,7 @@ A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content - A **proxmox-lvm-encrypted** PVC binds to one Node at a time (RWO) and requires a Service-level backup CronJob; an **NFS volume** is RWX and is backed up at the host level via rsync. - **State tier** and **Namespace tier** are orthogonal — a Tier 0 Stack can deploy a Service into any Namespace tier and vice versa. - A **Service**'s image reaches the cluster via **Woodpecker deploy** (push-driven, on commit) or **Keel** (poll-driven, on a new registry tag); **Diun** only notifies. Operator-managed StatefulSets are rolled by neither. 
+- An owned **Service**'s image is built by GitHub Actions from the **Canonical repo**'s **GitHub mirror** and hosted on ghcr.io (ADR-0002); the **Forgejo registry** keeps only a frozen last-known-good tag per **Service**. - Tier-1 **State tier** state and ~12 app databases share one **CNPG** `pg-cluster`, reached through **PgBouncer**; their credentials rotate via the `vault-database` store. ## Example dialogue @@ -211,3 +236,4 @@ A PoW reverse-proxy issuing a 30-day JWT cookie, used in front of public content - **"secret"** spans Vault entries, K8s Secret objects, **ExternalSecrets**, and **Sealed Secrets**. Always specify which. - **"proxied"** / **"non-proxied"** refer to Cloudflare's CDN posture for a DNS record, _not_ Anubis or forward-auth layering. - **"policy"** spans **Kyverno policy** (admission-time mutate/generate/validate), **Calico NetworkPolicy** (data-path ingress/egress), Vault policy (KV access), and K8s RBAC. Always qualify which engine. +- **"registry"** spans three things: ghcr.io (where owned images live, ADR-0002), the **Forgejo registry** (frozen last-known-good archive), and the registry VM's pull-through caches (read-only proxies of upstream registries). Name which one. diff --git a/cli/README.md b/cli/README.md index 48b83c93..186c1ee5 100644 --- a/cli/README.md +++ b/cli/README.md @@ -1,2 +1,224 @@ -# What is this? -This is a CLI to manipulate files in the terraform repo and commit and push them +# homelab + +`homelab` is the unified, agent-facing CLI for operating this homelab — one +composable, JSON-capable surface for the operations agents run over and over, +discovered progressively at runtime. It is grown **in place** from this +directory (the former `infra-cli`), and the legacy webhook use-cases still work +(see below). + +It encodes *actions*, never *judgment*: methodology (debugging, TDD, review) and +third-party/owned MCP servers (e.g. phpIPAM) are deliberately out of scope. 
+ +## Usage + +``` +homelab <command> [args] +homelab manifest [--json] # list every verb + its read/write tier (discovery entrypoint) +homelab version +``` + +### v0.1 verbs — the infra inner-loop + +| Command | Tier | What it does | +|---|---|---| +| `claim <kind>:<name> --purpose "…"` | write | claim a shared resource on the presence board (wraps `scripts/presence`) | +| `release <kind>:<name>` | write | release a presence claim | +| `tf plan <stack>` | read | `scripts/tg plan` for a stack (resolved from cwd) | +| `tf validate <stack>` | read | `scripts/tg validate` | +| `tf fmt <stack>` | read | `terraform fmt -recursive` on the stack | +| `tf force-unlock <stack> <lock-id>` | write | release a stuck state lock | +| `tf apply <stack>` | write | `scripts/tg apply` — auto-claims `stack:<name>`, always releases, warns it's out-of-band | +| `work start <topic>` | write | create `.worktrees/<topic>` on `<user>/<topic>` off `<remote>/master`; enter with native `EnterWorktree` | +| `work land [--verify-cmd "…"] [--no-verify]` | write | merge master in → verify → push `HEAD:master` (non-ff retry; PR fallback) | +| `work clean <topic>` | write | remove a task's worktree + branch (run from the main checkout) | + +### v0.2 verbs — Kubernetes + +Built on an **app→namespace→pod resolver**: `<app>` defaults to the namespace +(most namespaces hold one app); the target defaults to `deploy/<app>` and lets +kubectl resolve the pod. Override with `-n`/`--pod`/`-c`/`-l`/`--tty`. Uses the +ambient kubeconfig. 
+ +| Command | Tier | What it does | +|---|---|---| +| `k8s status [ns]` | read | pods (wide) + recent non-Normal events (`-A` if no ns) | +| `k8s get <ns> <resource> […]` | read | `kubectl -n <ns> get …` passthrough | +| `k8s logs <app>` | read | logs for `deploy/<app>` (`--tail` default 200; `-c`/`--previous`/`--since`/`-l`) | +| `k8s describe <app> [resource]` | read | describe the deployment (or an explicit resource) | +| `k8s debug <app>` | read | one-shot triage: pods + workloads + describe + recent logs + events | +| `k8s pf <app> <local:remote> [target]` | read | port-forward to `svc/<app>` (or an explicit target) | +| `k8s rollout-status <app>` | read | `rollout status deploy/<app>` | +| `k8s db <app> [--mysql] [--db N] -- "<SQL>"` | write | exec into the dbaas DB (PG `pg-cluster-rw`, or MySQL with env-password wrapper) | +| `k8s exec <app> [--tty] -- <cmd>` | write | exec in the app's pod | +| `k8s restart <app>` | write | `rollout restart deploy/<app>` then wait for status | +| `k8s rm-pod <name> -n <ns> [--job] [--force]` | write | delete a stuck **pod/job only** | + +Config-mutation verbs (`apply`/`edit`/`patch`/`scale`/`create`) are intentionally +**not** exposed — they stay raw `kubectl`, per the Terraform-only policy. + +`tf` resolves the stack dir by walking up from cwd to the infra root and +delegates to `scripts/tg` (which owns state decrypt/encrypt, the Vault lock, and +the ingress auth-comment check). git-crypt filter flags are auto-injected on git +operations in the encrypted infra repo. + +**`work land` refuses to push when it cannot verify** (no `--verify-cmd` and no +auto-detected suite) unless you pass `--no-verify` — landing to master unverified +must be deliberate. After pushing it **watches CI to green** (`ci watch` on the +landed commit) and fails if the pipeline does; pass `--no-ci-watch` to skip. 
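The refusal rule for `work land` described above can be sketched as a small decision function. This is a minimal illustration, not the CLI's actual code: `landOpts`, `chooseVerify`, and the precedence order (`--no-verify` first, then an explicit `--verify-cmd`, then an auto-detected suite) are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// landOpts is a hypothetical stand-in for the flags `work land` accepts.
type landOpts struct {
	verifyCmd     string // value of --verify-cmd, "" if not given
	detectedSuite string // auto-detected test command, "" if none was found
	noVerify      bool   // true when --no-verify was passed
}

var errUnverified = errors.New("refusing to land: nothing to verify with (pass --verify-cmd or --no-verify)")

// chooseVerify returns the command to run before pushing, or "" when the
// caller explicitly opted out. It returns an error when there is nothing to
// verify with and no opt-out was given, so an unverified push is never silent.
// The precedence order here is an assumption, not taken from the source.
func chooseVerify(o landOpts) (string, error) {
	switch {
	case o.noVerify:
		return "", nil
	case o.verifyCmd != "":
		return o.verifyCmd, nil
	case o.detectedSuite != "":
		return o.detectedSuite, nil
	default:
		return "", errUnverified
	}
}

func main() {
	cmd, err := chooseVerify(landOpts{detectedSuite: "go test ./..."})
	fmt.Println(cmd, err)
}
```

Returning an error in the default case, rather than an empty command, keeps "verification skipped on purpose" and "verification impossible" distinguishable at the call site.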
+ +Tiers are recorded per verb so a future PreToolUse classifier can auto-allow +reads / prompt writes; v0.1 allows everything and relies on existing gates +(permission mode, presence claims, plan approval). + +### v0.3 verbs — memory + +A thin HTTP client over the **claude-memory** service (the same backend the +memory MCP wraps), authed with `CLAUDE_MEMORY_API_KEY` against +`CLAUDE_MEMORY_API_URL` (the env the hooks already set; defaults to the +ingress). Because it hits the HTTP API directly, it **works even when the MCP +frontend is down**. + +| Command | Tier | What it does | +|---|---|---| +| `memory recall "<context>" [--query --category --sort --limit]` | read | semantic search (server-side ranking) — the navigate workhorse | +| `memory list [--category --tag --limit]` | read | recent memories | +| `memory categories` / `memory tags` / `memory stats` | read | enumerate the store | +| `memory secret <id>` | read | reveal a sensitive memory's content | +| `memory store "<content>" [--category --tags --keywords --importance --sensitive]` | write | store a memory | +| `memory update <id> [--content --tags --importance]` | write | edit a memory | +| `memory delete <id>` | write | delete a memory | + +All read/write paths are validated against the live API (incl. a +store→recall→delete round-trip). This gives full data-plane parity with the MCP; +the eventual deprecation (rewiring the per-prompt auto-recall + auto-learn hooks +to the CLI, then uninstalling the MCP) is a **separate, deliberate follow-up** — +see `docs/adr/0008`. + +### v0.4 verbs — ci / deploy + +Watch what you trigger, without hand-rolling Woodpecker/kubectl polling. `ci` +talks to the Woodpecker API (token from `WOODPECKER_TOKEN` or Vault +`secret/ci/global`) via the internal Traefik LB, resolving the repo from the cwd +remote, with retries that ride Woodpecker's intermittent empty responses. 
+ +| Command | Tier | What it does | +|---|---|---| +| `ci status [commit]` | read | pipeline status for HEAD (or a commit) | +| `ci watch [commit]` | read | poll the pipeline to terminal; exit non-zero on failure | +| `deploy wait <ns>/<deploy> [--sha SHA]` | read | wait for the deployment image to match the sha, *then* rollout status (rollout status alone lies on the old ReplicaSet) | + +`work land` now calls `ci watch` on the landed commit automatically (skip with +`--no-ci-watch`), closing the v0.1 "doesn't wait for CI" gap. `ci logs` (failing +step) is deferred to v0.4.1 — Woodpecker's per-pipeline detail/log endpoints were +the least reliable; `status`/`watch` use the list endpoint that works. + +### v0.5 verbs — net / dns / metrics / logs + +Reachability + observability probes. Their value is *endpoint resolution* — the +non-obvious "which host, public or LB, what auth, what URL shape" reasoning you'd +otherwise re-derive every time — not the HTTP call itself. All reach internal +ingresses through the Traefik LB (the Go form of `curl --resolve host:443:10.0.20.203`). + +| Command | Tier | What it does | +|---|---|---| +| `net check <host> [path]` | read | probes the host two ways — external (public DNS → Cloudflare) vs internal (Traefik LB) — with status + latency, so you can tell *where* a break is (CF? app? the LB path?) | +| `dns lookup <name> [type]` | read | resolves via Technitium (`10.0.20.201`) and public (`1.1.1.1`), diffed — surfaces split-horizon vs propagation gaps | +| `metrics query "<promql>"` | read | Prometheus instant query (`prometheus-query.viktorbarzin.lan`); prints `value {labels}` or `--json` | +| `metrics alerts` | read | currently-firing alerts (via the synthetic `ALERTS` series — the query frontend has no `/api/v1/alerts`) | +| `logs query "<logql>" [--since 1h] [--limit N]` | read | Loki range query (`loki.viktorbarzin.lan`); prints log lines or `--json` | + +Quote the PromQL/LogQL. 
These hit auth-free internal ingresses — no port-forward, +no kubectl. (In-cluster-only endpoints like Alertmanager stay out of scope; the +firing set is reachable via `ALERTS` instead.) + +### v0.6 — usage telemetry (`usage top`) + +Makes "which verbs are actually used, by everyone" a query instead of a guess — +so adding the *next* verb is evidence-driven, not shaped by one person's habits. + +Every dispatched verb emits one fire-and-forget Loki line: `{job, user, verb}` +labels + `exit=N ver=X` — **only the verb path and exit code, never args, paths, +flags, or secrets.** It's best-effort (tight timeout, errors swallowed, never +affects the command) and opt-out via `HOMELAB_TELEMETRY=0`. Because the sink is +the shared Loki, aggregate usage is queryable **without reading anyone's home** — +the privacy-preserving answer to "what does the team use." + +| Command | Tier | What it does | +|---|---|---| +| `usage top [--since 30d] [--user U] [--json]` | read | rank verbs by invocation count across all users (or one), via `sum by (verb) (count_over_time({job="homelab-usage"}[…]))` | + +### v0.7 verbs — Home Assistant + +Cover exactly the two things the `ha` **MCP server can't**: resolving the +long-lived API token out of the cluster, and SSH to the HA host for host-level +work (config files, docker, add-ons). Entity state and control (`turn_on`, +`get_state`, services) stay with the MCP — *actions an MCP already encodes are +out of scope* (see top of this doc). The value here is the same as `net`/`dns`: +the non-obvious *which secret, which host, which key, which flags* you'd +otherwise re-derive every session — agents were hand-rolling a +`kubectl | base64 | jq` token pipeline and a bespoke `ssh -o …` invocation on +every run because the existing `home-assistant-sofia.py` needs an env var set +and a cwd-relative path, neither of which holds in an arbitrary session. 
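The hand-rolled `kubectl | base64 | jq` pipeline mentioned above boils down to reading one key from a Secret's `data` map and base64-decoding it. A minimal sketch of that step in Go follows, assuming the input is the JSON printed by `kubectl get secret … -o json`; the function name `tokenFromSecret` and the key name `sofia` are illustrative, not taken from the CLI's source.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

// tokenFromSecret extracts one key from a Kubernetes Secret serialized as
// JSON and returns its base64-decoded value with surrounding whitespace
// trimmed, so the result composes cleanly inside $(…).
func tokenFromSecret(secretJSON []byte, key string) (string, error) {
	var s struct {
		Data map[string]string `json:"data"`
	}
	if err := json.Unmarshal(secretJSON, &s); err != nil {
		return "", fmt.Errorf("parse secret: %w", err)
	}
	enc, ok := s.Data[key]
	if !ok {
		return "", fmt.Errorf("secret has no key %q", key)
	}
	raw, err := base64.StdEncoding.DecodeString(enc)
	if err != nil {
		return "", fmt.Errorf("decode key %q: %w", key, err)
	}
	return strings.TrimSpace(string(raw)), nil
}

func main() {
	// Hypothetical Secret payload; "YWJjMTIz" is base64 for a placeholder token.
	secret := []byte(`{"data":{"sofia":"YWJjMTIz"}}`)
	tok, err := tokenFromSecret(secret, "sofia")
	fmt.Println(tok, err)
}
```

Keeping the decode as a pure function over bytes means it can be tested without a cluster; only the `kubectl` call that produces the JSON needs the ambient kubeconfig.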
+ +| Command | Tier | What it does | +|---|---|---| +| `ha token [--instance sofia\|london]` | read | print the long-lived HA API token, resolved live from the dedicated k8s Secret `openclaw/ha-tokens` (key per instance) via the ambient kubeconfig — no pre-set env var. Use as `curl -H "Authorization: Bearer $(homelab ha token)" …`. The secret is a least-privilege carve-out (`stacks/openclaw/ha_tokens.tf`): the `Home Server Admins` group can read *just* it, so non-admin operators get the HA token without the rest of `skill_secrets` (slack webhook, uptime-kuma password) | +| `ha ssh [--instance sofia\|london] [-i KEY] -- <cmd>` | write | run `<cmd>` on the HA host over ssh with deterministic non-interactive flags (explicit key = the invoking user's `~/.ssh/id_ed25519`, no user ssh-config, no known_hosts prompt). sofia (`vbarzin@192.168.1.8`) is reachable from the devvm LAN; london is documented but generally remote | + +`--instance` defaults to **sofia** (the devvm shares the Sofia LAN). `ha token` +prints the bare token to stdout so it composes in `$(…)`; it's read-tier like +`memory secret`. `ha ssh` resolves the *invoking user's* key, so it's per-user, +not tied to whoever first wrote the workflow (the user's key must be enrolled on +the HA host). + +### v0.8 verbs — browser (headful anti-bot automation) + +Drive the cluster's **headful** Chrome (`chrome-service`, real Chrome under Xvfb) +from the devvm over CDP, for sites that detect and block headless automation. The +headless `@playwright/mcp` browser can *load* such a site and fill its forms, but +the gated action (submit/login) silently fails — the motivating case was the +Stirling Ackroyd Fixflo tenant portal, whose pre-submit check returned +`net::ERR_FILE_NOT_FOUND` and hung. This path connects via `connect_over_cdp`, +injects the same `stealth.js` the in-cluster callers use, and submits first try. 
+ +The command owns only the *mechanics* (port-forward, stealth, lifecycle); the +agent supplies the Playwright script — judgment stays out of the CLI. + +| Command | Tier | What it does | +|---|---|---| +| `browser run <script.js> [--url U] [--shared-context] [--keep-open] [--port N] [--timeout S]` | write | port-forward `svc/chrome-service:9222`, assert it's a real (non-headless) Chrome via `/json/version`, `connect_over_cdp`, `addInitScript(stealth.js)`, then run the script with `page`/`context`/`browser`/`log` in scope (top-level await ok; return a value to print it). Always tears the forward down. | +| `browser open <url> [--shared-context] [--timeout S]` | write | open `<url>` headful and print title + visible text + a screenshot path — a quick check. | +| `browser --help` | read | when-to-use signature + the error-code cheat-sheet (`ERR_FILE_NOT_FOUND` = automation-layer intercept, not egress; `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED` = real egress; one endpoint 500 while siblings 200 = bot rejection). | + +Default context is a **fresh incognito** one (closed on exit) — safe for the +shared browser and concurrent callers (e.g. tripit's fare scrape); `--shared-context` +reuses the warmed persistent profile when a pre-logged-in session is needed. +`port-forward` tunnels API-server→pod, so it bypasses the `:9222` NetworkPolicy +that gates in-cluster callers — no namespace label needed. The node CDP client is +pinned to **`playwright-core@1.48.2`** to match the chrome-service image minor +(Chromium 130; protocol changes between minors) and is installed once, lazily, +into `~/.cache/homelab/browser-client/` (no per-user setup). Because the client +runs on the devvm, `setInputFiles` streams local files to the remote browser over +CDP — no `chmod`/staging-dir workaround. See `docs/architecture/chrome-service.md` +and `docs/adr/0013`. 
+ +## Build / install + +Built from source to `/usr/local/bin/homelab` during devvm provisioning +(`scripts/workstation/setup-devvm.sh`, the `t3-dispatch` pattern); version is +stamped from `cli/VERSION` via ldflags. Manual build: + +``` +cd cli && go build -ldflags "-X main.version=$(cat VERSION)" -o /usr/local/bin/homelab . +go test ./... +``` + +## Legacy webhook use-cases (preserved) + +This binary is also the in-cluster `infra-cli` image. Invocations starting with +`-use-case=<vpn|setup-openwrt-dns|add-email-alias|...>` fall through to the +original flag-based path unchanged, so the webhook handler is unaffected. + +## Design + +See `infra/docs/adr/0004`–`0013` for the architecture decisions. diff --git a/cli/VERSION b/cli/VERSION new file mode 100644 index 00000000..85f7059b --- /dev/null +++ b/cli/VERSION @@ -0,0 +1 @@ +v0.8.1 diff --git a/cli/browser.go b/cli/browser.go new file mode 100644 index 00000000..39b6b0a0 --- /dev/null +++ b/cli/browser.go @@ -0,0 +1,388 @@ +package main + +import ( + _ "embed" + "encoding/json" + "fmt" + "io" + "net" + "net/http" + "os" + "os/exec" + "os/signal" + "path/filepath" + "strconv" + "strings" + "sync" + "syscall" + "time" +) + +// playwrightVersion pins the node CDP client to the chrome-service image minor +// (mcr.microsoft.com/playwright:v1.48.0-noble → Chromium 130). connect_over_cdp +// speaks the browser's CDP, so the client minor must track the server minor; +// see docs/architecture/chrome-service.md "Image pin". +const playwrightVersion = "1.48.2" + +// defaultBrowserTimeout is how long (seconds) to wait for the port-forwarded CDP +// endpoint to become ready before giving up. +const defaultBrowserTimeout = 60 + +const ( + chromeServiceNamespace = "chrome-service" + chromeServiceName = "chrome-service" + chromeServiceCDPPort = 9222 +) + +// stealthJS is vendored verbatim from stacks/chrome-service/files/stealth.js (the +// source of truth the in-cluster callers use). 
TestStealthJSEmbeddedMatchesCanonical +// guards against drift. +// +//go:embed browser_stealth.js +var stealthJS string + +// runnerJS is the node wrapper that connects to the port-forwarded CDP endpoint, +// installs the stealth init script, and runs the user's Playwright script. +// +//go:embed browser_runner.js +var runnerJS string + +// browserOpts is the parsed form of `homelab browser run|open` arguments. +type browserOpts struct { + mode string // "run" | "open" + script string // path to the user Playwright script (run mode) + url string // initial URL (run: optional; open: required positional) + sharedCtx bool // use the warmed persistent profile instead of a fresh context + keepOpen bool // leave the created context/pages open on exit + port int // explicit local port for the forward (0 = auto) + timeout int // CDP readiness timeout, seconds + help bool +} + +// parseBrowserArgs parses the args after `browser run` / `browser open`. +func parseBrowserArgs(mode string, args []string) (browserOpts, error) { + o := browserOpts{mode: mode, timeout: defaultBrowserTimeout} + var positionals []string + atoi := func(s, flag string) (int, error) { + n, err := strconv.Atoi(s) + if err != nil { + return 0, fmt.Errorf("%s expects an integer, got %q", flag, s) + } + return n, nil + } + for i := 0; i < len(args); i++ { + a := args[i] + switch { + case a == "-h" || a == "--help": + o.help = true + case a == "--shared-context": + o.sharedCtx = true + case a == "--keep-open": + o.keepOpen = true + case a == "--url": + if i+1 < len(args) { + o.url = args[i+1] + i++ + } + case strings.HasPrefix(a, "--url="): + o.url = strings.TrimPrefix(a, "--url=") + case a == "--port": + if i+1 < len(args) { + n, err := atoi(args[i+1], "--port") + if err != nil { + return o, err + } + o.port = n + i++ + } + case strings.HasPrefix(a, "--port="): + n, err := atoi(strings.TrimPrefix(a, "--port="), "--port") + if err != nil { + return o, err + } + o.port = n + case a == "--timeout": + if i+1 
< len(args) { + n, err := atoi(args[i+1], "--timeout") + if err != nil { + return o, err + } + o.timeout = n + i++ + } + case strings.HasPrefix(a, "--timeout="): + n, err := atoi(strings.TrimPrefix(a, "--timeout="), "--timeout") + if err != nil { + return o, err + } + o.timeout = n + case strings.HasPrefix(a, "-"): + return o, fmt.Errorf("unknown flag %q (try: homelab browser --help)", a) + default: + positionals = append(positionals, a) + } + } + if o.help { + return o, nil + } + switch mode { + case "run": + if len(positionals) == 0 { + return o, fmt.Errorf("usage: homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S]") + } + o.script = positionals[0] + case "open": + if len(positionals) == 0 { + return o, fmt.Errorf("usage: homelab browser open <url> [--shared-context] [--timeout S]") + } + o.url = positionals[0] + } + return o, nil +} + +// cdpHealthy parses a CDP /json/version body and reports whether the endpoint is +// a real (non-headless) Chrome — the entire reason chrome-service exists. +func cdpHealthy(jsonBody []byte) (browser string, healthy bool, err error) { + var v struct { + Browser string `json:"Browser"` + UserAgent string `json:"User-Agent"` + } + if e := json.Unmarshal(jsonBody, &v); e != nil { + return "", false, fmt.Errorf("parse /json/version: %w", e) + } + if v.Browser == "" { + return "", false, fmt.Errorf("/json/version had no Browser field") + } + healthy = strings.HasPrefix(v.Browser, "Chrome/") && + !strings.Contains(v.Browser, "Headless") && + !strings.Contains(v.UserAgent, "Headless") + return v.Browser, healthy, nil +} + +// buildPortForwardArgs is the kubectl invocation that exposes chrome-service's +// CDP locally. port-forward tunnels API-server→pod, so it bypasses the :9222 +// NetworkPolicy that gates in-cluster callers. 
+func buildPortForwardArgs(localPort int) []string { + return []string{"-n", chromeServiceNamespace, "port-forward", + "svc/" + chromeServiceName, fmt.Sprintf("%d:%d", localPort, chromeServiceCDPPort)} +} + +// browserClientPackageJSON is the auto-managed manifest for the pinned node CDP +// client kept under the user cache dir. +func browserClientPackageJSON() string { + return fmt.Sprintf(`{ + "name": "homelab-browser-client", + "private": true, + "description": "Pinned CDP client for 'homelab browser' — auto-managed, do not edit.", + "dependencies": { + "playwright-core": "%s" + } +} +`, playwrightVersion) +} + +// freePort asks the kernel for an unused ephemeral TCP port. +func freePort() (int, error) { + l, err := net.Listen("tcp", "127.0.0.1:0") + if err != nil { + return 0, err + } + defer l.Close() + return l.Addr().(*net.TCPAddr).Port, nil +} + +// browserClientDir is where the pinned node client + managed runner files live. +func browserClientDir() (string, error) { + cache, err := os.UserCacheDir() + if err != nil || cache == "" { + home, herr := os.UserHomeDir() + if herr != nil { + return "", fmt.Errorf("locate cache dir: %v / %v", err, herr) + } + cache = filepath.Join(home, ".cache") + } + return filepath.Join(cache, "homelab", "browser-client"), nil +} + +// installedPlaywrightVersion reads the version of the playwright-core already +// installed in dir, or "" if absent/unreadable. +func installedPlaywrightVersion(dir string) string { + b, err := os.ReadFile(filepath.Join(dir, "node_modules", "playwright-core", "package.json")) + if err != nil { + return "" + } + var v struct { + Version string `json:"version"` + } + if json.Unmarshal(b, &v) != nil { + return "" + } + return v.Version +} + +// ensureBrowserClient writes the managed runner/stealth/package files into dir +// and lazily installs the pinned playwright-core (only when missing/mismatched), +// so no per-user setup is needed and the client tracks the binary version. 
+func ensureBrowserClient(dir string) error { + if err := os.MkdirAll(dir, 0o755); err != nil { + return err + } + files := map[string]string{ + "package.json": browserClientPackageJSON(), + "browser_runner.js": runnerJS, + "stealth.js": stealthJS, + } + for name, content := range files { + if err := os.WriteFile(filepath.Join(dir, name), []byte(content), 0o644); err != nil { + return err + } + } + if installedPlaywrightVersion(dir) == playwrightVersion { + return nil + } + fmt.Fprintf(os.Stderr, "homelab browser: installing pinned playwright-core@%s (one-time, ~a few seconds)…\n", playwrightVersion) + cmd := exec.Command("npm", "install", "--no-audit", "--no-fund", "--silent") + cmd.Dir = dir + cmd.Stdout = os.Stderr + cmd.Stderr = os.Stderr + if err := cmd.Run(); err != nil { + return fmt.Errorf("npm install playwright-core@%s in %s: %w (is node/npm installed?)", playwrightVersion, dir, err) + } + if got := installedPlaywrightVersion(dir); got != playwrightVersion { + return fmt.Errorf("playwright-core install mismatch in %s: want %s, got %q", dir, playwrightVersion, got) + } + return nil +} + +// waitForCDP polls the local CDP endpoint until it answers as a healthy +// (non-headless) Chrome, or the timeout elapses. 
+func waitForCDP(cdpURL string, timeout time.Duration) (string, error) { + deadline := time.Now().Add(timeout) + client := &http.Client{Timeout: 3 * time.Second} + var lastErr error + for time.Now().Before(deadline) { + resp, err := client.Get(cdpURL + "/json/version") + if err != nil { + lastErr = err + time.Sleep(300 * time.Millisecond) + continue + } + body, _ := io.ReadAll(resp.Body) + resp.Body.Close() + browser, healthy, herr := cdpHealthy(body) + if herr != nil { + lastErr = herr + time.Sleep(300 * time.Millisecond) + continue + } + if !healthy { + return browser, fmt.Errorf("CDP reports %q — expected a non-headless Chrome (wrong target?)", browser) + } + return browser, nil + } + if lastErr == nil { + lastErr = fmt.Errorf("timed out after %s", timeout) + } + return "", lastErr +} + +// runBrowser is the orchestration: pick a port, ensure the pinned client, start +// (and ALWAYS tear down) a CDP port-forward, wait for readiness, then run node. +func runBrowser(o browserOpts) error { + port := o.port + if port == 0 { + p, err := freePort() + if err != nil { + return fmt.Errorf("pick local port: %w", err) + } + port = p + } + + dir, err := browserClientDir() + if err != nil { + return err + } + if err := ensureBrowserClient(dir); err != nil { + return err + } + + // Start the forward in its own process group so the whole tree dies on cleanup. + pf := exec.Command("kubectl", buildPortForwardArgs(port)...) + pf.SysProcAttr = &syscall.SysProcAttr{Setpgid: true} + var pfLog strings.Builder + pf.Stdout = &pfLog + pf.Stderr = &pfLog + if err := pf.Start(); err != nil { + return fmt.Errorf("start kubectl port-forward (kubeconfig set?): %w", err) + } + + var once sync.Once + teardown := func() { + once.Do(func() { + if pf.Process != nil { + _ = syscall.Kill(-pf.Process.Pid, syscall.SIGKILL) + } + _ = pf.Wait() + }) + } + defer teardown() + + // Tear down on Ctrl-C / SIGTERM too, then exit non-zero. 
+ sigCh := make(chan os.Signal, 1) + signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM) + defer signal.Stop(sigCh) + go func() { + if _, ok := <-sigCh; ok { + teardown() + os.Exit(130) + } + }() + + cdpURL := fmt.Sprintf("http://127.0.0.1:%d", port) + browser, err := waitForCDP(cdpURL, time.Duration(o.timeout)*time.Second) + if err != nil { + return fmt.Errorf("chrome-service CDP not ready on %s: %w\n--- port-forward log ---\n%s", cdpURL, err, pfLog.String()) + } + fmt.Fprintf(os.Stderr, "homelab browser: connected to %s via %s\n", browser, cdpURL) + + return runBrowserNode(dir, cdpURL, o) +} + +// runBrowserNode invokes the managed node runner with inputs passed via env. +func runBrowserNode(dir, cdpURL string, o browserOpts) error { + env := append(os.Environ(), + "HOMELAB_CDP_URL="+cdpURL, + "HOMELAB_BROWSER_MODE="+o.mode, + "HOMELAB_STEALTH_PATH="+filepath.Join(dir, "stealth.js"), + "NODE_PATH="+filepath.Join(dir, "node_modules"), + ) + if o.url != "" { + env = append(env, "HOMELAB_BROWSER_URL="+o.url) + } + if o.script != "" { + abs, err := filepath.Abs(o.script) + if err != nil { + return err + } + if _, err := os.Stat(abs); err != nil { + return fmt.Errorf("script %s: %w", o.script, err) + } + env = append(env, "HOMELAB_BROWSER_SCRIPT="+abs) + } + if o.sharedCtx { + env = append(env, "HOMELAB_BROWSER_SHARED=1") + } + if o.keepOpen { + env = append(env, "HOMELAB_BROWSER_KEEP_OPEN=1") + } + if o.mode == "open" { + shot := filepath.Join(os.TempDir(), fmt.Sprintf("homelab-browser-%d.png", os.Getpid())) + env = append(env, "HOMELAB_BROWSER_SCREENSHOT="+shot) + } + cmd := exec.Command("node", filepath.Join(dir, "browser_runner.js")) + cmd.Env = env + cmd.Stdout = os.Stdout + cmd.Stderr = os.Stderr + cmd.Stdin = os.Stdin + return cmd.Run() +} diff --git a/cli/browser_runner.js b/cli/browser_runner.js new file mode 100644 index 00000000..24a2db6b --- /dev/null +++ b/cli/browser_runner.js @@ -0,0 +1,106 @@ +// homelab browser — node CDP runner (auto-managed; 
regenerated each run from the +// homelab binary — DO NOT EDIT here). Connects to the port-forwarded +// chrome-service CDP endpoint, installs the stealth init script, then runs the +// user's Playwright script (run mode) or opens a URL (open mode). All inputs +// arrive via HOMELAB_* env vars set by the Go CLI. +'use strict'; +const fs = require('fs'); +const { chromium } = require('playwright-core'); + +async function main() { + const cdpURL = process.env.HOMELAB_CDP_URL; + if (!cdpURL) throw new Error('HOMELAB_CDP_URL not set'); + const mode = process.env.HOMELAB_BROWSER_MODE || 'run'; + const stealthPath = process.env.HOMELAB_STEALTH_PATH || ''; + const initURL = process.env.HOMELAB_BROWSER_URL || ''; + const scriptPath = process.env.HOMELAB_BROWSER_SCRIPT || ''; + const shared = process.env.HOMELAB_BROWSER_SHARED === '1'; + const keepOpen = process.env.HOMELAB_BROWSER_KEEP_OPEN === '1'; + const screenshotPath = process.env.HOMELAB_BROWSER_SCREENSHOT || ''; + + const browser = await chromium.connectOverCDP(cdpURL); + + // Fresh isolated context by default (safe for the shared browser + concurrent + // callers); --shared-context reuses the warmed persistent profile. 
+ let context; + let createdContext = false; + if (shared) { + const existing = browser.contexts(); + if (existing.length) { + context = existing[0]; + } else { + context = await browser.newContext(); + createdContext = true; + } + } else { + context = await browser.newContext(); + createdContext = true; + } + + if (stealthPath) { + const stealth = fs.readFileSync(stealthPath, 'utf8'); + if (stealth.trim()) await context.addInitScript(stealth); + } + + const page = await context.newPage(); + const log = (...a) => console.error('[browser]', ...a); + + let exitCode = 0; + try { + if (initURL) { + await page.goto(initURL, { waitUntil: 'domcontentloaded' }); + } + if (mode === 'open') { + console.log('url: ' + page.url()); + console.log('title: ' + (await page.title())); + const text = (await page.evaluate(() => (document.body ? document.body.innerText : ''))).trim(); + console.log('--- visible text (truncated to 4000 chars) ---'); + console.log(text.slice(0, 4000)); + if (screenshotPath) { + await page.screenshot({ path: screenshotPath, fullPage: true }); + console.log('screenshot: ' + screenshotPath); + } + } else { + if (!scriptPath) throw new Error('run mode requires HOMELAB_BROWSER_SCRIPT'); + const src = fs.readFileSync(scriptPath, 'utf8'); + // Run the user's source with page/context/browser/log in lexical scope. + // AsyncFunction body permits top-level await. + const AsyncFunction = Object.getPrototypeOf(async () => {}).constructor; + const fn = new AsyncFunction('page', 'context', 'browser', 'log', src); + const result = await fn(page, context, browser, log); + if (result !== undefined) { + let out; + try { + out = typeof result === 'string' ? result : JSON.stringify(result, null, 2); + } catch (_) { + out = String(result); + } + console.log(out); + } + } + } catch (e) { + console.error('homelab browser: script error:', e && e.stack ? 
e.stack : e); + exitCode = 1; + } finally { + if (!keepOpen) { + try { + // Close only what we created; never tear down the shared persistent context. + if (createdContext) { + await context.close(); + } else { + await page.close(); + } + } catch (_) { /* ignore */ } + } + // Disconnect from the CDP endpoint; this does NOT kill the remote browser. + try { + await browser.close(); + } catch (_) { /* ignore */ } + } + process.exit(exitCode); +} + +main().catch((e) => { + console.error('homelab browser: fatal:', e && e.stack ? e.stack : e); + process.exit(1); +}); diff --git a/cli/browser_stealth.js b/cli/browser_stealth.js new file mode 100644 index 00000000..dfae98a8 --- /dev/null +++ b/cli/browser_stealth.js @@ -0,0 +1,54 @@ +// Minimal stealth init script for Playwright-driven Chromium. +// Vendored from puppeteer-extra-plugin-stealth/evasions/* (MIT) — covers: +// webdriver, chrome.runtime, navigator.plugins, navigator.languages, +// Permissions.query, WebGL getParameter (vendor + renderer spoof). +// Run via context.add_init_script() so it executes before any page script. +(() => { + // navigator.webdriver — most common detection, removed entirely. + Object.defineProperty(Navigator.prototype, 'webdriver', { get: () => undefined }); + + // window.chrome.runtime — many sites check that real Chrome exposes this. + if (!window.chrome) window.chrome = {}; + window.chrome.runtime = window.chrome.runtime || {}; + + // navigator.plugins — headless reports zero; spoof a plausible PDF viewer. + Object.defineProperty(navigator, 'plugins', { + get: () => [{ name: 'Chrome PDF Plugin' }, { name: 'Chrome PDF Viewer' }, { name: 'Native Client' }], + }); + + // navigator.languages — headless returns empty array. + Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] }); + + // Permissions.query — headless returns 'denied' for notifications instead of 'default'. 
+ const origQuery = window.navigator.permissions && window.navigator.permissions.query; + if (origQuery) { + window.navigator.permissions.query = (parameters) => + parameters && parameters.name === 'notifications' + ? Promise.resolve({ state: Notification.permission }) + : origQuery(parameters); + } + + // WebGL getParameter — spoof vendor + renderer strings to a real GPU. + const spoofGl = (proto) => { + if (!proto) return; + const orig = proto.getParameter; + proto.getParameter = function (parameter) { + if (parameter === 37445) return 'Intel Inc.'; // UNMASKED_VENDOR_WEBGL + if (parameter === 37446) return 'Intel Iris OpenGL Engine'; // UNMASKED_RENDERER_WEBGL + return orig.apply(this, arguments); + }; + }; + spoofGl(window.WebGLRenderingContext && window.WebGLRenderingContext.prototype); + spoofGl(window.WebGL2RenderingContext && window.WebGL2RenderingContext.prototype); + + // disable-devtool.js (theajack/disable-devtool) auto-inits via a script + // tag with `disable-devtool-auto`. Its Performance detector trips under + // Playwright (CDP adds console.log latency vs console.table) and the + // redirect URL is hard-coded — for hmembeds that's google.com. + // Hide the auto-init marker so the library's IIFE exits early. + const origQS = Document.prototype.querySelector; + Document.prototype.querySelector = function (sel) { + if (typeof sel === 'string' && sel.indexOf('disable-devtool-auto') !== -1) return null; + return origQS.apply(this, arguments); + }; +})(); diff --git a/cli/cmd_browser.go b/cli/cmd_browser.go new file mode 100644 index 00000000..4263e4d0 --- /dev/null +++ b/cli/cmd_browser.go @@ -0,0 +1,117 @@ +package main + +import "fmt" + +// browser verbs drive the cluster's HEADFUL Chrome (ns chrome-service) over CDP +// from outside the cluster, for sites that detect/block headless automation. +// The headless @playwright/mcp browser can load such sites but their gated +// actions (submit/login) silently fail; this path submits first try. 
Mechanics +// only — the agent supplies the Playwright script. See docs/adr/0013. + +func browserCommands() []Command { + return []Command{ + {Path: []string{"browser"}, Tier: TierRead, + Summary: "headful cluster-Chrome automation for anti-bot sites (run `browser --help`)", Run: browserTopHelp}, + {Path: []string{"browser", "run"}, Tier: TierWrite, + Summary: "run a Playwright script against headful cluster Chrome: browser run <script.js> [--url U] [--shared-context]", Run: browserRun}, + {Path: []string{"browser", "open"}, Tier: TierWrite, + Summary: "open a URL in headful cluster Chrome; print title + text + screenshot: browser open <url>", Run: browserOpen}, + } +} + +func browserTopHelp([]string) error { + fmt.Print(browserHelp()) + return nil +} + +func browserRun(args []string) error { + o, err := parseBrowserArgs("run", args) + if err != nil { + return err + } + if o.help { + fmt.Print(browserHelp()) + return nil + } + return runBrowser(o) +} + +func browserOpen(args []string) error { + o, err := parseBrowserArgs("open", args) + if err != nil { + return err + } + if o.help { + fmt.Print(browserHelp()) + return nil + } + return runBrowser(o) +} + +// browserHelp carries the discoverability payload: WHEN to reach for this, and +// the diagnostic cheat-sheet that lets the agent self-correct instead of +// retrying a deterministic form blind (the failure mode that motivated this). +func browserHelp() string { + return `homelab browser — drive the cluster's HEADFUL Chrome (anti-bot) over CDP + +The shared chrome-service (ns chrome-service) runs a REAL, headed Chrome under +Xvfb. This connects to it via a port-forward + Playwright connectOverCDP, +injects the same stealth.js the in-cluster callers use, and runs your script.
+ +USAGE + homelab browser run <script.js> [--url URL] [--shared-context] [--keep-open] [--port N] [--timeout S] + homelab browser open <url> [--shared-context] [--timeout S] + +WHEN TO USE THIS — escalation only; DEFAULT to the headless/MCP browser + Default to the Playwright MCP / headless browser for ALL routine browsing and + automation — it's interactive (snapshot per step), fast to start, isolated. + Reach for THIS command ONLY when headless is demonstrably blocked: a site + LOADS fine but a gated action FAILS or HANGS — a submit/login/checkout spins + forever, or ONE request errors while its siblings 200. That is the signature + of headless / anti-bot detection (navigator.webdriver, UA "HeadlessChrome", + disable-devtool traps). It presents as a real Chrome and usually succeeds + first try — but it's the shared cluster browser (slower startup, one batch + run, no per-step feedback), so it's the escalation path, never the default. + +ERROR-CODE CHEAT-SHEET (diagnose BEFORE retrying) + ERR_FILE_NOT_FOUND (-6) request intercepted/resolved locally by the + automation layer — NOT a network/egress problem. + (This is what silently broke the headless submit.) + ERR_CONNECTION_REFUSED / real egress failure (DNS/route/firewall). These also + ERR_TIMED_OUT / break the initial page load — if the page loaded, + ERR_NAME_NOT_RESOLVED egress is fine and the cause is elsewhere. + one endpoint 500s while server-side bot rejection of the automation, not + its siblings 200 your payload. + +HABITS + - Inspect the network panel BEFORE retrying a deterministic form; a blind + retry just repeats the same silent failure. + - Don't park a half-filled multi-step form across a user pause — the session + can expire; re-run the whole flow from this command in one shot. + - Uploads stream over CDP via setInputFiles from THIS host — no chmod/staging + of $HOME needed; just point setInputFiles at a local path. 
+ +CONTEXT + Default: a FRESH incognito context, closed on exit — safe for the shared + browser and concurrent callers (e.g. tripit). Your script does its own login. + --shared-context: reuse the warmed PERSISTENT profile (cookies from a manual + noVNC login at chrome.viktorbarzin.me) when you need a pre-logged-in session. + +SCRIPT CONTRACT (run mode) + Your file's body runs with page, context, browser and log() already in scope + (top-level await allowed). Return a value to print it. Example flow.js: + + await page.goto('https://portal.example.com/login'); + await page.fill('#user', 'me'); await page.fill('#pass', process.env.PW); + await page.click('button[type=submit]'); + await page.waitForURL('**/dashboard'); + return 'logged in: ' + page.url(); + + Run it: homelab browser run flow.js + +NOTES + - The Playwright client is pinned to playwright-core@` + playwrightVersion + ` to match the + chrome-service image (Chrome 130); installed once into ~/.cache/homelab/. + - The port-forward is always torn down, on success and on error. 
+` +} diff --git a/cli/cmd_browser_test.go b/cli/cmd_browser_test.go new file mode 100644 index 00000000..668897d3 --- /dev/null +++ b/cli/cmd_browser_test.go @@ -0,0 +1,172 @@ +package main + +import ( + "os" + "reflect" + "strings" + "testing" +) + +func TestParseBrowserArgsRun(t *testing.T) { + got, err := parseBrowserArgs("run", []string{ + "flow.js", "--url", "https://example.com", "--shared-context", + "--port", "19999", "--timeout", "45", "--keep-open", + }) + if err != nil { + t.Fatalf("parseBrowserArgs run: unexpected err: %v", err) + } + want := browserOpts{ + mode: "run", script: "flow.js", url: "https://example.com", + sharedCtx: true, keepOpen: true, port: 19999, timeout: 45, + } + if !reflect.DeepEqual(got, want) { + t.Fatalf("parseBrowserArgs run =\n %+v\nwant\n %+v", got, want) + } +} + +func TestParseBrowserArgsRunDefaults(t *testing.T) { + got, err := parseBrowserArgs("run", []string{"flow.js"}) + if err != nil { + t.Fatalf("unexpected err: %v", err) + } + if got.script != "flow.js" || got.sharedCtx || got.keepOpen || got.port != 0 { + t.Fatalf("defaults wrong: %+v", got) + } + if got.timeout != defaultBrowserTimeout { + t.Fatalf("timeout default = %d, want %d", got.timeout, defaultBrowserTimeout) + } +} + +func TestParseBrowserArgsRunRequiresScript(t *testing.T) { + if _, err := parseBrowserArgs("run", []string{"--url", "https://x"}); err == nil { + t.Fatalf("run without a script path should error") + } +} + +func TestParseBrowserArgsOpenRequiresURL(t *testing.T) { + got, err := parseBrowserArgs("open", []string{"https://example.com"}) + if err != nil { + t.Fatalf("unexpected err: %v", err) + } + if got.url != "https://example.com" || got.mode != "open" { + t.Fatalf("open parse wrong: %+v", got) + } + if _, err := parseBrowserArgs("open", []string{}); err == nil { + t.Fatalf("open without a URL should error") + } +} + +func TestParseBrowserArgsHelp(t *testing.T) { + for _, a := range [][]string{{"--help"}, {"-h"}, {"flow.js", "--help"}} { + got, 
err := parseBrowserArgs("run", a) + if err != nil { + t.Fatalf("help parse %v: %v", a, err) + } + if !got.help { + t.Fatalf("args %v should set help", a) + } + } +} + +func TestParseBrowserArgsEqualsForm(t *testing.T) { + got, err := parseBrowserArgs("run", []string{"flow.js", "--url=https://x", "--port=8123", "--timeout=10"}) + if err != nil { + t.Fatalf("unexpected err: %v", err) + } + if got.url != "https://x" || got.port != 8123 || got.timeout != 10 { + t.Fatalf("--flag=value form not parsed: %+v", got) + } +} + +func TestCDPHealthy(t *testing.T) { + real := []byte(`{"Browser":"Chrome/130.0.6723.31","User-Agent":"Mozilla/5.0 (X11; Linux x86_64) Chrome/130.0.0.0 Safari/537.36","webSocketDebuggerUrl":"ws://127.0.0.1/devtools/browser/x"}`) + browser, ok, err := cdpHealthy(real) + if err != nil || !ok { + t.Fatalf("real Chrome should be healthy: ok=%v err=%v", ok, err) + } + if !strings.HasPrefix(browser, "Chrome/") { + t.Fatalf("browser = %q, want Chrome/ prefix", browser) + } + + headless := []byte(`{"Browser":"HeadlessChrome/130.0.6723.31","User-Agent":"Mozilla/5.0 HeadlessChrome/130.0.0.0"}`) + if _, ok, _ := cdpHealthy(headless); ok { + t.Fatalf("HeadlessChrome must be reported unhealthy (the whole point of chrome-service)") + } + + if _, _, err := cdpHealthy([]byte("not json")); err == nil { + t.Fatalf("malformed /json/version body should error") + } +} + +func TestBuildPortForwardArgs(t *testing.T) { + got := buildPortForwardArgs(18080) + want := []string{"-n", "chrome-service", "port-forward", "svc/chrome-service", "18080:9222"} + if !reflect.DeepEqual(got, want) { + t.Fatalf("buildPortForwardArgs =\n %v\nwant\n %v", got, want) + } +} + +func TestBrowserClientPackageJSONPinsVersion(t *testing.T) { + pj := browserClientPackageJSON() + if !strings.Contains(pj, `"playwright-core": "`+playwrightVersion+`"`) { + t.Fatalf("package.json must pin playwright-core to %s; got:\n%s", playwrightVersion, pj) + } +} + +func TestPlaywrightVersionPinnedToServerMinor(t 
*testing.T) { + // chrome-service runs mcr.microsoft.com/playwright:v1.48.0-noble; the CDP + // client minor MUST match (protocol changes between minors). + if !strings.HasPrefix(playwrightVersion, "1.48.") { + t.Fatalf("playwrightVersion = %q, must be 1.48.x to match the chrome-service image", playwrightVersion) + } +} + +func TestBrowserHelpHasDiagnosticCheatSheet(t *testing.T) { + h := browserHelp() + for _, want := range []string{ + "homelab browser run", + "ERR_FILE_NOT_FOUND", + "ERR_CONNECTION_REFUSED", + "network panel", + "headless", + "--shared-context", + } { + if !strings.Contains(h, want) { + t.Errorf("browser --help is missing %q (the discoverability/self-correction payload)", want) + } + } +} + +func TestBrowserHelpIsTiered(t *testing.T) { + // --help must frame this as the ESCALATION path (default to headless first), + // matching ~/code/CLAUDE.md and chrome-service.md — non-conflicting agent + // instructions. Guard against a regression to "co-equal choice" wording. + h := browserHelp() + for _, want := range []string{"Default to the", "escalation"} { + if !strings.Contains(h, want) { + t.Errorf("browser --help must carry the tiered/default-headless framing; missing %q", want) + } + } +} + +func TestStealthJSEmbeddedMatchesCanonical(t *testing.T) { + // The embedded copy must never drift from the source of truth that the + // in-cluster callers use, else the CLI's stealth and the cluster's diverge. 
+ canonical, err := os.ReadFile("../stacks/chrome-service/files/stealth.js") + if err != nil { + t.Fatalf("read canonical stealth.js: %v", err) + } + if stealthJS != string(canonical) { + t.Fatalf("cli/browser_stealth.js has drifted from stacks/chrome-service/files/stealth.js — re-copy it") + } +} + +func TestFreePortReturnsUsablePort(t *testing.T) { + p, err := freePort() + if err != nil { + t.Fatalf("freePort: %v", err) + } + if p <= 1024 || p > 65535 { + t.Fatalf("freePort returned %d, want an ephemeral port", p) + } +} diff --git a/cli/cmd_ci.go b/cli/cmd_ci.go new file mode 100644 index 00000000..66d4902d --- /dev/null +++ b/cli/cmd_ci.go @@ -0,0 +1,99 @@ +package main + +import ( + "fmt" + "os" + "strings" + "time" +) + +func ciCommands() []Command { + return []Command{ + {Path: []string{"ci", "status"}, Tier: TierRead, + Summary: "pipeline status for HEAD/a commit: ci status [commit]", Run: ciStatus}, + {Path: []string{"ci", "watch"}, Tier: TierRead, + Summary: "poll the pipeline for HEAD (or a commit) to terminal; non-zero on failure", Run: ciWatch}, + } +} + +func short(s string) string { + if len(s) > 8 { + return s[:8] + } + return s +} + +func firstLine(s string) string { return strings.SplitN(s, "\n", 2)[0] } + +// currentHEAD returns the full HEAD sha of the cwd repo (empty if not a repo). 
+func currentHEAD() string { + cwd, _ := os.Getwd() + root, err := gitRepoRoot(cwd) + if err != nil { + return "" + } + sha, _ := gitOutput(root, "rev-parse", "HEAD") + return sha +} + +func ciStatus(args []string) error { + commit, _ := firstPositional(args) + c, err := newWPClient() + if err != nil { + return err + } + id, err := c.repoID() + if err != nil { + return err + } + p, err := c.findPipeline(id, commit) + if err != nil { + return err + } + fmt.Printf("#%d %s event=%s %s %s\n", p.Number, p.Status, p.Event, short(p.Commit), firstLine(p.Message)) + return nil +} + +func ciWatch(args []string) error { + commit, _ := firstPositional(args) + if commit == "" { + commit = currentHEAD() + } + if commit == "" { + return fmt.Errorf("no commit given and not in a git repo") + } + c, err := newWPClient() + if err != nil { + return err + } + id, err := c.repoID() + if err != nil { + return err + } + timeout := 20 * time.Minute + deadline := time.Now().Add(timeout) + last := "" + for time.Now().Before(deadline) { + p, err := c.findPipeline(id, commit) + if err != nil { + if last != "waiting" { + fmt.Fprintf(os.Stderr, "homelab: waiting for pipeline (%s)...\n", short(commit)) + last = "waiting" + } + } else { + if p.Status != last { + fmt.Fprintf(os.Stderr, "homelab: #%d %s\n", p.Number, p.Status) + last = p.Status + } + if isTerminalStatus(p.Status) { + fmt.Printf("#%d %s %s\n", p.Number, p.Status, short(commit)) + if isFailureStatus(p.Status) { + return fmt.Errorf("pipeline #%d %s (woodpecker repo, see UI/DB for the failing step)", p.Number, p.Status) + } + return nil + } + } + time.Sleep(15 * time.Second) + } + return fmt.Errorf("timed out after %s waiting for CI on %s", timeout, short(commit)) +} diff --git a/cli/cmd_claim.go b/cli/cmd_claim.go new file mode 100644 index 00000000..e11a37db --- /dev/null +++ b/cli/cmd_claim.go @@ -0,0 +1,56 @@ +package main + +import ( + "fmt" + "strings" +) + +func claimCommands() []Command { + return []Command{ + {Path: 
[]string{"claim"}, Tier: TierWrite, + Summary: "claim a shared infra resource on the presence board", + Run: runClaim}, + {Path: []string{"release"}, Tier: TierWrite, + Summary: "release a presence claim", + Run: runRelease}, + } +} + +// runClaim parses `<kind>:<name> --purpose "..."` in either order (the presence +// script takes the label first, so we can't rely on Go's flag package which +// stops at the first positional). +func runClaim(args []string) error { + var label, purpose string + for i := 0; i < len(args); i++ { + a := args[i] + switch { + case a == "--purpose" || a == "-purpose": + if i+1 < len(args) { + purpose = args[i+1] + i++ + } + case strings.HasPrefix(a, "--purpose="): + purpose = strings.TrimPrefix(a, "--purpose=") + case !strings.HasPrefix(a, "-") && label == "": + label = a + } + } + if label == "" { + return fmt.Errorf(`usage: homelab claim <kind>:<name> --purpose "what + why"`) + } + return presenceClaim(label, purpose) +} + +func runRelease(args []string) error { + var label string + for _, a := range args { + if !strings.HasPrefix(a, "-") { + label = a + break + } + } + if label == "" { + return fmt.Errorf("usage: homelab release <kind>:<name>") + } + return presenceRelease(label) +} diff --git a/cli/cmd_deploy.go b/cli/cmd_deploy.go new file mode 100644 index 00000000..d5afc4a8 --- /dev/null +++ b/cli/cmd_deploy.go @@ -0,0 +1,51 @@ +package main + +import ( + "fmt" + "os" + "strings" + "time" +) + +func deployCommands() []Command { + return []Command{ + {Path: []string{"deploy", "wait"}, Tier: TierRead, + Summary: "wait for <ns>/<deploy> to roll out the current (or --sha) image: deploy wait <ns>/<deploy> [--sha SHA]", Run: deployWait}, + } +} + +// deployWait closes the "did the NEW code land" gap: rollout status alone returns +// success on the OLD ReplicaSet, so we first wait for the deployment image to +// reference the expected sha, THEN block on rollout status. 
+func deployWait(args []string) error { + target, _ := firstPositional(args) + if target == "" || !strings.Contains(target, "/") { + return fmt.Errorf("usage: homelab deploy wait <ns>/<deploy> [--sha SHA]") + } + parts := strings.SplitN(target, "/", 2) + ns, deploy := parts[0], parts[1] + + sha := flagValue(args, "--sha") + if sha == "" { + sha = short(currentHEAD()) + } + deadline := time.Now().Add(10 * time.Minute) + + if sha != "" { + fmt.Fprintf(os.Stderr, "homelab: waiting for %s/%s image to match %s...\n", ns, deploy, sha) + matched := false + for time.Now().Before(deadline) { + img, _ := kubectlCapture(ns, "get", "deploy", deploy, "-o", "jsonpath={.spec.template.spec.containers[*].image}") + if strings.Contains(img, sha) { + matched = true + break + } + time.Sleep(10 * time.Second) + } + if !matched { + return fmt.Errorf("timed out: %s/%s image never matched %q", ns, deploy, sha) + } + } + fmt.Fprintf(os.Stderr, "homelab: rollout status %s/%s...\n", ns, deploy) + return kubectlStream(ns, "rollout", "status", "deploy/"+deploy, "--timeout=180s") +} diff --git a/cli/cmd_ha.go b/cli/cmd_ha.go new file mode 100644 index 00000000..2309bdfc --- /dev/null +++ b/cli/cmd_ha.go @@ -0,0 +1,172 @@ +package main + +import ( + "encoding/base64" + "fmt" + "os" + "path/filepath" + "strings" +) + +// Home Assistant verbs cover the two things the `ha` MCP server can't: resolving +// the long-lived API token out of the cluster, and SSH to the HA host for +// host-level work (config files, docker, add-ons). Entity state/control stays +// with the MCP — see docs/adr/0012. +// +// The token lives in the dedicated k8s Secret openclaw/ha-tokens (one key per +// instance), split out of openclaw-secrets so non-admin operators (emo / "Home +// Server Admins") can read JUST the HA token, not the full skill_secrets blob.
+// `ha token` resolves it on demand via the ambient kubeconfig, so it never +// depends on a pre-set env var (the gap that made agents re-derive the +// kubectl|base64|jq pipeline every session). + +type haInstance struct { + name string // sofia | london + sshUser string // SSH login on the HA host + sshHost string // host reachable from the devvm (Sofia LAN) + secretKey string // key inside the openclaw/ha-tokens Secret holding this token +} + +const ( + haDefaultInstance = "sofia" + haSecretNamespace = "openclaw" + haSecretName = "ha-tokens" // dedicated, least-privilege; see stacks/openclaw/ha_tokens.tf +) + +// haInstances maps instance name → connection/secret facts. sofia is the default +// because the devvm is on the Sofia LAN; london is documented but its host +// (192.168.8.x) is only reachable remotely, so `ha ssh --instance london` +// generally won't connect from here (token resolution still works). +var haInstances = map[string]haInstance{ + "sofia": {name: "sofia", sshUser: "vbarzin", sshHost: "192.168.1.8", secretKey: "sofia"}, + "london": {name: "london", sshUser: "hassio", sshHost: "192.168.8.103", secretKey: "london"}, +} + +func haCommands() []Command { + return []Command{ + {Path: []string{"ha", "token"}, Tier: TierRead, + Summary: "reveal the HA long-lived API token from the cluster: ha token [--instance sofia|london]", Run: haToken}, + {Path: []string{"ha", "ssh"}, Tier: TierWrite, + Summary: "run a command on the HA host over ssh: ha ssh [--instance sofia|london] [-i KEY] -- <cmd>", Run: haSSH}, + } +} + +// resolveHAInstance looks up an instance by name; "" yields the default (sofia). 
+func resolveHAInstance(name string) (haInstance, error) { + if name == "" { + name = haDefaultInstance + } + inst, ok := haInstances[name] + if !ok { + return haInstance{}, fmt.Errorf("unknown HA instance %q (want sofia or london)", name) + } + return inst, nil +} + +// decodeSecretValue base64-decodes a k8s Secret `.data.<key>` value as returned +// by kubectl jsonpath (trailing whitespace tolerated). +func decodeSecretValue(b64 string) (string, error) { + raw, err := base64.StdEncoding.DecodeString(strings.TrimSpace(b64)) + if err != nil { + return "", fmt.Errorf("base64-decode secret value: %w", err) + } + return string(raw), nil +} + +func haToken(args []string) error { + name, _ := firstPositional(args) // accept `ha token sofia` as well as `--instance sofia` + for i := 0; i < len(args); i++ { + if args[i] == "--instance" && i+1 < len(args) { + name = args[i+1] + } else if strings.HasPrefix(args[i], "--instance=") { + name = strings.TrimPrefix(args[i], "--instance=") + } + } + inst, err := resolveHAInstance(name) + if err != nil { + return err + } + b64, err := kubectlCapture(haSecretNamespace, "get", "secret", haSecretName, + "-o", "jsonpath={.data."+inst.secretKey+"}") + if err != nil { + return fmt.Errorf("read secret %s/%s (kubeconfig set? RBAC?): %w", haSecretNamespace, haSecretName, err) + } + if b64 == "" { + return fmt.Errorf("secret %s/%s has no %q key", haSecretNamespace, haSecretName, inst.secretKey) + } + tok, err := decodeSecretValue(b64) + if err != nil { + return err + } + fmt.Println(tok) + return nil +} + +// defaultHAKeyPath is the invoking user's ed25519 key, so the verb is per-user +// rather than tied to whoever first wrote the workflow. +func defaultHAKeyPath() string { + if home, err := os.UserHomeDir(); err == nil && home != "" { + return filepath.Join(home, ".ssh", "id_ed25519") + } + return filepath.Join("~", ".ssh", "id_ed25519") +} + +// parseHASSH reads `[--instance X] [-i|--key PATH] [-- ] <cmd...>`. 
Tokens after +// `--` are taken verbatim; bare tokens before it are also the remote command. +func parseHASSH(args []string) (inst haInstance, keyPath string, remote []string, err error) { + name := haDefaultInstance + keyPath = defaultHAKeyPath() + for i := 0; i < len(args); i++ { + a := args[i] + switch { + case a == "--": + remote = append(remote, args[i+1:]...) + i = len(args) + case a == "--instance": + if i+1 < len(args) { + name = args[i+1] + i++ + } + case strings.HasPrefix(a, "--instance="): + name = strings.TrimPrefix(a, "--instance=") + case a == "--key" || a == "-i": + if i+1 < len(args) { + keyPath = args[i+1] + i++ + } + case strings.HasPrefix(a, "--key="): + keyPath = strings.TrimPrefix(a, "--key=") + default: + remote = append(remote, a) + } + } + inst, err = resolveHAInstance(name) + return inst, keyPath, remote, err +} + +// buildHASSHArgs assembles deterministic, non-interactive ssh args: an explicit +// key, no user ssh config, and no known_hosts prompt/record — so it runs +// unattended in an agent session without hanging on a host-key prompt. +func buildHASSHArgs(inst haInstance, keyPath string, remote []string) []string { + args := []string{ + "-F", "/dev/null", + "-o", "IdentityFile=" + keyPath, + "-o", "StrictHostKeyChecking=no", + "-o", "UserKnownHostsFile=/dev/null", + "-o", "ConnectTimeout=10", + "-o", "BatchMode=yes", + inst.sshUser + "@" + inst.sshHost, + } + return append(args, remote...) +} + +func haSSH(args []string) error { + inst, keyPath, remote, err := parseHASSH(args) + if err != nil { + return err + } + if len(remote) == 0 { + return fmt.Errorf(`usage: homelab ha ssh [--instance sofia|london] [-i KEY] -- <command>`) + } + return runStreaming("ssh", buildHASSHArgs(inst, keyPath, remote)...) 
+} diff --git a/cli/cmd_ha_test.go b/cli/cmd_ha_test.go new file mode 100644 index 00000000..9dc10e11 --- /dev/null +++ b/cli/cmd_ha_test.go @@ -0,0 +1,92 @@ +package main + +import ( + "encoding/base64" + "reflect" + "strings" + "testing" +) + +func TestResolveHAInstance(t *testing.T) { + // empty defaults to sofia (the devvm sits on the Sofia LAN) + if got, err := resolveHAInstance(""); err != nil || got.name != "sofia" { + t.Fatalf(`resolveHAInstance("") = %+v, %v; want sofia`, got, err) + } + if got, err := resolveHAInstance("sofia"); err != nil || got.secretKey != "sofia" { + t.Fatalf("sofia secretKey = %q, %v", got.secretKey, err) + } + if got, err := resolveHAInstance("london"); err != nil || got.secretKey != "london" || got.sshUser != "hassio" { + t.Fatalf("london = %+v, %v", got, err) + } + if _, err := resolveHAInstance("paris"); err == nil { + t.Fatalf("resolveHAInstance(paris) should error on unknown instance") + } +} + +func TestDecodeSecretValue(t *testing.T) { + // k8s stores Secret values base64-encoded; `kubectl -o jsonpath={.data.<k>}` + // returns that base64, which decodeSecretValue turns back into the raw token. 
+ enc := base64.StdEncoding.EncodeToString([]byte("tok-sofia")) + if got, err := decodeSecretValue(enc); err != nil || got != "tok-sofia" { + t.Fatalf("decodeSecretValue = %q, %v; want tok-sofia", got, err) + } + // trailing whitespace/newline from jsonpath output must be tolerated + if got, err := decodeSecretValue(enc + "\n"); err != nil || got != "tok-sofia" { + t.Fatalf("decodeSecretValue (trailing ws) = %q, %v; want tok-sofia", got, err) + } + if _, err := decodeSecretValue("not-base64!!"); err == nil { + t.Fatalf("decodeSecretValue should error on undecodable base64") + } +} + +func TestBuildHASSHArgs(t *testing.T) { + inst, _ := resolveHAInstance("sofia") + got := buildHASSHArgs(inst, "/home/u/.ssh/id_ed25519", []string{"cat", "/config/configuration.yaml"}) + want := []string{ + "-F", "/dev/null", + "-o", "IdentityFile=/home/u/.ssh/id_ed25519", + "-o", "StrictHostKeyChecking=no", + "-o", "UserKnownHostsFile=/dev/null", + "-o", "ConnectTimeout=10", + "-o", "BatchMode=yes", + "vbarzin@192.168.1.8", + "cat", "/config/configuration.yaml", + } + if !reflect.DeepEqual(got, want) { + t.Fatalf("buildHASSHArgs =\n %v\nwant\n %v", got, want) + } +} + +func TestParseHASSH(t *testing.T) { + // instance flag + everything after `--` is the verbatim remote command + inst, key, remote, err := parseHASSH([]string{"--instance", "sofia", "--", "docker", "ps", "-a"}) + if err != nil { + t.Fatalf("parseHASSH err: %v", err) + } + if inst.name != "sofia" { + t.Errorf("instance = %q, want sofia", inst.name) + } + if !strings.HasSuffix(key, "/.ssh/id_ed25519") { + t.Errorf("default key = %q, want it to end in /.ssh/id_ed25519", key) + } + if !reflect.DeepEqual(remote, []string{"docker", "ps", "-a"}) { + t.Errorf("remote = %v, want [docker ps -a]", remote) + } + + // bare args (no `--`) are also taken as the remote command; -i overrides the key + _, key2, remote2, err := parseHASSH([]string{"-i", "/tmp/k", "uptime"}) + if err != nil { + t.Fatalf("parseHASSH err: %v", err) + } + if 
key2 != "/tmp/k" { + t.Errorf("key = %q, want /tmp/k", key2) + } + if !reflect.DeepEqual(remote2, []string{"uptime"}) { + t.Errorf("remote = %v, want [uptime]", remote2) + } + + // unknown instance surfaces as an error + if _, _, _, err := parseHASSH([]string{"--instance", "paris", "--", "ls"}); err == nil { + t.Errorf("parseHASSH should error on unknown instance") + } +} diff --git a/cli/cmd_k8s.go b/cli/cmd_k8s.go new file mode 100644 index 00000000..80f8f62d --- /dev/null +++ b/cli/cmd_k8s.go @@ -0,0 +1,288 @@ +package main + +import ( + "fmt" + "os" + "strings" +) + +func k8sCommands() []Command { + return []Command{ + {Path: []string{"k8s", "status"}, Tier: TierRead, + Summary: "pods (wide) + recent non-Normal events for a namespace (or -A)", Run: k8sStatus}, + {Path: []string{"k8s", "get"}, Tier: TierRead, + Summary: "kubectl get in a namespace: k8s get <ns> <resource> [args]", Run: k8sGet}, + {Path: []string{"k8s", "logs"}, Tier: TierRead, + Summary: "logs for <app> (deploy/<app>; --tail/-c/--previous/--since/-l)", Run: k8sLogs}, + {Path: []string{"k8s", "describe"}, Tier: TierRead, + Summary: "describe <app>'s deployment (or an explicit resource)", Run: k8sDescribe}, + {Path: []string{"k8s", "debug"}, Tier: TierRead, + Summary: "one-shot triage for <app>: pods+deploy+describe+logs+events", Run: k8sDebug}, + {Path: []string{"k8s", "pf"}, Tier: TierRead, + Summary: "port-forward: k8s pf <app> <local:remote> [svc/pod target]", Run: k8sPortForward}, + {Path: []string{"k8s", "db"}, Tier: TierWrite, + Summary: `query a dbaas DB: k8s db <app> [--mysql] [--db N] -- "<SQL>"`, Run: k8sDB}, + {Path: []string{"k8s", "exec"}, Tier: TierWrite, + Summary: "exec in <app>'s pod: k8s exec <app> [--tty] -- <cmd>", Run: k8sExec}, + {Path: []string{"k8s", "rm-pod"}, Tier: TierWrite, + Summary: "delete a stuck pod/job ONLY: k8s rm-pod <name> -n <ns> [--job] [--force]", Run: k8sRmPod}, + {Path: []string{"k8s", "rollout-status"}, Tier: TierRead, + Summary: "rollout status of 
deploy/<app>", Run: k8sRolloutStatus}, + {Path: []string{"k8s", "restart"}, Tier: TierWrite, + Summary: "rollout restart deploy/<app> then wait for status", Run: k8sRestart}, + {Path: []string{"k8s", "probe"}, Tier: TierRead, + Summary: "in-cluster reachability: ephemeral curl pod to <app>.<ns>.svc", Run: k8sProbe}, + } +} + +func k8sStatus(args []string) error { + t := parseK8sTarget(args) + ns := t.namespace() // "" when no app/ns given → cluster-wide + get := []string{"get", "pods", "-o", "wide"} + ev := []string{"get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp"} + if ns == "" { + get = append(get, "-A") + ev = append(ev, "-A") + } + if err := kubectlStream(ns, get...); err != nil { + return err + } + fmt.Fprintln(os.Stderr, "\n--- recent events (type!=Normal) ---") + _ = kubectlStream(ns, ev...) // best-effort + return nil +} + +func k8sGet(args []string) error { + t := parseK8sTarget(args) + if t.app == "" || len(t.rest) == 0 { + return fmt.Errorf("usage: homelab k8s get <ns> <resource> [args]") + } + return kubectlStream(t.app, append([]string{"get"}, t.rest...)...) +} + +func k8sLogs(args []string) error { + t := parseK8sTarget(args) + if t.app == "" { + return fmt.Errorf("usage: homelab k8s logs <app> [--tail N] [-c ctr] [--previous] [--since 1h] [-l sel]") + } + a := []string{"logs"} + if t.selector != "" { + a = append(a, "-l", t.selector) + } else { + a = append(a, t.objectRef()) + } + if t.container != "" { + a = append(a, "-c", t.container) + } + if !containsPrefix(t.rest, "--tail") { + a = append(a, "--tail=200") + } + a = append(a, t.rest...) + return kubectlStream(t.namespace(), a...) +} + +func k8sDescribe(args []string) error { + t := parseK8sTarget(args) + if t.app == "" { + return fmt.Errorf("usage: homelab k8s describe <app> [resource]") + } + if len(t.rest) > 0 { + return kubectlStream(t.namespace(), append([]string{"describe"}, t.rest...)...) 
+ } + return kubectlStream(t.namespace(), "describe", t.objectRef()) +} + +func k8sDebug(args []string) error { + t := parseK8sTarget(args) + if t.app == "" { + return fmt.Errorf("usage: homelab k8s debug <app>") + } + ns := t.namespace() + sec := func(title string) { fmt.Fprintf(os.Stderr, "\n=== %s ===\n", title) } + sec("pods") + _ = kubectlStream(ns, "get", "pods", "-o", "wide") + sec("workloads") + _ = kubectlStream(ns, "get", "deploy,sts,ds", "-o", "wide") + sec("describe "+t.objectRef()) + _ = kubectlStream(ns, "describe", t.objectRef()) + sec("recent logs (--tail=50)") + _ = kubectlStream(ns, "logs", t.objectRef(), "--tail=50") + sec("events (type!=Normal)") + _ = kubectlStream(ns, "get", "events", "--field-selector", "type!=Normal", "--sort-by=.lastTimestamp") + return nil +} + +func k8sPortForward(args []string) error { + t := parseK8sTarget(args) + if t.app == "" || len(t.rest) == 0 { + return fmt.Errorf("usage: homelab k8s pf <app> <local:remote> [svc/pod target]") + } + ports := t.rest[0] + target := "svc/" + t.app + if len(t.rest) > 1 { + target = t.rest[1] + } + return kubectlStream(t.namespace(), "port-forward", target, ports) +} + +func k8sDB(args []string) error { + var app, dbName, sql string + mysql := false + for i := 0; i < len(args); i++ { + a := args[i] + if a == "--" { + sql = strings.Join(args[i+1:], " ") + break + } + switch { + case a == "--mysql": + mysql = true + case a == "--db": + if i+1 < len(args) { + dbName = args[i+1] + i++ + } + case strings.HasPrefix(a, "--db="): + dbName = strings.TrimPrefix(a, "--db=") + case !strings.HasPrefix(a, "-") && app == "": + app = a + } + } + if app == "" { + return fmt.Errorf(`usage: homelab k8s db <app> [--mysql] [--db NAME] -- "<SQL>"`) + } + p := planDBExec(app, dbName, sql, mysql) + pod := p.pod + if pod == "" && p.selector != "" { + resolved, err := kubectlCapture(p.ns, "get", "pod", "-l", p.selector, "-o", "jsonpath={.items[0].metadata.name}") + if err != nil || resolved == "" { + return 
fmt.Errorf("could not resolve db pod in %s (selector %q): %v", p.ns, p.selector, err) + } + pod = resolved + } + exec := []string{"exec"} + if sql == "" { + exec = append(exec, "-it") // interactive client when no SQL given + } + exec = append(exec, pod) + if p.container != "" { + exec = append(exec, "-c", p.container) + } + exec = append(exec, "--") + exec = append(exec, p.argv...) + return kubectlStream(p.ns, exec...) +} + +func k8sExec(args []string) error { + t := parseK8sTarget(args) + if t.app == "" { + return fmt.Errorf("usage: homelab k8s exec <app> [--pod p] [-c ctr] [--tty] -- <cmd>") + } + if len(t.rest) == 0 { + return fmt.Errorf("provide a command after --, e.g. homelab k8s exec %s -- env", t.app) + } + a := []string{"exec"} + if t.tty { + a = append(a, "-it") + } + a = append(a, t.objectRef()) + if t.container != "" { + a = append(a, "-c", t.container) + } + a = append(a, "--") + a = append(a, t.rest...) + return kubectlStream(t.namespace(), a...) +} + +func k8sRmPod(args []string) error { + var pod, ns, grace string + force, job := false, false + for i := 0; i < len(args); i++ { + a := args[i] + switch { + case a == "-n" || a == "--namespace": + if i+1 < len(args) { + ns = args[i+1] + i++ + } + case a == "--force": + force = true + case a == "--job": + job = true + case a == "--grace": + if i+1 < len(args) { + grace = args[i+1] + i++ + } + case !strings.HasPrefix(a, "-") && pod == "": + pod = a + } + } + if pod == "" || ns == "" { + return fmt.Errorf("usage: homelab k8s rm-pod <name> -n <ns> [--job] [--force] [--grace N] (pods/jobs only)") + } + kind := "pod" + if job { + kind = "job" + } + a := []string{"delete", kind, pod} + if grace != "" { + a = append(a, "--grace-period="+grace) + } + if force { + a = append(a, "--force") + } + return kubectlStream(ns, a...) 
+} + +func k8sRolloutStatus(args []string) error { + t := parseK8sTarget(args) + if t.app == "" { + return fmt.Errorf("usage: homelab k8s rollout-status <app>") + } + return kubectlStream(t.namespace(), "rollout", "status", "deploy/"+t.app) +} + +func k8sRestart(args []string) error { + t := parseK8sTarget(args) + if t.app == "" { + return fmt.Errorf("usage: homelab k8s restart <app>") + } + ns := t.namespace() + if err := kubectlStream(ns, "rollout", "restart", "deploy/"+t.app); err != nil { + return err + } + return kubectlStream(ns, "rollout", "status", "deploy/"+t.app) +} + +func k8sProbe(args []string) error { + t := parseK8sTarget(args) + if t.app == "" { + return fmt.Errorf("usage: homelab k8s probe <app> [path] [--port N]") + } + ns := t.namespace() + url := "http://" + t.app + "." + ns + ".svc.cluster.local" + if port := flagValue(args, "--port"); port != "" { + url += ":" + port + } + if len(t.rest) > 0 { + p := t.rest[0] + if !strings.HasPrefix(p, "/") { + p = "/" + p + } + url += p + } + return kubectlStream(ns, "run", "homelab-probe", "--rm", "-i", "--restart=Never", + "--image=curlimages/curl:latest", "--", + "curl", "-sS", "--max-time", "10", "-w", "\n[%{http_code}] %{time_total}s\n", url) +} + +// containsPrefix reports whether any arg starts with prefix. 
+func containsPrefix(args []string, prefix string) bool { + for _, a := range args { + if strings.HasPrefix(a, prefix) { + return true + } + } + return false +} diff --git a/cli/cmd_memory.go b/cli/cmd_memory.go new file mode 100644 index 00000000..94f3a482 --- /dev/null +++ b/cli/cmd_memory.go @@ -0,0 +1,302 @@ +package main + +import ( + "encoding/json" + "fmt" + "net/url" + "strings" +) + +func memoryCommands() []Command { + return []Command{ + {Path: []string{"memory", "recall"}, Tier: TierRead, + Summary: `semantic search of memory: memory recall "<context>" [--query …] [--category] [--sort] [--limit]`, Run: memoryRecall}, + {Path: []string{"memory", "list"}, Tier: TierRead, + Summary: "list recent memories [--category C] [--tag T] [--limit N]", Run: memoryList}, + {Path: []string{"memory", "categories"}, Tier: TierRead, + Summary: "list memory categories", Run: memorySimpleGet("/api/categories")}, + {Path: []string{"memory", "tags"}, Tier: TierRead, + Summary: "list memory tags", Run: memorySimpleGet("/api/tags")}, + {Path: []string{"memory", "stats"}, Tier: TierRead, + Summary: "memory store stats", Run: memorySimpleGet("/api/stats")}, + {Path: []string{"memory", "secret"}, Tier: TierRead, + Summary: "reveal a sensitive memory's content: memory secret <id>", Run: memorySecret}, + {Path: []string{"memory", "store"}, Tier: TierWrite, + Summary: `store a memory: memory store "<content>" [--category --tags --keywords --importance --sensitive]`, Run: memoryStore}, + {Path: []string{"memory", "update"}, Tier: TierWrite, + Summary: "update a memory: memory update <id> [--content --tags --importance --keywords]", Run: memoryUpdate}, + {Path: []string{"memory", "delete"}, Tier: TierWrite, + Summary: "delete a memory: memory delete <id>", Run: memoryDelete}, + } +} + +// printMemories renders a {memories:[…]} response as compact lines, or raw JSON. 
+func printMemories(raw []byte, jsonOut bool) error { + if jsonOut { + fmt.Println(string(raw)) + return nil + } + var r struct { + Memories []struct { + ID int `json:"id"` + Content string `json:"content"` + Category string `json:"category"` + Tags string `json:"tags"` + Importance float64 `json:"importance"` + } `json:"memories"` + } + if err := json.Unmarshal(raw, &r); err != nil { + fmt.Println(string(raw)) + return nil + } + if len(r.Memories) == 0 { + fmt.Println("(no memories)") + return nil + } + for _, m := range r.Memories { + c := strings.ReplaceAll(m.Content, "\n", " ") + if rs := []rune(c); len(rs) > 240 { // truncate on rune boundaries so multi-byte UTF-8 is never split + c = string(rs[:240]) + "…" + } + fmt.Printf("#%d [%s] (%.2f) %s\n", m.ID, m.Category, m.Importance, c) + if m.Tags != "" { + fmt.Printf(" tags: %s\n", m.Tags) + } + } + return nil +} + +func memoryRecall(args []string) error { + req := memRecallReq{} + jsonOut := false + var pos []string + for i := 0; i < len(args); i++ { + a := args[i] + switch { + case a == "--query": + if i+1 < len(args) { + req.ExpandedQuery = args[i+1] + i++ + } + case a == "--category": + if i+1 < len(args) { + req.Category = args[i+1] + i++ + } + case a == "--sort": + if i+1 < len(args) { + req.SortBy = args[i+1] + i++ + } + case a == "--limit": + if i+1 < len(args) { + fmt.Sscanf(args[i+1], "%d", &req.Limit) + i++ + } + case a == "--json": + jsonOut = true + case !strings.HasPrefix(a, "-"): + pos = append(pos, a) + } + } + req.Context = strings.Join(pos, " ") + if req.Context == "" { + return fmt.Errorf(`usage: homelab memory recall "<context>" [--query …] [--category C] [--sort importance|relevance|recency] [--limit N]`) + } + c, err := newMemoryClient() + if err != nil { + return err + } + raw, err := c.do("POST", "/api/memories/recall", req) + if err != nil { + return err + } + return printMemories(raw, jsonOut) +} + +func memoryList(args []string) error { + q := url.Values{} + jsonOut := false + for i := 0; i < len(args); i++ { + a := args[i] + switch { + case a == "--category": + if i+1 < 
len(args) { + q.Set("category", args[i+1]) + i++ + } + case a == "--tag": + if i+1 < len(args) { + q.Set("tag", args[i+1]) + i++ + } + case a == "--limit": + if i+1 < len(args) { + q.Set("limit", args[i+1]) + i++ + } + case a == "--json": + jsonOut = true + } + } + c, err := newMemoryClient() + if err != nil { + return err + } + path := "/api/memories" + if len(q) > 0 { + path += "?" + q.Encode() + } + raw, err := c.do("GET", path, nil) + if err != nil { + return err + } + return printMemories(raw, jsonOut) +} + +func memorySimpleGet(path string) func([]string) error { + return func(args []string) error { + c, err := newMemoryClient() + if err != nil { + return err + } + raw, err := c.do("GET", path, nil) + if err != nil { + return err + } + fmt.Println(string(raw)) + return nil + } +} + +func memorySecret(args []string) error { + id, _ := firstPositional(args) + if id == "" { + return fmt.Errorf("usage: homelab memory secret <id>") + } + c, err := newMemoryClient() + if err != nil { + return err + } + raw, err := c.do("POST", "/api/memories/"+id+"/secret", nil) + if err != nil { + return err + } + fmt.Println(string(raw)) + return nil +} + +func memoryStore(args []string) error { + req := memStoreReq{Category: "facts", Importance: 0.5} + var pos []string + for i := 0; i < len(args); i++ { + a := args[i] + switch { + case a == "--category": + if i+1 < len(args) { + req.Category = args[i+1] + i++ + } + case a == "--tags": + if i+1 < len(args) { + req.Tags = args[i+1] + i++ + } + case a == "--keywords": + if i+1 < len(args) { + req.ExpandedKeywords = args[i+1] + i++ + } + case a == "--importance": + if i+1 < len(args) { + fmt.Sscanf(args[i+1], "%f", &req.Importance) + i++ + } + case a == "--sensitive": + req.ForceSensitive = true + case !strings.HasPrefix(a, "-"): + pos = append(pos, a) + } + } + req.Content = strings.Join(pos, " ") + if req.Content == "" { + return fmt.Errorf(`usage: homelab memory store "<content>" [--category C] [--tags ...] [--keywords ...] 
[--importance 0.5] [--sensitive]`) + } + c, err := newMemoryClient() + if err != nil { + return err + } + raw, err := c.do("POST", "/api/memories", req) + if err != nil { + return err + } + fmt.Println(string(raw)) + return nil +} + +func memoryUpdate(args []string) error { + var id string + req := memUpdateReq{} + for i := 0; i < len(args); i++ { + a := args[i] + switch { + case a == "--content": + if i+1 < len(args) { + v := args[i+1] + req.Content = &v + i++ + } + case a == "--tags": + if i+1 < len(args) { + v := args[i+1] + req.Tags = &v + i++ + } + case a == "--keywords": + if i+1 < len(args) { + v := args[i+1] + req.ExpandedKeywords = &v + i++ + } + case a == "--importance": + if i+1 < len(args) { + var f float64 + fmt.Sscanf(args[i+1], "%f", &f) + req.Importance = &f + i++ + } + case !strings.HasPrefix(a, "-") && id == "": + id = a + } + } + if id == "" { + return fmt.Errorf("usage: homelab memory update <id> [--content ...] [--tags ...] [--importance N] [--keywords ...]") + } + c, err := newMemoryClient() + if err != nil { + return err + } + raw, err := c.do("PUT", "/api/memories/"+id, req) + if err != nil { + return err + } + fmt.Println(string(raw)) + return nil +} + +func memoryDelete(args []string) error { + id, _ := firstPositional(args) + if id == "" { + return fmt.Errorf("usage: homelab memory delete <id>") + } + c, err := newMemoryClient() + if err != nil { + return err + } + raw, err := c.do("DELETE", "/api/memories/"+id, nil) + if err != nil { + return err + } + fmt.Println(string(raw)) + return nil +} diff --git a/cli/cmd_net.go b/cli/cmd_net.go new file mode 100644 index 00000000..6401755c --- /dev/null +++ b/cli/cmd_net.go @@ -0,0 +1,83 @@ +package main + +import ( + "fmt" + "strings" + "time" +) + +func netCommands() []Command { + return []Command{ + {Path: []string{"net", "check"}, Tier: TierRead, + Summary: "reachability of <host>[/path]: external (public DNS→CF) vs internal (Traefik LB)", Run: netCheck}, + {Path: []string{"dns", "lookup"}, 
Tier: TierRead, + Summary: "resolve <name> via Technitium (10.0.20.201) and public (1.1.1.1), diffed", Run: dnsLookup}, + } +} + +func fmtProbe(code int, d time.Duration, err error) string { + if err != nil { + return "ERR " + err.Error() + } + return fmt.Sprintf("HTTP %d %dms", code, d.Milliseconds()) +} + +func netCheck(args []string) error { + host, rest := firstPositional(args) + if host == "" { + return fmt.Errorf("usage: homelab net check <host> [path]") + } + path := "/" + if len(rest) > 0 && !strings.HasPrefix(rest[0], "-") { + path = rest[0] + if !strings.HasPrefix(path, "/") { + path = "/" + path + } + } + u := "https://" + host + path + fmt.Printf("%s\n", u) + + // external leg: resolve via public DNS, dial the public IP (tests the real CF path) + pubOut, _ := dig(hostOnly(host), "1.1.1.1", "") + if pubIP := firstLine(pubOut); pubIP != "" { + c, d, e := probeURL(clientDialingIP(pubIP, 10*time.Second), u) + fmt.Printf(" external (public %-15s) %s\n", pubIP, fmtProbe(c, d, e)) + } else { + fmt.Println(" external (public) no public A record") + } + // internal leg: dial the Traefik LB directly + c, d, e := probeURL(clientDialingIP(internalLBIP, 10*time.Second), u) + fmt.Printf(" internal (LB %-15s) %s\n", internalLBIP, fmtProbe(c, d, e)) + return nil +} + +func dnsLookup(args []string) error { + name, rest := firstPositional(args) + if name == "" { + return fmt.Errorf("usage: homelab dns lookup <name> [A|AAAA|TXT|MX|PTR]") + } + rr := "" + if len(rest) > 0 { + rr = rest[0] + } + tech, _ := dig(name, "10.0.20.201", rr) + pub, _ := dig(name, "1.1.1.1", rr) + fmt.Printf("technitium (10.0.20.201): %s\n", oneLineList(tech)) + fmt.Printf("public (1.1.1.1) : %s\n", oneLineList(pub)) + if strings.TrimSpace(tech) != strings.TrimSpace(pub) { + fmt.Println("⚠ mismatch — split-horizon (expected for internal-only apps) or a propagation gap") + } + return nil +} + +func hostOnly(h string) string { // strip any path accidentally included + return strings.SplitN(h, "/", 
2)[0] +} + +func oneLineList(s string) string { + s = strings.TrimSpace(s) + if s == "" { + return "(none)" + } + return strings.ReplaceAll(s, "\n", ", ") +} diff --git a/cli/cmd_obs.go b/cli/cmd_obs.go new file mode 100644 index 00000000..33f16e6c --- /dev/null +++ b/cli/cmd_obs.go @@ -0,0 +1,197 @@ +package main + +import ( + "encoding/json" + "fmt" + "net/url" + "sort" + "strconv" + "strings" + "time" +) + +const ( + promHost = "prometheus-query.viktorbarzin.lan" + lokiHost = "loki.viktorbarzin.lan" +) + +func obsCommands() []Command { + return []Command{ + {Path: []string{"metrics", "query"}, Tier: TierRead, + Summary: `Prometheus instant query: metrics query "<promql>" [--json]`, Run: metricsQuery}, + {Path: []string{"metrics", "alerts"}, Tier: TierRead, + Summary: "list currently firing Prometheus alerts", Run: metricsAlerts}, + {Path: []string{"logs", "query"}, Tier: TierRead, + Summary: `Loki query (last --since, default 1h): logs query "<logql>" [--since 1h] [--limit N] [--json]`, Run: logsQuery}, + } +} + +// queryArg joins non-flag args into the query (PromQL/LogQL should normally be +// passed as a single quoted argument; this also tolerates unquoted multi-token). 
+func queryArg(args []string, valueFlags map[string]bool) string { + var parts []string + for i := 0; i < len(args); i++ { + a := args[i] + if valueFlags[a] { + i++ + continue + } + if strings.HasPrefix(a, "-") { + continue + } + parts = append(parts, a) + } + return strings.Join(parts, " ") +} + +func labelStr(m map[string]string) string { + name := m["__name__"] + var kv []string + for k, v := range m { + if k != "__name__" { + kv = append(kv, k+"="+v) + } + } + sort.Strings(kv) + return name + "{" + strings.Join(kv, ",") + "}" +} + +func metricsQuery(args []string) error { + q := queryArg(args, nil) + if q == "" { + return fmt.Errorf(`usage: homelab metrics query "<promql>" [--json]`) + } + v := url.Values{} + v.Set("query", q) + body, err := lbGetBody(promHost, "/api/v1/query", v) + if err != nil { + return err + } + if containsArg(args, "--json") { + fmt.Println(string(body)) + return nil + } + var r struct { + Data struct { + Result []struct { + Metric map[string]string `json:"metric"` + Value []interface{} `json:"value"` + } `json:"result"` + } `json:"data"` + } + if err := json.Unmarshal(body, &r); err != nil { + fmt.Println(string(body)) + return nil + } + if len(r.Data.Result) == 0 { + fmt.Println("(no series)") + return nil + } + for _, s := range r.Data.Result { + val := "" + if len(s.Value) == 2 { + val = fmt.Sprint(s.Value[1]) + } + fmt.Printf("%-14s %s\n", val, labelStr(s.Metric)) + } + return nil +} + +func metricsAlerts(args []string) error { + // prometheus-query is a query-only frontend (no /api/v1/alerts); the firing + // set is exposed as the synthetic ALERTS series, queryable the normal way. 
+ v := url.Values{} + v.Set("query", `ALERTS{alertstate="firing"}`) + body, err := lbGetBody(promHost, "/api/v1/query", v) + if err != nil { + return err + } + if containsArg(args, "--json") { + fmt.Println(string(body)) + return nil + } + var r struct { + Data struct { + Result []struct { + Metric map[string]string `json:"metric"` + } `json:"result"` + } `json:"data"` + } + if err := json.Unmarshal(body, &r); err != nil { + fmt.Println(string(body)) + return nil + } + if len(r.Data.Result) == 0 { + fmt.Println("(no firing alerts)") + return nil + } + for _, a := range r.Data.Result { + m := a.Metric + scope := "" + for _, k := range []string{"namespace", "deployment", "instance", "job", "node"} { + if v := m[k]; v != "" { + scope = k + "=" + v + break + } + } + fmt.Printf("%-9s %-34s %s\n", m["severity"], m["alertname"], scope) + } + return nil +} + +func logsQuery(args []string) error { + q := queryArg(args, map[string]bool{"--since": true, "--limit": true}) + if q == "" { + return fmt.Errorf(`usage: homelab logs query "<logql>" [--since 1h] [--limit N] [--json]`) + } + since := flagValue(args, "--since") + if since == "" { + since = "1h" + } + dur, err := time.ParseDuration(since) + if err != nil { + return fmt.Errorf("bad --since %q: %w", since, err) + } + limit := flagValue(args, "--limit") + if limit == "" { + limit = "100" + } + end := time.Now() + v := url.Values{} + v.Set("query", q) + v.Set("limit", limit) + v.Set("start", strconv.FormatInt(end.Add(-dur).UnixNano(), 10)) + v.Set("end", strconv.FormatInt(end.UnixNano(), 10)) + body, err := lbGetBody(lokiHost, "/loki/api/v1/query_range", v) + if err != nil { + return err + } + if containsArg(args, "--json") { + fmt.Println(string(body)) + return nil + } + var r struct { + Data struct { + Result []struct { + Values [][]string `json:"values"` + } `json:"result"` + } `json:"data"` + } + if err := json.Unmarshal(body, &r); err != nil { + fmt.Println(string(body)) + return nil + } + n := 0 + for _, s := range 
r.Data.Result { + for _, val := range s.Values { + if len(val) == 2 { + fmt.Println(val[1]) + n++ + } + } + } + if n == 0 { + fmt.Println("(no log lines)") + } + return nil +} diff --git a/cli/cmd_tf.go b/cli/cmd_tf.go new file mode 100644 index 00000000..95e0260b --- /dev/null +++ b/cli/cmd_tf.go @@ -0,0 +1,122 @@ +package main + +import ( + "fmt" + "os" + "os/signal" + "path/filepath" + "strings" + "sync" + "syscall" +) + +func tfCommands() []Command { + return []Command{ + {Path: []string{"tf", "plan"}, Tier: TierRead, + Summary: "terragrunt plan a stack (via scripts/tg)", Run: tfPassthrough("plan")}, + {Path: []string{"tf", "validate"}, Tier: TierRead, + Summary: "terragrunt validate a stack", Run: tfPassthrough("validate")}, + {Path: []string{"tf", "fmt"}, Tier: TierRead, + Summary: "terraform fmt a stack's files", Run: tfFmt}, + {Path: []string{"tf", "force-unlock"}, Tier: TierWrite, + Summary: "release a stuck terraform state lock (needs <stack> <lock-id>)", Run: tfForceUnlock}, + {Path: []string{"tf", "apply"}, Tier: TierWrite, + Summary: "terragrunt apply a stack — presence-coupled, out-of-band", Run: tfApply}, + } +} + +// firstPositional returns the first non-flag arg and the remaining args with it removed. +func firstPositional(args []string) (string, []string) { + for i, a := range args { + if !strings.HasPrefix(a, "-") { + rest := append(append([]string{}, args[:i]...), args[i+1:]...) + return a, rest + } + } + return "", args +} + +// resolveTfStack finds the infra root (from cwd) and the stack directory named +// by the first positional arg, returning the remaining args. 
+func resolveTfStack(args []string) (infraRoot, stackName, stackDir string, rest []string, err error) { + stackName, rest = firstPositional(args) + if stackName == "" { + err = fmt.Errorf("missing <stack> argument") + return + } + cwd, e := os.Getwd() + if e != nil { + err = e + return + } + infraRoot, err = findInfraRoot(cwd) + if err != nil { + return + } + stackDir, err = resolveStack(infraRoot, stackName) + return +} + +func tgPath(infraRoot string) string { return filepath.Join(infraRoot, "scripts", "tg") } + +// tfPassthrough runs `scripts/tg <verb> [extra]` in the stack directory. +func tfPassthrough(verb string) func([]string) error { + return func(args []string) error { + infraRoot, _, stackDir, rest, err := resolveTfStack(args) + if err != nil { + return err + } + return runStreamingIn(stackDir, tgPath(infraRoot), append([]string{verb}, rest...)...) + } +} + +func tfFmt(args []string) error { + _, _, stackDir, _, err := resolveTfStack(args) + if err != nil { + return err + } + return runStreamingIn(stackDir, "terraform", "fmt", "-recursive", ".") +} + +func tfForceUnlock(args []string) error { + infraRoot, _, stackDir, rest, err := resolveTfStack(args) + if err != nil { + return err + } + if len(rest) < 1 { + return fmt.Errorf("usage: homelab tf force-unlock <stack> <lock-id>") + } + return runStreamingIn(stackDir, tgPath(infraRoot), "force-unlock", "-force", rest[0]) +} + +// tfApply applies a stack out-of-band: claim the stack on the presence board, +// ALWAYS release on exit (normal, error, or signal — fixing the claim leak), +// and warn that CI applies canonically on push. 
+func tfApply(args []string) error { + infraRoot, stackName, stackDir, _, err := resolveTfStack(args) + if err != nil { + return err + } + label := "stack:" + stackName + fmt.Fprintf(os.Stderr, + "homelab: out-of-band apply of %q — CI applies canonically on push to master.\n", stackName) + + if err := presenceClaim(label, "homelab tf apply "+stackName); err != nil { + return fmt.Errorf("presence claim failed (run `vault login -method=oidc`?): %w", err) + } + // Release exactly once, whether we exit normally, on error, or on signal; + // sync.Once makes it safe for both the defer and the signal goroutine to call it. + var once sync.Once + release := func() { once.Do(func() { _ = presenceRelease(label) }) } + defer release() + + sig := make(chan os.Signal, 1) + signal.Notify(sig, os.Interrupt, syscall.SIGTERM) + go func() { + <-sig + release() + os.Exit(130) + }() + + return runStreamingIn(stackDir, tgPath(infraRoot), "apply", "--non-interactive") +} diff --git a/cli/cmd_tf_test.go b/cli/cmd_tf_test.go new file mode 100644 index 00000000..74f5b9bd --- /dev/null +++ b/cli/cmd_tf_test.go @@ -0,0 +1,27 @@ +package main + +import ( + "reflect" + "testing" +) + +func TestFirstPositional(t *testing.T) { + cases := []struct { + args []string + wantName string + wantRest []string + }{ + {[]string{"vault"}, "vault", []string{}}, + {[]string{"--json", "vault"}, "vault", []string{"--json"}}, + {[]string{"vault", "abc-123"}, "vault", []string{"abc-123"}}, + {[]string{"--foo", "monitoring", "extra"}, "monitoring", []string{"--foo", "extra"}}, + {[]string{"--only-flags"}, "", []string{"--only-flags"}}, + } + for _, c := range cases { + gotName, gotRest := firstPositional(c.args) + if gotName != c.wantName || !reflect.DeepEqual(gotRest, c.wantRest) { + t.Errorf("firstPositional(%v) = (%q, %v), want (%q, %v)", + c.args, gotName, gotRest, c.wantName, c.wantRest) + } + } +} diff --git a/cli/cmd_usage.go b/cli/cmd_usage.go new file mode 100644 index 00000000..e9b7fa8e --- /dev/null +++ 
b/cli/cmd_usage.go @@ -0,0 +1,77 @@ +package main + +import ( + "encoding/json" + "fmt" + "net/url" + "sort" + "strconv" +) + +func usageCommands() []Command { + return []Command{ + {Path: []string{"usage", "top"}, Tier: TierRead, + Summary: "rank homelab verb usage across users (from Loki): usage top [--since 30d] [--user U] [--json]", Run: usageTop}, + } +} + +// usageQuery builds the LogQL metric query that counts invocations per verb. +func usageQuery(since, user string) string { + sel := `job="` + usageJob + `"` + if user != "" { + sel += `, user="` + user + `"` + } + return fmt.Sprintf(`sum by (verb) (count_over_time({%s}[%s]))`, sel, since) +} + +func usageTop(args []string) error { + since := flagValue(args, "--since") + if since == "" { + since = "30d" + } + v := url.Values{} + v.Set("query", usageQuery(since, flagValue(args, "--user"))) + body, err := lbGetBody(lokiHost, "/loki/api/v1/query", v) + if err != nil { + return err + } + if containsArg(args, "--json") { + fmt.Println(string(body)) + return nil + } + var r struct { + Data struct { + Result []struct { + Metric map[string]string `json:"metric"` + Value []interface{} `json:"value"` + } `json:"result"` + } `json:"data"` + } + if err := json.Unmarshal(body, &r); err != nil { + fmt.Println(string(body)) + return nil + } + type row struct { + verb string + n int + } + var rows []row + for _, s := range r.Data.Result { + n := 0 + if len(s.Value) == 2 { + if f, e := strconv.ParseFloat(fmt.Sprint(s.Value[1]), 64); e == nil { + n = int(f) + } + } + rows = append(rows, row{s.Metric["verb"], n}) + } + if len(rows) == 0 { + fmt.Println("(no usage recorded yet)") + return nil + } + sort.Slice(rows, func(i, j int) bool { return rows[i].n > rows[j].n }) + for _, r := range rows { + fmt.Printf("%6d %s\n", r.n, r.verb) + } + return nil +} diff --git a/cli/cmd_vault.go b/cli/cmd_vault.go new file mode 100644 index 00000000..bf270886 --- /dev/null +++ b/cli/cmd_vault.go @@ -0,0 +1,663 @@ +package main + +import ( + 
"bufio" + "encoding/base64" + "encoding/json" + "fmt" + "os" + "os/exec" + "strings" + "syscall" +) + +// vault verbs give each unix user no-HITL access to THEIR OWN Vaultwarden vault. +// Identity is the kernel UID; per-user creds live in that user's isolated Vault +// path (secret/workstation/claude-users/<user>) read via their scoped token, and +// decryption is done by the official `bw` CLI. See +// docs/superpowers/specs/2026-06-24-homelab-vault-design.md. +func vaultCommands() []Command { + return []Command{ + {Path: []string{"vault", "setup"}, Tier: TierWrite, + Summary: "one-time: store your Vaultwarden master password + API key in your Vault path", Run: vaultSetup}, + {Path: []string{"vault", "status"}, Tier: TierRead, + Summary: "show whether your vault is configured/reachable (no secrets)", Run: vaultStatus}, + {Path: []string{"vault", "list"}, Tier: TierRead, + Summary: "list your item names: vault list [--search Q]", Run: vaultList}, + {Path: []string{"vault", "get"}, Tier: TierRead, + Summary: "fetch one item: vault get <name> [--field password|username|uri|notes|totp] [--json]", Run: vaultGet}, + {Path: []string{"vault", "search"}, Tier: TierRead, + Summary: "search your item names: vault search <query>", Run: vaultSearch}, + {Path: []string{"vault", "code"}, Tier: TierRead, + Summary: "current TOTP code for an item: vault code <name>", Run: vaultCode}, + {Path: []string{"vault", "lock"}, Tier: TierWrite, + Summary: "lock/log out the local bw session", Run: vaultLock}, + {Path: []string{"vault"}, Tier: TierRead, + Summary: "Vaultwarden access for your own vault (run `homelab vault` for help)", + Run: func([]string) error { fmt.Print(vaultHelp()); return nil }}, + } +} + +// vaultHelp is shown for bare `homelab vault`. 
+func vaultHelp() string { + return `homelab vault — read YOUR OWN Vaultwarden logins (no-HITL after one-time setup) + + homelab vault setup one-time: store your master password + API key in your Vault path + homelab vault status configured / unlocked / reachable (no secrets) + homelab vault list [--search Q] list your item names (no secrets) + homelab vault get <name> [--field password|username|uri|notes|totp] [--json] + TTY → clipboard (auto-clears); piped → stdout + homelab vault code <name> current TOTP code + homelab vault lock lock / log out the local bw session + +Creds live only in your own Vault path; the admin never sees them. Identity is +your unix UID. Security model: docs/superpowers/specs/2026-06-24-homelab-vault-design.md +(note: anything running as your user can decrypt your vault — the accepted no-HITL trade). +` +} + +const vwUserPathPrefix = "secret/workstation/claude-users/" + +// vwCreds is one user's Vaultwarden auth material, read from their Vault path. +type vwCreds struct { + Email string + MasterPassword string + ClientID string + ClientSecret string +} + +// cmdRunner shells out to an external command with an explicit environment and +// returns trimmed stdout. Secrets are passed via envv, NEVER argv. Tests inject +// a fake; realRunner is the production implementation. +type cmdRunner func(name string, argv, envv []string) (string, error) + +func realRunner(name string, argv, envv []string) (string, error) { + cmd := exec.Command(name, argv...) + if envv != nil { + cmd.Env = envv + } + out, err := cmd.Output() + // Trim only the trailing newline the tool appends — NOT all whitespace, so a + // fetched secret with significant leading/trailing spaces is preserved. + return strings.TrimRight(string(out), "\r\n"), err +} + +// realRunnerStdin runs a command feeding `stdin` to it, for secret values that +// must NOT appear in argv (visible via ps / /proc/<pid>/cmdline to same-UID +// processes). 
Used by setup to write the master password / client_secret. +func realRunnerStdin(name string, argv, envv []string, stdin string) (string, error) { + cmd := exec.Command(name, argv...) + if envv != nil { + cmd.Env = envv + } + cmd.Stdin = strings.NewReader(stdin) + out, err := cmd.Output() + return strings.TrimRight(string(out), "\r\n"), err +} + +func vwCredsPath(user string) string { return vwUserPathPrefix + user } + +func bwAppDataDir(uid string) string { return "/run/user/" + uid + "/homelab-bw" } + +// readVaultField returns one field from a KV-v2 path, "" if absent/error. +func readVaultField(run cmdRunner, field, path string) string { + out, err := run("vault", []string{"kv", "get", "-field=" + field, path}, nil) + if err != nil { + return "" + } + return out +} + +// loadCreds reads the four vaultwarden_* keys from the user's isolated path. +// A missing master password means the user hasn't onboarded. +func loadCreds(run cmdRunner, user string) (vwCreds, error) { + p := vwCredsPath(user) + c := vwCreds{ + Email: readVaultField(run, "vaultwarden_email", p), + MasterPassword: readVaultField(run, "vaultwarden_master_password", p), + ClientID: readVaultField(run, "vaultwarden_client_id", p), + ClientSecret: readVaultField(run, "vaultwarden_client_secret", p), + } + if c.MasterPassword == "" { + return vwCreds{}, fmt.Errorf("vault not configured for this user — run `homelab vault setup`") + } + return c, nil +} + +// vaultCurrentUser/vaultCurrentUID are seams for tests (avoid conflict with repo.go's currentUser func). +var vaultCurrentUser = func() string { return os.Getenv("USER") } +var vaultCurrentUID = func() string { return fmt.Sprintf("%d", os.Getuid()) } + +// bwBaseEnv is the minimal non-secret environment bw/node need. We deliberately +// do NOT inherit the full parent env (keeps stray secrets out of the child). 
+func bwBaseEnv(appdata string) []string { + path := os.Getenv("PATH") + if path == "" { + path = "/usr/local/bin:/usr/bin:/bin" + } + return []string{ + "PATH=" + path, + "HOME=" + os.Getenv("HOME"), + "BITWARDENCLI_APPDATA_DIR=" + appdata, + "BW_NOINTERACTION=true", + } +} + +// bwSecretEnv adds the secret-bearing vars. session may be "" (pre-unlock). +func bwSecretEnv(appdata string, c vwCreds, session string) []string { + env := bwBaseEnv(appdata) + env = append(env, + "BW_CLIENTID="+c.ClientID, + "BW_CLIENTSECRET="+c.ClientSecret, + "BW_PASSWORD="+c.MasterPassword, + ) + if session != "" { + env = append(env, "BW_SESSION="+session) + } + return env +} + +func bwLoginArgs() []string { return []string{"login", "--apikey"} } +func bwUnlockArgs() []string { return []string{"unlock", "--passwordenv", "BW_PASSWORD", "--raw"} } +func bwGetArgs(field, name string) []string { return []string{"get", field, name} } +func bwStatusArgs() []string { return []string{"status"} } + +// bwNeedsLogin parses `bw status` JSON and reports whether a `bw login` is +// required. Unparseable/empty output → true (safer to attempt login). +func bwNeedsLogin(statusJSON string) bool { + var s struct { + Status string `json:"status"` + } + if err := json.Unmarshal([]byte(statusJSON), &s); err != nil { + return true + } + return s.Status == "unauthenticated" || s.Status == "" +} + +func bwListArgs(search string) []string { + a := []string{"list", "items"} + if search != "" { + a = append(a, "--search", search) + } + return a +} + +// bwUnlock runs `bw unlock` and returns the raw session key. +func bwUnlock(run cmdRunner, env []string) (string, error) { + out, err := run("bw", bwUnlockArgs(), env) + if err != nil { + return "", fmt.Errorf("bw unlock failed (wrong master password? run `homelab vault setup`): %w", err) + } + return out, nil +} + +// bwGet fetches one field of one item; session must be present in env. 
+func bwGet(run cmdRunner, env []string, field, name string) (string, error) { + return run("bw", bwGetArgs(field, name), env) +} + +func returnMode(isTTY bool) string { + if isTTY { + return "clipboard" + } + return "stdout" +} + +// stdoutIsTTY reports whether stdout is a character device (a terminal). +func stdoutIsTTY() bool { + fi, err := os.Stdout.Stat() + if err != nil { + return false + } + return fi.Mode()&os.ModeCharDevice != 0 +} + +// stderrIsTTY reports whether stderr is a terminal (the OSC52 escape is written +// to stderr, so the clipboard path is only viable when stderr is a terminal). +func stderrIsTTY() bool { + fi, err := os.Stderr.Stat() + if err != nil { + return false + } + return fi.Mode()&os.ModeCharDevice != 0 +} + +// osc52 returns the OSC 52 escape that makes the local terminal copy payload to +// the system clipboard (works over SSH; no X11). osc52clear copies empty. +func osc52(payload string) string { + return "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte(payload)) + "\a" +} +func osc52clear() string { return "\x1b]52;c;\a" } + +// terminalAllowed gates OSC 52: only terminals known to honor clipboard writes, +// else we'd dump the secret's base64 into scrollback on unsupported terminals. +func terminalAllowed(term, termProgram string) bool { + t := strings.ToLower(term) + p := strings.ToLower(termProgram) + for _, ok := range []string{"kitty", "alacritty", "foot", "wezterm", "ghostty", "tmux", "screen"} { + if strings.Contains(t, ok) || strings.Contains(p, ok) { + return true + } + } + // xterm proper supports it only when the program is a known-good emulator. + return false +} + +// opRecord is one CLI operation. ItemName is accepted for the caller's +// convenience but is INTENTIONALLY never rendered into the log line — auditing +// which of your own logins you opened is itself sensitive, and per-item reads +// are invisible server-side anyway (spec §9a). 
+type opRecord struct { + User string + Verb string + PID int + PPID int + ParentComm string + ItemName string // never logged +} + +func opLogLine(r opRecord) string { + return fmt.Sprintf("user=%s verb=%s pid=%d ppid=%d parent=%s", + r.User, r.Verb, r.PID, r.PPID, r.ParentComm) +} + +// parentComm reads /proc/<ppid>/comm (best-effort; "" on failure). +func parentComm(ppid int) string { + b, err := os.ReadFile(fmt.Sprintf("/proc/%d/comm", ppid)) + if err != nil { + return "" + } + return strings.TrimSpace(string(b)) +} + +// writeOpLog appends one privacy-aware line to the user's op-log (best-effort; +// never blocks or fails the command). Goes to syslog so it ships to Loki. +func writeOpLog(r opRecord) { + exec.Command("logger", "-t", "homelab-vault", opLogLine(r)).Run() // best-effort +} + +func vaultLockPath(uid string) string { return "/run/user/" + uid + "/homelab-vault.lock" } + +// hardenProcess disables core dumps so a bw/homelab crash can't spill the master +// password to a core file. Best-effort. +func hardenProcess() { + _ = syscall.Setrlimit(syscall.RLIMIT_CORE, &syscall.Rlimit{Cur: 0, Max: 0}) +} + +// withUserLock serializes bw mutations for this user (concurrent Claude sessions +// as the same user otherwise race bw's appdata). Returns an unlock func. +func withUserLock(uid string) (func(), error) { + f, err := os.OpenFile(vaultLockPath(uid), os.O_CREATE|os.O_RDWR, 0600) + if err != nil { + return nil, err + } + if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil { + f.Close() + return nil, err + } + return func() { syscall.Flock(int(f.Fd()), syscall.LOCK_UN); f.Close() }, nil +} + +// session is one usable bw context: the env (with BW_SESSION) ready for `bw get`. +type session struct { + env []string +} + +// openSession resolves creds, ensures login, unlocks, and returns a ready env. +// Caller must hold the user lock. appdata is created on tmpfs (0700). 
+func openSession(run cmdRunner, user, uid string) (session, error) { + creds, err := loadCreds(run, user) + if err != nil { + return session{}, err + } + appdata := bwAppDataDir(uid) + if err := os.MkdirAll(appdata, 0700); err != nil { + return session{}, fmt.Errorf("create bw appdata %s: %w", appdata, err) + } + loginEnv := bwSecretEnv(appdata, creds, "") + // Ensure server is set and we're logged in (idempotent; ignore "already"). + _, _ = run("bw", []string{"config", "server", "https://vaultwarden.viktorbarzin.me"}, loginEnv) + st, _ := run("bw", bwStatusArgs(), loginEnv) + if bwNeedsLogin(st) { + if _, err := run("bw", bwLoginArgs(), loginEnv); err != nil { + return session{}, fmt.Errorf("bw login --apikey failed (API key valid? run `homelab vault setup`): %w", err) + } + } + sess, err := bwUnlock(run, loginEnv) + if err != nil { + return session{}, err + } + return session{env: bwSecretEnv(appdata, creds, sess)}, nil +} + +type getOpts struct { + name string + field string + json bool +} + +var validGetFields = map[string]bool{"password": true, "username": true, "uri": true, "notes": true, "totp": true} + +func parseGetArgs(args []string) (getOpts, error) { + o := getOpts{field: "password"} + for i := 0; i < len(args); i++ { + a := args[i] + switch { + case a == "--json": + o.json = true + case a == "--field" && i+1 < len(args): + o.field = args[i+1] + i++ + case strings.HasPrefix(a, "--field="): + o.field = strings.TrimPrefix(a, "--field=") + case !strings.HasPrefix(a, "-") && o.name == "": + o.name = a + } + } + if o.name == "" { + return o, fmt.Errorf("usage: homelab vault get <name> [--field password|username|uri|notes|totp] [--json]") + } + if !validGetFields[o.field] { + return o, fmt.Errorf("invalid --field %q (want password|username|uri|notes|totp)", o.field) + } + return o, nil +} + +// getValue opens a session and fetches one field. Apart from the runner, its only +// side effect is creating the bw appdata dir, so it is unit-tested with a fake runner. 
+func getValue(run cmdRunner, user, uid string, o getOpts) (string, error) { + s, err := openSession(run, user, uid) + if err != nil { + return "", err + } + return bwGet(run, s.env, o.field, o.name) +} + +// clipboardDecision picks how to return a secret value. "stdout" prints it (a +// pipe/agent — the intended machine path); "clipboard" copies via OSC52; +// "refuse" emits nothing sensitive (would otherwise risk dumping the secret's +// base64 into scrollback, or silently fail because the OSC52 escape goes to a +// non-terminal stderr). +func clipboardDecision(stdoutTTY, stderrTTY bool, term, termProgram string) string { + if !stdoutTTY { + return "stdout" + } + if terminalAllowed(term, termProgram) && stderrTTY { + return "clipboard" + } + return "refuse" +} + +// jsonToStdoutOK reports whether `--json` may print the secret to stdout — only +// when stdout is NOT a terminal (i.e. piped to a machine consumer). +func jsonToStdoutOK(stdoutTTY bool) bool { return !stdoutTTY } + +// emitSecret returns a value TTY-aware (see clipboardDecision). Never prints the +// secret to a terminal's stdout/scrollback. +func emitSecret(value string) { + switch clipboardDecision(stdoutIsTTY(), stderrIsTTY(), os.Getenv("TERM"), os.Getenv("TERM_PROGRAM")) { + case "stdout": + fmt.Println(value) + case "clipboard": + fmt.Fprint(os.Stderr, osc52(value)) + fmt.Fprintln(os.Stderr, "copied to clipboard; clearing in 30s") + clearClipboardAfter(30) + default: // refuse + fmt.Fprintln(os.Stderr, "refusing to print secret: this terminal can't do OSC52 clipboard safely; pipe the command (e.g. | cat) or use a supported terminal") + } +} + +// clearClipboardAfter spawns a detached background clear so the secret doesn't +// linger in the clipboard. Best-effort. Writes to /dev/tty because os/exec connects an unset Stdout to the null device. +func clearClipboardAfter(seconds int) { + exec.Command("sh", "-c", fmt.Sprintf("sleep %d; printf '%s' >/dev/tty", seconds, osc52clear())).Start() +} + +// listNames extracts "name (id)" from `bw list items` JSON; never values. 
+func listNames(jsonOut string) []string { + var items []struct { + ID string `json:"id"` + Name string `json:"name"` + } + if err := json.Unmarshal([]byte(jsonOut), &items); err != nil { + return nil + } + out := make([]string, 0, len(items)) + for _, it := range items { + out = append(out, fmt.Sprintf("%s (%s)", it.Name, it.ID)) + } + return out +} + +func runList(run cmdRunner, user, uid, search string) ([]string, error) { + s, err := openSession(run, user, uid) + if err != nil { + return nil, err + } + out, err := run("bw", bwListArgs(search), s.env) + if err != nil { + return nil, err + } + return listNames(out), nil +} + +func vaultList(args []string) error { + hardenProcess() + search := "" + for i := 0; i < len(args); i++ { + if args[i] == "--search" && i+1 < len(args) { + search = args[i+1] + i++ + } else if strings.HasPrefix(args[i], "--search=") { + search = strings.TrimPrefix(args[i], "--search=") + } + } + uid := vaultCurrentUID() + unlock, err := withUserLock(uid) + if err != nil { + return err + } + defer unlock() + names, err := runList(realRunner, vaultCurrentUser(), uid, search) + if err != nil { + return err + } + for _, n := range names { + fmt.Println(n) + } + return nil +} + +func vaultSearch(args []string) error { + if len(args) == 0 { + return fmt.Errorf("usage: homelab vault search <query>") + } + return vaultList([]string{"--search", strings.Join(args, " ")}) +} + +func vaultCode(args []string) error { + hardenProcess() + if len(args) == 0 { + return fmt.Errorf("usage: homelab vault code <name>") + } + name := args[0] + uid := vaultCurrentUID() + unlock, err := withUserLock(uid) + if err != nil { + return err + } + defer unlock() + user := vaultCurrentUser() + val, err := getValue(realRunner, user, uid, getOpts{name: name, field: "totp"}) + if err != nil { + return err + } + // TOTP is the most sensitive op: log AND emit an ntfy-bound marker (spec §9a-d). 
+ writeOpLog(opRecord{User: user, Verb: "code", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: name}) + exec.Command("logger", "-t", "homelab-vault-totp", "user="+user+" totp-fetch parent="+parentComm(os.Getppid())).Run() + emitSecret(val) + return nil +} + +// statusSummary reports config/reachability without revealing secrets. +func statusSummary(run cmdRunner, user, uid string) string { + if _, err := loadCreds(run, user); err != nil { + return "vault: not configured — run `homelab vault setup`" + } + s, err := openSession(run, user, uid) + if err != nil { + return "vault: configured, but unlock/login FAILED (creds stale? run `homelab vault setup`): " + err.Error() + } + if _, err := run("bw", []string{"sync"}, s.env); err != nil { + return "vault: configured + unlocked, but sync/reachability failed: " + err.Error() + } + return "vault: configured, unlocked, reachable ✓" +} + +func vaultStatus(args []string) error { + hardenProcess() + uid := vaultCurrentUID() + unlock, err := withUserLock(uid) + if err != nil { + return err + } + defer unlock() + fmt.Println(statusSummary(realRunner, vaultCurrentUser(), uid)) + return nil +} + +func vaultLock(args []string) error { + uid := vaultCurrentUID() + unlock, err := withUserLock(uid) // logout mutates bw state — serialize with get/list + if err != nil { + return err + } + defer unlock() + appdata := bwAppDataDir(uid) + _, _ = realRunner("bw", []string{"lock"}, bwBaseEnv(appdata)) + _, logoutErr := realRunner("bw", []string{"logout"}, bwBaseEnv(appdata)) + if logoutErr == nil { + fmt.Println("locked") + } + return nil // lock/logout best-effort; never error the caller +} + +// vaultPatchPublicArgs writes the non-secret identifiers via argv. Neither the +// email nor the API client_id is a usable credential on its own. 
+func vaultPatchPublicArgs(user, email, clientID string) []string { + return []string{"kv", "patch", vwCredsPath(user), + "vaultwarden_email=" + email, + "vaultwarden_client_id=" + clientID, + } +} + +// vaultPatchSecretArgs writes ONE secret value via the `key=-` stdin form, so +// the value never appears in argv (ps / /proc/<pid>/cmdline). The value is fed +// on stdin by realRunnerStdin. +func vaultPatchSecretArgs(user, key string) []string { + return []string{"kv", "patch", vwCredsPath(user), key + "=-"} +} + +// writeCreds stores all four fields in the user's Vault path. The two real +// secrets (master password, API client_secret) go via stdin — never argv. +func writeCreds(user string, c vwCreds) error { + if _, err := realRunner("vault", vaultPatchPublicArgs(user, c.Email, c.ClientID), nil); err != nil { + return err + } + if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_master_password"), nil, c.MasterPassword); err != nil { + return err + } + if _, err := realRunnerStdin("vault", vaultPatchSecretArgs(user, "vaultwarden_client_secret"), nil, c.ClientSecret); err != nil { + return err + } + return nil +} + +// promptNoEcho reads one line without terminal echo (for the master password). +func promptNoEcho(prompt string) (string, error) { + fmt.Fprint(os.Stderr, prompt) + exec.Command("stty", "-echo").Run() + defer func() { exec.Command("stty", "echo").Run(); fmt.Fprintln(os.Stderr) }() + r := bufio.NewReader(os.Stdin) + line, err := r.ReadString('\n') + // Trim only the line terminator — a master password / API secret may + // legitimately contain leading/trailing spaces. + return strings.TrimRight(line, "\r\n"), err +} + +func promptLine(prompt string) (string, error) { + fmt.Fprint(os.Stderr, prompt) + line, err := bufio.NewReader(os.Stdin).ReadString('\n') + return strings.TrimSpace(line), err +} + +func vaultSetup(args []string) error { + hardenProcess() + fmt.Fprintln(os.Stderr, "One-time setup. 
Stored ONLY in your own Vault path; the admin never sees it.") + fmt.Fprintln(os.Stderr, "Get your API key at https://vaultwarden.viktorbarzin.me → Settings → Security → Keys → View API key.") + email, err := promptLine("Vaultwarden email: ") + if err != nil { + return err + } + clientID, err := promptLine("API key client_id (user.xxxx): ") + if err != nil { + return err + } + clientSecret, err := promptNoEcho("API key client_secret: ") + if err != nil { + return err + } + master, err := promptNoEcho("Master password: ") + if err != nil { + return err + } + if master == "" || clientID == "" || clientSecret == "" { + return fmt.Errorf("all fields are required") + } + c := vwCreds{Email: email, MasterPassword: master, ClientID: clientID, ClientSecret: clientSecret} + if err := writeCreds(vaultCurrentUser(), c); err != nil { + return fmt.Errorf("writing creds to your Vault path failed (scoped token present?): %w", err) + } + fmt.Fprintln(os.Stderr, "Stored. Verifying unlock…") + uid := vaultCurrentUID() + unlock, err := withUserLock(uid) + if err != nil { + return err + } + defer unlock() + if _, err := openSession(realRunner, vaultCurrentUser(), uid); err != nil { + return fmt.Errorf("stored, but verification failed — double-check master password / API key: %w", err) + } + fmt.Fprintln(os.Stderr, "✓ Verified. Fetches are now AFK.") + return nil +} + +func vaultGet(args []string) error { + hardenProcess() + o, err := parseGetArgs(args) + if err != nil { + return err + } + uid := vaultCurrentUID() + unlock, err := withUserLock(uid) + if err != nil { + return err + } + defer unlock() + user := vaultCurrentUser() + val, err := getValue(realRunner, user, uid, o) + if err != nil { + return err + } + writeOpLog(opRecord{User: user, Verb: "get", PID: os.Getpid(), PPID: os.Getppid(), ParentComm: parentComm(os.Getppid()), ItemName: o.name}) + if o.json { + if !jsonToStdoutOK(stdoutIsTTY()) { + return fmt.Errorf("refusing to print a secret as JSON to a terminal; pipe it (e.g. 
| cat) or drop --json") + } + fmt.Printf("{%q:%q}\n", o.field, val) + return nil + } + emitSecret(val) + return nil +} + diff --git a/cli/cmd_vault_test.go b/cli/cmd_vault_test.go new file mode 100644 index 00000000..36aab1f4 --- /dev/null +++ b/cli/cmd_vault_test.go @@ -0,0 +1,368 @@ +package main + +import ( + "encoding/base64" + "fmt" + "os" + "reflect" + "strings" + "testing" +) + +func TestVaultCommandsRegistered(t *testing.T) { + want := map[string]Tier{ + "vault setup": TierWrite, + "vault status": TierRead, + "vault list": TierRead, + "vault get": TierRead, + "vault search": TierRead, + "vault code": TierRead, + "vault lock": TierWrite, + } + got := map[string]Tier{} + for _, c := range vaultCommands() { + got[c.name()] = c.Tier + } + for name, tier := range want { + if got[name] != tier { + t.Errorf("command %q: tier=%q, want %q (registered=%v)", name, got[name], tier, got[name] != "") + } + } +} + +func TestVaultGroupInRegistry(t *testing.T) { + if !isCommandGroup(buildRegistry(), "vault") { + t.Fatal("`vault` group not wired into buildRegistry()") + } +} + +func TestVaultCredsPath(t *testing.T) { + if got := vwCredsPath("emo"); got != "secret/workstation/claude-users/emo" { + t.Fatalf("vwCredsPath = %q", got) + } +} + +func TestBwAppDataDir(t *testing.T) { + if got := bwAppDataDir("1001"); got != "/run/user/1001/homelab-bw" { + t.Fatalf("bwAppDataDir = %q", got) + } +} + +// fakeRunner records calls and returns canned stdout/err, prefix-matched against the full command line (name + argv). 
+type fakeRunner struct { + calls [][]string + out map[string]string // key: name+" "+strings.Join(argv," ") prefix-matched + err map[string]error + lastEnv []string +} + +func (f *fakeRunner) run(name string, argv, envv []string) (string, error) { + f.calls = append(f.calls, append([]string{name}, argv...)) + f.lastEnv = envv + key := name + " " + strings.Join(argv, " ") + for k, v := range f.out { + if strings.HasPrefix(key, k) { + return v, f.err[k] + } + } + return "", f.err[key] +} + +func TestLoadCredsReadsFourFields(t *testing.T) { + f := &fakeRunner{out: map[string]string{ + "vault kv get -field=vaultwarden_email secret/workstation/claude-users/emo": "emo@x.me", + "vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "hunter2", + "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.abc", + "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "sek", + }} + c, err := loadCreds(f.run, "emo") + if err != nil { + t.Fatalf("loadCreds: %v", err) + } + want := vwCreds{Email: "emo@x.me", MasterPassword: "hunter2", ClientID: "user.abc", ClientSecret: "sek"} + if !reflect.DeepEqual(c, want) { + t.Fatalf("loadCreds = %+v want %+v", c, want) + } +} + +func TestLoadCredsUnconfigured(t *testing.T) { + f := &fakeRunner{out: map[string]string{}} // every field empty + if _, err := loadCreds(f.run, "emo"); err == nil || !strings.Contains(err.Error(), "not configured") { + t.Fatalf("want 'not configured' error, got %v", err) + } +} + +func TestBwEnvCarriesSecretsNotArgv(t *testing.T) { + c := vwCreds{ClientID: "user.abc", ClientSecret: "sek", MasterPassword: "hunter2"} + env := bwSecretEnv("/run/user/1001/homelab-bw", c, "SESSIONKEY") + joined := strings.Join(env, "\n") + for _, want := range []string{ + "BW_CLIENTID=user.abc", "BW_CLIENTSECRET=sek", "BW_PASSWORD=hunter2", + "BW_SESSION=SESSIONKEY", "BITWARDENCLI_APPDATA_DIR=/run/user/1001/homelab-bw", + } { + if 
!strings.Contains(joined, want) { + t.Errorf("bwSecretEnv missing %q", want) + } + } + if !strings.Contains(joined, "PATH=") { + t.Error("bwSecretEnv must keep a PATH so node/bw resolve") + } +} + +func TestBwGetArgsHasNoSessionInArgv(t *testing.T) { + argv := bwGetArgs("password", "github") + for _, a := range argv { + if strings.Contains(a, "SESSION") || a == "--session" { + t.Fatalf("session must travel via env, not argv: %v", argv) + } + } + if !reflect.DeepEqual(argv, []string{"get", "password", "github"}) { + t.Fatalf("bwGetArgs = %v", argv) + } +} + +func TestBwListArgs(t *testing.T) { + if got := bwListArgs(""); !reflect.DeepEqual(got, []string{"list", "items"}) { + t.Fatalf("bwListArgs('') = %v", got) + } + if got := bwListArgs("git"); !reflect.DeepEqual(got, []string{"list", "items", "--search", "git"}) { + t.Fatalf("bwListArgs('git') = %v", got) + } +} + +func TestBwUnlockReturnsSession(t *testing.T) { + f := &fakeRunner{out: map[string]string{"bw unlock": "THE-SESSION-KEY"}} + env := bwSecretEnv("/run/user/1001/homelab-bw", vwCreds{MasterPassword: "pw"}, "") + sess, err := bwUnlock(f.run, env) + if err != nil || sess != "THE-SESSION-KEY" { + t.Fatalf("bwUnlock = %q, %v", sess, err) + } + // argv must use --passwordenv + --raw, never the password literal + last := f.calls[len(f.calls)-1] + if strings.Join(last, " ") != "bw unlock --passwordenv BW_PASSWORD --raw" { + t.Fatalf("unlock argv = %v", last) + } +} + +func TestReturnMode(t *testing.T) { + if returnMode(true) != "clipboard" || returnMode(false) != "stdout" { + t.Fatal("returnMode wrong") + } +} + +func TestOSC52Encode(t *testing.T) { + got := osc52("secret") + want := "\x1b]52;c;" + base64.StdEncoding.EncodeToString([]byte("secret")) + "\a" + if got != want { + t.Fatalf("osc52 = %q want %q", got, want) + } + if osc52clear() != "\x1b]52;c;\a" { + t.Fatalf("osc52clear wrong: %q", osc52clear()) + } +} + +func TestTerminalAllowed(t *testing.T) { + allow := []struct{ term, prog string }{ + 
{"xterm-kitty", ""}, {"alacritty", ""}, {"foot", ""}, {"tmux-256color", ""}, + {"screen-256color", ""}, {"xterm-256color", "WezTerm"}, {"xterm-256color", "ghostty"}, + } + for _, c := range allow { + if !terminalAllowed(c.term, c.prog) { + t.Errorf("terminalAllowed(%q,%q) = false, want true", c.term, c.prog) + } + } + deny := []struct{ term, prog string }{{"dumb", ""}, {"", ""}, {"vt100", ""}} + for _, c := range deny { + if terminalAllowed(c.term, c.prog) { + t.Errorf("terminalAllowed(%q,%q) = true, want false", c.term, c.prog) + } + } +} + +func TestOpLogLineHasNoSecretOrItem(t *testing.T) { + line := opLogLine(opRecord{User: "emo", Verb: "get", PID: 10, PPID: 9, ParentComm: "claude", ItemName: "Chase Bank"}) + for _, must := range []string{"user=emo", "verb=get", "ppid=9", "parent=claude"} { + if !strings.Contains(line, must) { + t.Errorf("op-log missing %q: %s", must, line) + } + } + for _, mustNot := range []string{"Chase", "password", "secret"} { + if strings.Contains(line, mustNot) { + t.Fatalf("op-log LEAKS %q (privacy violation): %s", mustNot, line) + } + } +} + +func TestLockPath(t *testing.T) { + if got := vaultLockPath("1001"); got != "/run/user/1001/homelab-vault.lock" { + t.Fatalf("vaultLockPath = %q", got) + } +} + +func TestParseGetArgs(t *testing.T) { + o, err := parseGetArgs([]string{"github", "--field", "username", "--json"}) + if err != nil || o.name != "github" || o.field != "username" || !o.json { + t.Fatalf("parseGetArgs = %+v err=%v", o, err) + } + d, _ := parseGetArgs([]string{"github"}) + if d.field != "password" || d.json { + t.Fatalf("defaults wrong: %+v", d) + } + if _, err := parseGetArgs([]string{}); err == nil { + t.Fatal("get with no name must error") + } + if _, err := parseGetArgs([]string{"x", "--field", "evil"}); err == nil { + t.Fatal("invalid --field must error") + } +} + +func TestListNamesParsing(t *testing.T) { + // bw list items returns JSON; listNames extracts name + id only. 
+ js := `[{"id":"1","name":"GitHub","login":{"username":"u"}},{"id":"2","name":"AWS"}]` + names := listNames(js) + if len(names) != 2 || names[0] != "GitHub (1)" || names[1] != "AWS (2)" { + t.Fatalf("listNames = %v", names) + } +} + +func TestStatusSummaryUnconfigured(t *testing.T) { + f := &fakeRunner{out: map[string]string{}} // no creds + s := statusSummary(f.run, "emo", "1001") + if !strings.Contains(s, "not configured") { + t.Fatalf("status = %q", s) + } +} + +func TestVaultPatchPublicArgs(t *testing.T) { + got := vaultPatchPublicArgs("emo", "e@x.me", "user.ci") + want := []string{"kv", "patch", "secret/workstation/claude-users/emo", + "vaultwarden_email=e@x.me", "vaultwarden_client_id=user.ci"} + if !reflect.DeepEqual(got, want) { + t.Fatalf("vaultPatchPublicArgs = %v", got) + } + for _, a := range got { + if strings.Contains(a, "master_password") || strings.Contains(a, "client_secret") { + t.Fatalf("secret key leaked into public argv: %v", got) + } + } +} + +func TestVaultPatchSecretArgsNoValueInArgv(t *testing.T) { + for _, key := range []string{"vaultwarden_master_password", "vaultwarden_client_secret"} { + got := vaultPatchSecretArgs("emo", key) + want := []string{"kv", "patch", "secret/workstation/claude-users/emo", key + "=-"} + if !reflect.DeepEqual(got, want) { + t.Fatalf("vaultPatchSecretArgs(%q) = %v", key, got) + } + if got[len(got)-1] != key+"=-" { + t.Fatalf("secret value must be read from stdin (`%s=-`), got %v", key, got) + } + } +} + +// TestNoSecretInArgvAcrossFlow is the load-bearing security test: across the +// whole get flow (vault reads, bw config/status/login/unlock/get) NO secret +// value may appear in any command's argv — secrets travel via env/stdin only. 
+func TestNoSecretInArgvAcrossFlow(t *testing.T) { + uid := fmt.Sprintf("%d", os.Getuid()) + f := &fakeRunner{out: map[string]string{ + "vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "SUPERSECRETPW", + "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x", + "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "CLIENTSEKRET", + "bw status": `{"status":"locked"}`, + "bw unlock": "SESSIONXYZ", + "bw get password github": "p@ss", + }} + if _, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"}); err != nil { + t.Fatalf("getValue: %v", err) + } + for _, call := range f.calls { + for _, arg := range call { + for _, s := range []string{"SUPERSECRETPW", "CLIENTSEKRET", "SESSIONXYZ"} { + if strings.Contains(arg, s) { + t.Errorf("secret %q leaked into argv: %v", s, call) + } + } + } + } + if !strings.Contains(strings.Join(f.lastEnv, "\n"), "BW_SESSION=SESSIONXYZ") { + t.Error("expected BW_SESSION in the bw get env (test would be vacuous otherwise)") + } +} + +func TestClipboardDecision(t *testing.T) { + cases := []struct { + stdoutTTY, stderrTTY bool + term, prog, want string + }{ + {false, true, "xterm-kitty", "", "stdout"}, + {true, true, "xterm-kitty", "", "clipboard"}, + {true, true, "dumb", "", "refuse"}, + {true, false, "xterm-kitty", "", "refuse"}, + } + for _, c := range cases { + if got := clipboardDecision(c.stdoutTTY, c.stderrTTY, c.term, c.prog); got != c.want { + t.Errorf("clipboardDecision(%v,%v,%q) = %q, want %q", c.stdoutTTY, c.stderrTTY, c.term, got, c.want) + } + } +} + +func TestJSONToStdoutOK(t *testing.T) { + if jsonToStdoutOK(true) { + t.Error("must refuse JSON secret on a terminal") + } + if !jsonToStdoutOK(false) { + t.Error("must allow JSON when piped") + } +} + +func TestBwNeedsLogin(t *testing.T) { + if !bwNeedsLogin(`{"status":"unauthenticated"}`) { + t.Error("unauthenticated → needs login") + } + if 
bwNeedsLogin(`{"status":"locked"}`) { + t.Error("locked → no login (just unlock)") + } + if bwNeedsLogin(`{"status":"unlocked"}`) { + t.Error("unlocked → no login") + } + if !bwNeedsLogin(`not json`) { + t.Error("unparseable → attempt login") + } +} + +func TestVaultHelpMentionsSecurity(t *testing.T) { + h := vaultHelp() + for _, want := range []string{"homelab vault get", "no-HITL", "your own", "setup"} { + if !strings.Contains(h, want) { + t.Errorf("vault help missing %q", want) + } + } +} + +func TestVaultBareGroupRegistered(t *testing.T) { + for _, c := range vaultCommands() { + if len(c.Path) == 1 && c.Path[0] == "vault" { + return + } + } + t.Fatal("bare `vault` help command not registered") +} + +// getValue is the testable core: given a runner + opts, returns the secret value. +func TestGetValueFlow(t *testing.T) { + f := &fakeRunner{out: map[string]string{ + "vault kv get -field=vaultwarden_master_password secret/workstation/claude-users/emo": "pw", + "vault kv get -field=vaultwarden_client_id secret/workstation/claude-users/emo": "user.x", + "vault kv get -field=vaultwarden_client_secret secret/workstation/claude-users/emo": "cs", + "bw status": `{"status":"locked"}`, + "bw unlock": "SESS", + "bw get password github": "p@ss", + }} + // Use real UID so os.MkdirAll(/run/user/<uid>/homelab-bw) succeeds. 
+ uid := fmt.Sprintf("%d", os.Getuid()) + val, err := getValue(f.run, "emo", uid, getOpts{name: "github", field: "password"}) + if err != nil || val != "p@ss" { + t.Fatalf("getValue = %q, %v", val, err) + } +} diff --git a/cli/cmd_work.go b/cli/cmd_work.go new file mode 100644 index 00000000..3bf44e13 --- /dev/null +++ b/cli/cmd_work.go @@ -0,0 +1,212 @@ +package main + +import ( + "fmt" + "os" + "path/filepath" + "strings" +) + +func workCommands() []Command { + return []Command{ + {Path: []string{"work", "start"}, Tier: TierWrite, + Summary: "create a worktree + branch for a task (enter it with EnterWorktree)", Run: workStart}, + {Path: []string{"work", "land"}, Tier: TierWrite, + Summary: "merge master in, verify, push HEAD:master (run from the worktree)", Run: workLand}, + {Path: []string{"work", "clean"}, Tier: TierWrite, + Summary: "remove a task's worktree + branch (run from the main checkout)", Run: workClean}, + } +} + +// flagValue extracts `--name value` or `--name=value` from args. +func flagValue(args []string, name string) string { + for i, a := range args { + if a == name && i+1 < len(args) { + return args[i+1] + } + if strings.HasPrefix(a, name+"=") { + return strings.TrimPrefix(a, name+"=") + } + } + return "" +} + +func remotesOrEmpty(repoRoot string) []string { + r, _ := gitRemotes(repoRoot) + return r +} + +// workStart creates .worktrees/<topic> on branch <user>/<topic> off <remote>/master. 
+func workStart(args []string) error { + topic, _ := firstPositional(args) + if topic == "" { + return fmt.Errorf("usage: homelab work start <topic>") + } + cwd, _ := os.Getwd() + repoRoot, err := gitRepoRoot(cwd) + if err != nil { + return fmt.Errorf("not in a git repository: %w", err) + } + remote := preferRemote(remotesOrEmpty(repoRoot)) + if remote == "" { + return fmt.Errorf("no git remote configured in %s", repoRoot) + } + flags := cryptFlagsFor(repoRoot) + branch := currentUser() + "/" + topic + wtRel := filepath.Join(".worktrees", topic) + + ensureWorktreesIgnored(repoRoot) + if err := gitStream(repoRoot, flags, "fetch", remote); err != nil { + return fmt.Errorf("fetch %s failed: %w", remote, err) + } + if err := gitStream(repoRoot, flags, "worktree", "add", wtRel, "-b", branch, remote+"/master"); err != nil { + return fmt.Errorf("worktree add failed: %w", err) + } + wtPath := filepath.Join(repoRoot, wtRel) + fmt.Printf("homelab: created worktree %s (branch %s off %s/master)\n", wtPath, branch, remote) + fmt.Printf("homelab: enter it with the native tool: EnterWorktree(path=%q)\n", wtPath) + return nil +} + +// workLand integrates the current branch into master: fetch, merge master in, +// verify, push HEAD:master (retrying on non-fast-forward), with a feature-branch +// fallback when the direct push is rejected (e.g. branch protection). 
+func workLand(args []string) error { + verifyCmd := flagValue(args, "--verify-cmd") + cwd, _ := os.Getwd() + repoRoot, err := gitRepoRoot(cwd) + if err != nil { + return fmt.Errorf("not in a git repository: %w", err) + } + branch, err := gitOutput(repoRoot, "rev-parse", "--abbrev-ref", "HEAD") + if err != nil { + return err + } + if branch == "master" || branch == "main" { + return fmt.Errorf("refusing to land: already on %s", branch) + } + remote := preferRemote(remotesOrEmpty(repoRoot)) + if remote == "" { + return fmt.Errorf("no git remote configured in %s", repoRoot) + } + flags := cryptFlagsFor(repoRoot) + + if err := gitStream(repoRoot, flags, "fetch", remote); err != nil { + return fmt.Errorf("fetch failed: %w", err) + } + if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil { + return fmt.Errorf("merging %s/master failed — resolve conflicts then re-run `homelab work land`: %w", remote, err) + } + if err := runVerify(repoRoot, verifyCmd, containsArg(args, "--no-verify")); err != nil { + return fmt.Errorf("not landing: %w", err) + } + if err := pushWithRetry(repoRoot, flags, remote, 3); err != nil { + return landFallback(repoRoot, flags, remote, branch, err) + } + fmt.Printf("homelab: landed %s -> %s/master.\n", branch, remote) + if containsArg(args, "--no-ci-watch") { + fmt.Println("homelab: --no-ci-watch set; not waiting for CI.") + return nil + } + landed, _ := gitOutput(repoRoot, "rev-parse", "HEAD") + fmt.Fprintln(os.Stderr, "homelab: watching CI for the landed commit...") + if err := ciWatch([]string{landed}); err != nil { + return fmt.Errorf("landed, but CI did not go green: %w", err) + } + return nil +} + +// runVerify runs the explicit --verify-cmd, else auto-detects (go test). If +// neither is available it REFUSES (returns an error) unless allowSkip is set — +// landing to master unverified must be a deliberate choice (--no-verify). 
+func runVerify(repoRoot, verifyCmd string, allowSkip bool) error { + if verifyCmd != "" { + fmt.Fprintf(os.Stderr, "homelab: verify: %s\n", verifyCmd) + return runStreamingIn(repoRoot, "sh", "-c", verifyCmd) + } + if isFile(filepath.Join(repoRoot, "go.mod")) { + fmt.Fprintln(os.Stderr, "homelab: verify: go test ./...") + return runStreamingIn(repoRoot, "go", "test", "./...") + } + if allowSkip { + fmt.Fprintln(os.Stderr, "homelab: WARNING: --no-verify set — landing without verification") + return nil + } + return fmt.Errorf("no verification configured for this repo — pass --verify-cmd \"...\" or --no-verify to land without verifying") +} + +// pushWithRetry pushes HEAD:master up to attempts times. After any failed push +// (typically a non-fast-forward rejection) it fetches + merges master and retries. +func pushWithRetry(repoRoot string, flags []string, remote string, attempts int) error { + var lastErr error + for i := 0; i < attempts; i++ { + if err := gitStream(repoRoot, flags, "push", remote, "HEAD:master"); err == nil { + return nil + } else { + lastErr = err + } + if i < attempts-1 { + fmt.Fprintln(os.Stderr, "homelab: push rejected — fetching + merging master, then retrying") + if err := gitStream(repoRoot, flags, "fetch", remote); err != nil { + return err + } + if err := gitStream(repoRoot, flags, "merge", "--no-edit", remote+"/master"); err != nil { + return err + } + } + } + return fmt.Errorf("push to %s/master failed after %d attempts: %w", remote, attempts, lastErr) +} + +// landFallback pushes the feature branch when the direct master push is rejected +// (e.g. branch protection), so the work isn't lost and a PR can be opened. 
+func landFallback(repoRoot string, flags []string, remote, branch string, pushErr error) error { + fmt.Fprintf(os.Stderr, "homelab: direct push to master failed (%v)\n", pushErr) + fmt.Fprintf(os.Stderr, "homelab: falling back to pushing the feature branch %q for a PR\n", branch) + if err := gitStream(repoRoot, flags, "push", "-u", remote, branch); err != nil { + return fmt.Errorf("fallback branch push also failed: %w", err) + } + fmt.Printf("homelab: pushed %s to %s. Open a PR to land it (branch protection blocked the direct push).\n", branch, remote) + return nil +} + +// workClean removes a task's worktree and branch. Run from the main checkout. +func workClean(args []string) error { + topic, _ := firstPositional(args) + if topic == "" { + return fmt.Errorf("usage: homelab work clean <topic> (run from the main checkout)") + } + cwd, _ := os.Getwd() + repoRoot, err := gitRepoRoot(cwd) + if err != nil { + return fmt.Errorf("not in a git repository: %w", err) + } + flags := cryptFlagsFor(repoRoot) + wtRel := filepath.Join(".worktrees", topic) + branch := currentUser() + "/" + topic + + if err := gitStream(repoRoot, flags, "worktree", "remove", wtRel); err != nil { + return fmt.Errorf("worktree remove failed (uncommitted changes? run from the main checkout, not the worktree): %w", err) + } + if err := gitStream(repoRoot, flags, "branch", "-d", branch); err != nil { + fmt.Fprintf(os.Stderr, "homelab: note: could not delete branch %s (unmerged — use `git branch -D` if intended): %v\n", branch, err) + } + fmt.Printf("homelab: removed worktree %s and branch %s\n", wtRel, branch) + return nil +} + +// ensureWorktreesIgnored appends .worktrees/ to .gitignore if not already ignored. 
+func ensureWorktreesIgnored(repoRoot string) { + if _, err := gitOutput(repoRoot, "check-ignore", ".worktrees"); err == nil { + return + } + gi := filepath.Join(repoRoot, ".gitignore") + f, err := os.OpenFile(gi, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644) + if err != nil { + return + } + defer f.Close() + if _, err := f.WriteString("\n.worktrees/\n"); err == nil { + fmt.Fprintln(os.Stderr, "homelab: added .worktrees/ to .gitignore") + } +} diff --git a/cli/cmd_work_test.go b/cli/cmd_work_test.go new file mode 100644 index 00000000..af573dd6 --- /dev/null +++ b/cli/cmd_work_test.go @@ -0,0 +1,32 @@ +package main + +import "testing" + +func TestRunVerifyRefusesWhenNothingToVerify(t *testing.T) { + dir := t.TempDir() // no go.mod, no verify cmd + if err := runVerify(dir, "", false); err == nil { + t.Fatal("runVerify must refuse (error) when nothing to verify and --no-verify absent") + } + if err := runVerify(dir, "", true); err != nil { + t.Fatalf("runVerify must skip when --no-verify set, got: %v", err) + } +} + +func TestFlagValue(t *testing.T) { + cases := []struct { + args []string + name string + want string + }{ + {[]string{"--verify-cmd", "go test ./..."}, "--verify-cmd", "go test ./..."}, + {[]string{"--verify-cmd=make test"}, "--verify-cmd", "make test"}, + {[]string{"topic", "--verify-cmd", "x"}, "--verify-cmd", "x"}, + {[]string{"topic"}, "--verify-cmd", ""}, + {[]string{"--verify-cmd"}, "--verify-cmd", ""}, // no value + } + for _, c := range cases { + if got := flagValue(c.args, c.name); got != c.want { + t.Errorf("flagValue(%v, %q) = %q, want %q", c.args, c.name, got, c.want) + } + } +} diff --git a/cli/command.go b/cli/command.go new file mode 100644 index 00000000..55449788 --- /dev/null +++ b/cli/command.go @@ -0,0 +1,104 @@ +package main + +import ( + "encoding/json" + "fmt" + "sort" + "strings" +) + +// Tier classifies whether a command observes (read) or mutates (write) state. 
+// v0.1 allows everything; the tier is recorded so a classifier hook can gate +// writes later without restructuring (see docs/adr/0005). +type Tier string + +const ( + TierRead Tier = "read" + TierWrite Tier = "write" +) + +// Command is one homelab verb. Path is the token sequence that selects it, +// e.g. ["claim"] or ["tf", "plan"]. Run receives the args after the path. +type Command struct { + Path []string + Tier Tier + Summary string + Run func(args []string) error +} + +// dispatch routes args to the command whose Path is the longest matching prefix +// of args, passing the remaining args to its Run. +func dispatch(reg []Command, args []string) error { + best := -1 + bestLen := 0 + for i, c := range reg { + if len(c.Path) > len(args) { + continue + } + match := true + for j, p := range c.Path { + if args[j] != p { + match = false + break + } + } + if match && len(c.Path) >= bestLen { + best = i + bestLen = len(c.Path) + } + } + if best < 0 { + return fmt.Errorf("unknown command: %q", strings.Join(args, " ")) + } + matched := reg[best] + runErr := matched.Run(args[bestLen:]) + emitUsage(matched.name(), runErr) // best-effort usage telemetry; never affects the command + return runErr +} + +// name is the space-joined verb path, e.g. "tf plan". +func (c Command) name() string { return strings.Join(c.Path, " ") } + +// sortedByName returns a copy of reg ordered by verb path for stable output. +func sortedByName(reg []Command) []Command { + out := make([]Command, len(reg)) + copy(out, reg) + sort.Slice(out, func(i, j int) bool { return out[i].name() < out[j].name() }) + return out +} + +// manifestText renders one aligned line per command: "<path> <tier> <summary>". +// This is the cheap progressive-discovery entrypoint (see docs/adr/0004). 
+func manifestText(reg []Command) string { + cmds := sortedByName(reg) + width := 0 + for _, c := range cmds { + if n := len(c.name()); n > width { + width = n + } + } + var b strings.Builder + for _, c := range cmds { + fmt.Fprintf(&b, "%-*s %-5s %s\n", width, c.name(), c.Tier, c.Summary) + } + return b.String() +} + +// manifestJSON renders the registry as a JSON array of {command, tier, summary} +// so agents can parse the full surface in one call. +func manifestJSON(reg []Command) (string, error) { + type entry struct { + Command string `json:"command"` + Tier string `json:"tier"` + Summary string `json:"summary"` + } + entries := make([]entry, 0, len(reg)) + for _, c := range sortedByName(reg) { + entries = append(entries, entry{Command: c.name(), Tier: string(c.Tier), Summary: c.Summary}) + } + b, err := json.MarshalIndent(entries, "", " ") + if err != nil { + return "", err + } + return string(b), nil +} diff --git a/cli/command_test.go b/cli/command_test.go new file mode 100644 index 00000000..e686622d --- /dev/null +++ b/cli/command_test.go @@ -0,0 +1,73 @@ +package main + +import ( + "encoding/json" + "reflect" + "strings" + "testing" +) + +// Tracer bullet: the dispatcher must route `homelab <path...> <args...>` to the +// command whose Path is the longest matching prefix of the input tokens, and +// hand the command the remaining args. 
+func TestDispatchRoutesToLongestPrefixMatch(t *testing.T) { + var gotArgs []string + ran := "" + reg := []Command{ + {Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource", + Run: func(a []string) error { ran = "claim"; gotArgs = a; return nil }}, + {Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack", + Run: func(a []string) error { ran = "tf plan"; gotArgs = a; return nil }}, + } + + if err := dispatch(reg, []string{"tf", "plan", "vault", "--json"}); err != nil { + t.Fatalf("dispatch returned error: %v", err) + } + if ran != "tf plan" { + t.Fatalf("routed to %q, want %q", ran, "tf plan") + } + if want := []string{"vault", "--json"}; !reflect.DeepEqual(gotArgs, want) { + t.Fatalf("command got args %v, want %v", gotArgs, want) + } +} + +func TestDispatchUnknownCommandErrors(t *testing.T) { + reg := []Command{{Path: []string{"claim"}, Run: func(a []string) error { return nil }}} + if err := dispatch(reg, []string{"bogus"}); err == nil { + t.Fatal("expected error for unknown command, got nil") + } +} + +// The manifest is the progressive-discovery entrypoint: one line per command +// showing the full verb path, its tier, and summary, sorted for stable output. 
+func TestManifestTextListsEveryCommandWithTier(t *testing.T) { + reg := []Command{ + {Path: []string{"tf", "plan"}, Tier: TierRead, Summary: "plan a stack"}, + {Path: []string{"claim"}, Tier: TierWrite, Summary: "claim a resource"}, + } + out := manifestText(reg) + for _, want := range []string{"claim", "tf plan", "read", "write", "plan a stack", "claim a resource"} { + if !strings.Contains(out, want) { + t.Errorf("manifest text missing %q\n---\n%s", want, out) + } + } + // sorted: claim (c) must appear before tf plan (t) + if strings.Index(out, "claim") > strings.Index(out, "tf plan") { + t.Errorf("manifest not sorted by path:\n%s", out) + } +} + +func TestManifestJSONIsParsableAndTagged(t *testing.T) { + reg := []Command{{Path: []string{"tf", "apply"}, Tier: TierWrite, Summary: "apply a stack"}} + out, err := manifestJSON(reg) + if err != nil { + t.Fatalf("manifestJSON error: %v", err) + } + var got []map[string]string + if err := json.Unmarshal([]byte(out), &got); err != nil { + t.Fatalf("manifest JSON not parsable: %v\n%s", err, out) + } + if len(got) != 1 || got[0]["command"] != "tf apply" || got[0]["tier"] != "write" { + t.Fatalf("unexpected manifest JSON: %v", got) + } +} diff --git a/cli/homelab.go b/cli/homelab.go new file mode 100644 index 00000000..62c0c8aa --- /dev/null +++ b/cli/homelab.go @@ -0,0 +1,98 @@ +package main + +import ( + "fmt" + "strings" +) + +// version is stamped at build time via -ldflags "-X main.version=vX.Y.Z". +var version = "dev" + +// buildRegistry returns every homelab verb. New verb-groups append here. +func buildRegistry() []Command { + var reg []Command + reg = append(reg, claimCommands()...) + reg = append(reg, tfCommands()...) + reg = append(reg, workCommands()...) + reg = append(reg, k8sCommands()...) + reg = append(reg, memoryCommands()...) + reg = append(reg, ciCommands()...) + reg = append(reg, deployCommands()...) + reg = append(reg, netCommands()...) + reg = append(reg, obsCommands()...) 
+ reg = append(reg, usageCommands()...) + reg = append(reg, haCommands()...) + reg = append(reg, browserCommands()...) + reg = append(reg, vaultCommands()...) + return reg +} + +// dispatchTop handles the homelab verb surface. handled=false means the args are +// not a homelab verb, so main() falls back to the legacy -use-case path. +func dispatchTop(args []string) (handled bool, err error) { + if len(args) == 0 { + fmt.Print(usage()) + return true, nil + } + switch args[0] { + case "help", "-h", "--help": + fmt.Print(usage()) + return true, nil + case "version", "--version": + fmt.Println("homelab " + version) + return true, nil + case "manifest": + reg := buildRegistry() + if containsArg(args[1:], "--json") { + out, err := manifestJSON(reg) + if err != nil { + return true, err + } + fmt.Println(out) + return true, nil + } + fmt.Print(manifestText(reg)) + return true, nil + } + if strings.HasPrefix(args[0], "-") { + return false, nil + } + reg := buildRegistry() + if !isCommandGroup(reg, args[0]) { + return false, nil + } + return true, dispatch(reg, args) +} + +func isCommandGroup(reg []Command, group string) bool { + for _, c := range reg { + if len(c.Path) > 0 && c.Path[0] == group { + return true + } + } + return false +} + +func containsArg(args []string, want string) bool { + for _, a := range args { + if a == want { + return true + } + } + return false +} + +func usage() string { + var b strings.Builder + fmt.Fprintf(&b, "homelab %s — unified homelab operations CLI\n\n", version) + b.WriteString("Usage:\n homelab <command> [args]\n\nCommands:\n") + for _, line := range strings.Split(strings.TrimRight(manifestText(buildRegistry()), "\n"), "\n") { + if line != "" { + b.WriteString(" " + line + "\n") + } + } + b.WriteString("\n manifest [--json] list all commands (machine-readable with --json)\n") + b.WriteString(" version print version\n") + b.WriteString("\nLegacy webhook use-cases remain available via -use-case=<name>.\n") + return b.String() +} diff --git 
a/cli/k8s.go b/cli/k8s.go new file mode 100644 index 00000000..3a2d0a5d --- /dev/null +++ b/cli/k8s.go @@ -0,0 +1,138 @@ +package main + +import ( + "fmt" + "os/exec" + "strings" +) + +// kubectl helpers use the ambient kubeconfig (no per-call auth flags). + +func kubectlBase(ns string, args ...string) []string { + var full []string + if ns != "" { + full = append(full, "-n", ns) + } + return append(full, args...) +} + +func kubectlStream(ns string, args ...string) error { + return runStreamingIn("", "kubectl", kubectlBase(ns, args...)...) +} + +// kubectlCapture runs kubectl and returns trimmed stdout (for resolving pods). +func kubectlCapture(ns string, args ...string) (string, error) { + out, err := exec.Command("kubectl", kubectlBase(ns, args...)...).Output() + return strings.TrimSpace(string(out)), err +} + +// k8sTarget is the parsed `<app>` + selectors shared by the k8s verbs. +type k8sTarget struct { + app string + ns string + pod string + container string + selector string + tty bool + rest []string // passthrough flags and, after `--`, the exec command +} + +// parseK8sTarget reads `<app> [-n ns] [--pod p] [-c ctr] [-l sel] [flags] [-- cmd]`. +// The first bare token is the app; unknown flags pass through in rest. +func parseK8sTarget(args []string) k8sTarget { + t := k8sTarget{} + i := 0 + take := func() string { + if i+1 < len(args) { + i++ + return args[i] + } + return "" + } + for i = 0; i < len(args); i++ { + a := args[i] + switch { + case a == "--": + t.rest = append(t.rest, args[i+1:]...) 
+ return t + case a == "-n" || a == "--namespace": + t.ns = take() + case strings.HasPrefix(a, "--namespace="): + t.ns = strings.TrimPrefix(a, "--namespace=") + case a == "--pod": + t.pod = take() + case strings.HasPrefix(a, "--pod="): + t.pod = strings.TrimPrefix(a, "--pod=") + case a == "-c" || a == "--container": + t.container = take() + case strings.HasPrefix(a, "--container="): + t.container = strings.TrimPrefix(a, "--container=") + case a == "-l" || a == "--selector": + t.selector = take() + case strings.HasPrefix(a, "--selector="): + t.selector = strings.TrimPrefix(a, "--selector=") + case a == "--tty" || a == "-it" || a == "-ti": + t.tty = true + case !strings.HasPrefix(a, "-") && t.app == "": + t.app = a + default: + t.rest = append(t.rest, a) + } + } + return t +} + +// namespace defaults to the app name (most namespaces hold exactly one app). +func (t k8sTarget) namespace() string { + if t.ns != "" { + return t.ns + } + return t.app +} + +// objectRef is the kubectl object for logs/exec: an explicit pod, else +// deploy/<app> (kubectl resolves a pod from the Deployment). +func (t k8sTarget) objectRef() string { + if t.pod != "" { + return "pod/" + t.pod + } + return "deploy/" + t.app +} + +// --- database access (the dbaas exec pattern) --- + +type dbPlan struct { + ns string + pod string // explicit pod (e.g. mysql-standalone-0) + selector string // resolve the pod by this label when pod == "" (CNPG primary) + container string // "" = default container + argv []string // command + args to run inside the pod +} + +// planDBExec builds the in-pod command to run sql against app's database. +// PG (default): CNPG primary POD (resolved by label — pg-cluster-rw is a +// Service, not an exec target), psql -U postgres -d <db>. +// MySQL: mysql-standalone-0, password expanded from $MYSQL_ROOT_PASSWORD by the in-pod shell (the value never appears in the local kubectl argv). +// dbName defaults to app. sql empty => interactive client. 
+func planDBExec(app, dbName, sql string, mysql bool) dbPlan { + if dbName == "" { + dbName = app + } + if mysql { + inner := fmt.Sprintf(`mysql -u root -p"$MYSQL_ROOT_PASSWORD" %s`, shellQuote(dbName)) + if sql != "" { + inner += " -e " + shellQuote(sql) + } + return dbPlan{ns: "dbaas", pod: "mysql-standalone-0", argv: []string{"bash", "-c", inner}} + } + argv := []string{"psql", "-U", "postgres", "-d", dbName} + if sql != "" { + argv = append(argv, "-tAc", sql) + } + return dbPlan{ns: "dbaas", selector: "cnpg.io/instanceRole=primary", container: "postgres", argv: argv} +} + +// shellQuote single-quotes s for safe embedding in a bash -c string. +func shellQuote(s string) string { + return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'" +} diff --git a/cli/k8s_test.go b/cli/k8s_test.go new file mode 100644 index 00000000..cfa356bc --- /dev/null +++ b/cli/k8s_test.go @@ -0,0 +1,65 @@ +package main + +import ( + "reflect" + "strings" + "testing" +) + +func TestParseK8sTarget(t *testing.T) { + got := parseK8sTarget([]string{"tripit", "-n", "prod", "--pod", "x-123", "-c", "app", "-l", "k=v", "--tail=50", "--", "ls", "-la"}) + want := k8sTarget{app: "tripit", ns: "prod", pod: "x-123", container: "app", selector: "k=v", rest: []string{"--tail=50", "ls", "-la"}} + if !reflect.DeepEqual(got, want) { + t.Fatalf("parseK8sTarget =\n %+v\nwant\n %+v", got, want) + } +} + +func TestK8sTargetNamespaceDefaultsToApp(t *testing.T) { + if ns := parseK8sTarget([]string{"immich"}).namespace(); ns != "immich" { + t.Errorf("namespace() = %q, want immich", ns) + } + if ns := parseK8sTarget([]string{"immich", "-n", "dbaas"}).namespace(); ns != "dbaas" { + t.Errorf("namespace() = %q, want dbaas", ns) + } +} + +func TestK8sTargetObjectRef(t *testing.T) { + if r := parseK8sTarget([]string{"tripit"}).objectRef(); r != "deploy/tripit" { + t.Errorf("objectRef() = %q, want deploy/tripit", r) + } + if r := parseK8sTarget([]string{"tripit", "--pod", "tripit-abc"}).objectRef(); r != "pod/tripit-abc" 
{ + t.Errorf("objectRef() = %q, want pod/tripit-abc", r) + } +} + +func TestPlanDBExecPostgresDefault(t *testing.T) { + p := planDBExec("fire-planner", "", "SELECT 1", false) + // pg-cluster-rw is a Service, so the PG plan resolves the primary POD by + // label rather than naming an (un-exec-able) Service. + if p.ns != "dbaas" || p.pod != "" || p.selector != "cnpg.io/instanceRole=primary" || p.container != "postgres" { + t.Fatalf("unexpected pg target: %+v", p) + } + // db name defaults to the app; SQL passed via -tAc + joined := strings.Join(p.argv, " ") + if !strings.Contains(joined, "-d fire-planner") || !strings.Contains(joined, "-tAc") { + t.Fatalf("pg argv missing db/sql: %v", p.argv) + } +} + +func TestPlanDBExecMysqlEnvPassword(t *testing.T) { + p := planDBExec("wrongmove", "wrongmove", "SHOW TABLES", true) + if p.pod != "mysql-standalone-0" { + t.Fatalf("unexpected mysql pod: %+v", p) + } + inner := strings.Join(p.argv, " ") + // password must come from the env var, never inline + if !strings.Contains(inner, `-p"$MYSQL_ROOT_PASSWORD"`) { + t.Fatalf("mysql must use env password wrapper: %v", p.argv) + } +} + +func TestShellQuoteEscapes(t *testing.T) { + if got := shellQuote("a'b"); got != `'a'\''b'` { + t.Fatalf("shellQuote = %q", got) + } +} diff --git a/cli/main.go b/cli/main.go index 3b9fee1c..a53f7672 100644 --- a/cli/main.go +++ b/cli/main.go @@ -26,8 +26,16 @@ var ( ) func main() { - err := run() - if err != nil { + // homelab verb surface (work/tf/claim/...) is tried first; if the args are + // not a homelab verb, fall through to the legacy webhook -use-case path. 
+ if handled, err := dispatchTop(os.Args[1:]); handled { + if err != nil { + fmt.Fprintln(os.Stderr, "homelab: "+err.Error()) + os.Exit(1) + } + return + } + if err := run(); err != nil { glog.Errorf("run failed: %s", err.Error()) os.Exit(255) } diff --git a/cli/memory.go b/cli/memory.go new file mode 100644 index 00000000..286ee5bb --- /dev/null +++ b/cli/memory.go @@ -0,0 +1,103 @@ +package main + +import ( + "bytes" + "encoding/json" + "fmt" + "io" + "net/http" + "os" + "strings" + "time" +) + +// defaultMemoryURL is used when no env override is present (agents normally have +// CLAUDE_MEMORY_API_URL set by the memory hooks). +const defaultMemoryURL = "https://claude-memory.viktorbarzin.me" + +type memoryClient struct { + base string + key string + http *http.Client +} + +func firstEnv(keys ...string) string { + for _, k := range keys { + if v := os.Getenv(k); v != "" { + return v + } + } + return "" +} + +func resolveMemoryBase() string { + if b := firstEnv("CLAUDE_MEMORY_API_URL", "MEMORY_API_URL"); b != "" { + return strings.TrimRight(b, "/") + } + return defaultMemoryURL +} + +// newMemoryClient talks straight to the claude-memory HTTP API (the same backend +// the MCP wraps), so it works even when the MCP frontend is down. 
+func newMemoryClient() (*memoryClient, error) { + key := firstEnv("CLAUDE_MEMORY_API_KEY", "MEMORY_API_KEY") + if key == "" { + return nil, fmt.Errorf("no memory API key — set CLAUDE_MEMORY_API_KEY (or MEMORY_API_KEY)") + } + return &memoryClient{base: resolveMemoryBase(), key: key, http: &http.Client{Timeout: 30 * time.Second}}, nil +} + +func (c *memoryClient) do(method, path string, body interface{}) ([]byte, error) { + var r io.Reader + if body != nil { + b, err := json.Marshal(body) + if err != nil { + return nil, err + } + r = bytes.NewReader(b) + } + req, err := http.NewRequest(method, c.base+path, r) + if err != nil { + return nil, err + } + req.Header.Set("Authorization", "Bearer "+c.key) + if body != nil { + req.Header.Set("Content-Type", "application/json") + } + resp, err := c.http.Do(req) + if err != nil { + return nil, err + } + defer resp.Body.Close() + out, _ := io.ReadAll(resp.Body) + if resp.StatusCode >= 300 { + return nil, fmt.Errorf("memory API %s %s -> %d: %s", method, path, resp.StatusCode, strings.TrimSpace(string(out))) + } + return out, nil +} + +// Request bodies mirror src/claude_memory/api/models.py. 
+ +type memRecallReq struct { + Context string `json:"context"` + ExpandedQuery string `json:"expanded_query,omitempty"` + Category string `json:"category,omitempty"` + SortBy string `json:"sort_by,omitempty"` + Limit int `json:"limit,omitempty"` +} + +type memStoreReq struct { + Content string `json:"content"` + Category string `json:"category,omitempty"` + Tags string `json:"tags,omitempty"` + ExpandedKeywords string `json:"expanded_keywords,omitempty"` + Importance float64 `json:"importance"` + ForceSensitive bool `json:"force_sensitive,omitempty"` +} + +type memUpdateReq struct { + Content *string `json:"content,omitempty"` + Tags *string `json:"tags,omitempty"` + Importance *float64 `json:"importance,omitempty"` + ExpandedKeywords *string `json:"expanded_keywords,omitempty"` +} diff --git a/cli/memory_test.go b/cli/memory_test.go new file mode 100644 index 00000000..7b14ef20 --- /dev/null +++ b/cli/memory_test.go @@ -0,0 +1,51 @@ +package main + +import ( + "encoding/json" + "os" + "strings" + "testing" +) + +func TestResolveMemoryBase(t *testing.T) { + old1, old2 := os.Getenv("CLAUDE_MEMORY_API_URL"), os.Getenv("MEMORY_API_URL") + defer func() { os.Setenv("CLAUDE_MEMORY_API_URL", old1); os.Setenv("MEMORY_API_URL", old2) }() + + os.Unsetenv("CLAUDE_MEMORY_API_URL") + os.Unsetenv("MEMORY_API_URL") + if got := resolveMemoryBase(); got != defaultMemoryURL { + t.Errorf("resolveMemoryBase() = %q, want default %q", got, defaultMemoryURL) + } + os.Setenv("CLAUDE_MEMORY_API_URL", "https://m.example/") // trailing slash trimmed + if got := resolveMemoryBase(); got != "https://m.example" { + t.Errorf("resolveMemoryBase() = %q, want https://m.example", got) + } +} + +func TestMemStoreReqAlwaysSendsImportance(t *testing.T) { + b, _ := json.Marshal(memStoreReq{Content: "x", Category: "facts", Importance: 0.5}) + s := string(b) + if !strings.Contains(s, `"content":"x"`) || !strings.Contains(s, `"importance":0.5`) { + t.Fatalf("memStoreReq JSON missing fields: %s", s) + } +} 
+ +func TestMemUpdateReqOmitsUnsetFields(t *testing.T) { + tags := "a,b" + b, _ := json.Marshal(memUpdateReq{Tags: &tags}) + s := string(b) + if strings.Contains(s, "content") || strings.Contains(s, "importance") { + t.Fatalf("unset update fields must be omitted: %s", s) + } + if !strings.Contains(s, `"tags":"a,b"`) { + t.Fatalf("set field missing: %s", s) + } +} + +func TestMemRecallReqOmitsEmptyOptionals(t *testing.T) { + b, _ := json.Marshal(memRecallReq{Context: "hi"}) + s := string(b) + if strings.Contains(s, "expanded_query") || strings.Contains(s, "category") || strings.Contains(s, "limit") { + t.Fatalf("empty optionals must be omitted: %s", s) + } +} diff --git a/cli/presence.go b/cli/presence.go new file mode 100644 index 00000000..bcf054d7 --- /dev/null +++ b/cli/presence.go @@ -0,0 +1,58 @@ +package main + +import ( + "fmt" + "os" + "path/filepath" + "strings" +) + +// validPresenceKinds is the fixed label taxonomy accepted by the presence board. +var validPresenceKinds = []string{"node", "host", "stack", "service", "db", "pvc", "infra"} + +// presenceScript locates the presence CLI — homelab WRAPS it, it does not +// reimplement it. Override with HOMELAB_PRESENCE; defaults to ~/code/scripts/presence. +func presenceScript() string { + if p := os.Getenv("HOMELAB_PRESENCE"); p != "" { + return p + } + home, err := os.UserHomeDir() + if err != nil { + return "presence" + } + return filepath.Join(home, "code", "scripts", "presence") +} + +// validateLabel checks a presence label is <kind>:<name> with a known kind. +func validateLabel(label string) error { + parts := strings.SplitN(label, ":", 2) + if len(parts) != 2 || parts[0] == "" || parts[1] == "" { + return fmt.Errorf("label must be <kind>:<name> (e.g. 
stack:vault), got %q", label) + } + for _, k := range validPresenceKinds { + if parts[0] == k { + return nil + } + } + return fmt.Errorf("invalid label kind %q; valid kinds: %s", parts[0], strings.Join(validPresenceKinds, ", ")) +} + +// presenceClaim claims label on the board with a purpose note. +func presenceClaim(label, purpose string) error { + if err := validateLabel(label); err != nil { + return err + } + args := []string{"claim", label} + if purpose != "" { + args = append(args, "--purpose", purpose) + } + return runStreaming(presenceScript(), args...) +} + +// presenceRelease releases a prior claim on label. +func presenceRelease(label string) error { + if err := validateLabel(label); err != nil { + return err + } + return runStreaming(presenceScript(), "release", label) +} diff --git a/cli/presence_test.go b/cli/presence_test.go new file mode 100644 index 00000000..3d1596e1 --- /dev/null +++ b/cli/presence_test.go @@ -0,0 +1,24 @@ +package main + +import "testing" + +func TestValidateLabelAcceptsTaxonomy(t *testing.T) { + good := []string{ + "stack:vault", "service:health", "node:k8s-node1", "db:pg-cluster", + "infra:gpu-operator", "host:proxmox-1", "pvc:dbaas/data", + } + for _, l := range good { + if err := validateLabel(l); err != nil { + t.Errorf("validateLabel(%q) = %v, want nil", l, err) + } + } +} + +func TestValidateLabelRejectsBadLabels(t *testing.T) { + bad := []string{"vault", "stack:", "bogus:x", ":x", "stack", ""} + for _, l := range bad { + if err := validateLabel(l); err == nil { + t.Errorf("validateLabel(%q) = nil, want error", l) + } + } +} diff --git a/cli/probe.go b/cli/probe.go new file mode 100644 index 00000000..25d148a0 --- /dev/null +++ b/cli/probe.go @@ -0,0 +1,76 @@ +package main + +import ( + "context" + "crypto/tls" + "fmt" + "io" + "net" + "net/http" + "net/url" + "os/exec" + "strings" + "time" +) + +// internalLBIP is the dedicated Traefik LB; every internal ingress routes through it. 
+const internalLBIP = "10.0.20.203" + +// clientDialingIP returns an http.Client that dials ip for ANY host while keeping +// the URL host as SNI (so the cert matches) — the Go form of `curl --resolve +// host:443:ip`. TLS verification is skipped (these are reachability/observability +// probes, not security checks; internal .lan vhosts may serve a non-matching cert). +func clientDialingIP(ip string, timeout time.Duration) *http.Client { + d := &net.Dialer{Timeout: 8 * time.Second} + tr := &http.Transport{ + DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) { + if i := strings.LastIndex(addr, ":"); i >= 0 { + addr = ip + addr[i:] + } + return d.DialContext(ctx, network, addr) + }, + TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, + } + return &http.Client{Timeout: timeout, Transport: tr} +} + +// probeURL issues a GET and returns status code + elapsed time. +func probeURL(c *http.Client, rawurl string) (int, time.Duration, error) { + start := time.Now() + resp, err := c.Get(rawurl) + dur := time.Since(start) + if err != nil { + return 0, dur, err + } + resp.Body.Close() + return resp.StatusCode, dur, nil +} + +// lbGetBody GETs https://<host><path>?<q> through the internal LB and returns the body. +func lbGetBody(host, path string, q url.Values) ([]byte, error) { + u := "https://" + host + path + if len(q) > 0 { + u += "?" + q.Encode() + } + resp, err := clientDialingIP(internalLBIP, 20*time.Second).Get(u) + if err != nil { + return nil, err + } + defer resp.Body.Close() + body, _ := io.ReadAll(resp.Body) + if resp.StatusCode >= 300 { + return nil, fmt.Errorf("%s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body))) + } + return body, nil +} + +// dig runs `dig +short` against a resolver, optionally for a record type. 
+func dig(name, server, rrtype string) (string, error) { + args := []string{"+short", "+time=3", "+tries=1"} + if rrtype != "" { + args = append(args, rrtype) + } + args = append(args, name, "@"+server) + out, err := exec.Command("dig", args...).Output() + return strings.TrimSpace(string(out)), err +} diff --git a/cli/probe_test.go b/cli/probe_test.go new file mode 100644 index 00000000..bec4d132 --- /dev/null +++ b/cli/probe_test.go @@ -0,0 +1,49 @@ +package main + +import "testing" + +func TestQueryArg(t *testing.T) { + if got := queryArg([]string{"up"}, nil); got != "up" { + t.Errorf(`queryArg(["up"]) = %q, want "up"`, got) + } + if got := queryArg([]string{"up", "--json"}, nil); got != "up" { + t.Errorf(`--json should be dropped, got %q`, got) + } + // single quoted PromQL arrives as one token + if got := queryArg([]string{"count by (node) (up)", "--json"}, nil); got != "count by (node) (up)" { + t.Errorf(`quoted query mangled: %q`, got) + } + // value-flags and their values are skipped, query survives + vf := map[string]bool{"--since": true, "--limit": true} + if got := queryArg([]string{`{app="x"}`, "--since", "1h", "--limit", "50"}, vf); got != `{app="x"}` { + t.Errorf(`value-flag skipping failed: %q`, got) + } +} + +func TestLabelStr(t *testing.T) { + got := labelStr(map[string]string{"__name__": "up", "job": "x", "instance": "y"}) + if got != "up{instance=y,job=x}" { // __name__ extracted, rest sorted + t.Errorf("labelStr = %q", got) + } + if got := labelStr(map[string]string{"alertname": "Foo"}); got != "{alertname=Foo}" { + t.Errorf("labelStr (no __name__) = %q", got) + } +} + +func TestOneLineList(t *testing.T) { + if got := oneLineList(" "); got != "(none)" { + t.Errorf("empty = %q, want (none)", got) + } + if got := oneLineList("a\nb"); got != "a, b" { + t.Errorf("multi = %q, want 'a, b'", got) + } +} + +func TestHostOnly(t *testing.T) { + if got := hostOnly("foo.me/path"); got != "foo.me" { + t.Errorf("hostOnly = %q", got) + } + if got := 
hostOnly("foo.me"); got != "foo.me" { + t.Errorf("hostOnly = %q", got) + } +} diff --git a/cli/repo.go b/cli/repo.go new file mode 100644 index 00000000..3e0dc4f1 --- /dev/null +++ b/cli/repo.go @@ -0,0 +1,101 @@ +package main + +import ( + "os" + "os/exec" + "os/user" + "path/filepath" + "strings" +) + +// preferRemote picks the canonical remote: forgejo if present, else origin, +// else the first listed. (For infra, origin and forgejo both point at Forgejo.) +func preferRemote(remotes []string) string { + has := map[string]bool{} + for _, r := range remotes { + has[r] = true + } + switch { + case has["forgejo"]: + return "forgejo" + case has["origin"]: + return "origin" + case len(remotes) > 0: + return remotes[0] + default: + return "" + } +} + +// hasGitCryptAttr reports whether .gitattributes content enables git-crypt. +func hasGitCryptAttr(gitattributes string) bool { + return strings.Contains(gitattributes, "filter=git-crypt") +} + +// gitCryptFlags are the per-command flags that disable smudge/clean so git +// operations in a git-crypt repo don't try to decrypt (NEVER persisted to config). +func gitCryptFlags() []string { + return []string{ + "-c", "filter.git-crypt.smudge=cat", + "-c", "filter.git-crypt.clean=cat", + "-c", "filter.git-crypt.required=false", + } +} + +// gitOutput runs `git -C dir <args>` and returns trimmed stdout. +func gitOutput(dir string, args ...string) (string, error) { + cmd := exec.Command("git", append([]string{"-C", dir}, args...)...) + out, err := cmd.Output() + return strings.TrimSpace(string(out)), err +} + +func gitRepoRoot(dir string) (string, error) { + return gitOutput(dir, "rev-parse", "--show-toplevel") +} + +// gitRemotes lists configured remote names for the repo at dir. 
+func gitRemotes(dir string) ([]string, error) { + out, err := gitOutput(dir, "remote") + if err != nil { + return nil, err + } + if out == "" { + return nil, nil + } + return strings.Split(out, "\n"), nil +} + +// isGitCryptRepo reports whether the repo at repoRoot uses git-crypt. +func isGitCryptRepo(repoRoot string) bool { + b, err := os.ReadFile(filepath.Join(repoRoot, ".gitattributes")) + if err != nil { + return false + } + return hasGitCryptAttr(string(b)) +} + +// cryptFlagsFor returns the git-crypt filter flags when repoRoot is encrypted, +// else nil. These are injected per-command and never persisted. +func cryptFlagsFor(repoRoot string) []string { + if isGitCryptRepo(repoRoot) { + return gitCryptFlags() + } + return nil +} + +// gitStream runs `git [cryptFlags] -C repoRoot <args>` with live output. +func gitStream(repoRoot string, cryptFlags []string, args ...string) error { + full := append(append([]string{}, cryptFlags...), append([]string{"-C", repoRoot}, args...)...) + return runStreamingIn("", "git", full...) +} + +// currentUser returns the OS username for branch naming (<user>/<topic>). 
+func currentUser() string { + if u := os.Getenv("USER"); u != "" { + return u + } + if u, err := user.Current(); err == nil && u.Username != "" { + return u.Username + } + return "user" +} diff --git a/cli/repo_test.go b/cli/repo_test.go new file mode 100644 index 00000000..76cf21a7 --- /dev/null +++ b/cli/repo_test.go @@ -0,0 +1,37 @@ +package main + +import "testing" + +func TestPreferRemote(t *testing.T) { + cases := []struct { + in []string + want string + }{ + {[]string{"origin", "forgejo"}, "forgejo"}, + {[]string{"forgejo"}, "forgejo"}, + {[]string{"origin"}, "origin"}, + {[]string{"upstream"}, "upstream"}, + {nil, ""}, + } + for _, c := range cases { + if got := preferRemote(c.in); got != c.want { + t.Errorf("preferRemote(%v) = %q, want %q", c.in, got, c.want) + } + } +} + +func TestHasGitCryptAttr(t *testing.T) { + if !hasGitCryptAttr("*.tfvars filter=git-crypt diff=git-crypt") { + t.Error("expected git-crypt detected") + } + if hasGitCryptAttr("*.md text\n*.png binary") { + t.Error("expected no git-crypt") + } +} + +func TestGitCryptFlagsShape(t *testing.T) { + f := gitCryptFlags() + if len(f) != 6 || f[0] != "-c" || f[1] != "filter.git-crypt.smudge=cat" { + t.Fatalf("unexpected git-crypt flags: %v", f) + } +} diff --git a/cli/run.go b/cli/run.go new file mode 100644 index 00000000..22e7f17a --- /dev/null +++ b/cli/run.go @@ -0,0 +1,23 @@ +package main + +import ( + "os" + "os/exec" +) + +// runStreaming executes name with args, wiring std streams to this process so +// the caller sees live output, and returns the command's error (non-nil on +// non-zero exit — preserved so homelab's own exit code reflects the child's). +func runStreaming(name string, args ...string) error { + return runStreamingIn("", name, args...) +} + +// runStreamingIn is runStreaming with a working directory (empty = inherit). +func runStreamingIn(dir, name string, args ...string) error { + cmd := exec.Command(name, args...) 
+ cmd.Dir = dir + cmd.Stdout = os.Stdout + cmd.Stderr = os.Stderr + cmd.Stdin = os.Stdin + return cmd.Run() +} diff --git a/cli/stack.go b/cli/stack.go new file mode 100644 index 00000000..1cfdd8d0 --- /dev/null +++ b/cli/stack.go @@ -0,0 +1,54 @@ +package main + +import ( + "fmt" + "os" + "path/filepath" + "sort" + "strings" +) + +// findInfraRoot walks up from start to the infra repo root — the directory +// holding both terragrunt.hcl and a stacks/ directory. +func findInfraRoot(start string) (string, error) { + dir := start + for { + if isFile(filepath.Join(dir, "terragrunt.hcl")) && isDir(filepath.Join(dir, "stacks")) { + return dir, nil + } + parent := filepath.Dir(dir) + if parent == dir { + return "", fmt.Errorf("not inside an infra checkout (no terragrunt.hcl + stacks/ found above %s)", start) + } + dir = parent + } +} + +// resolveStack maps a bare stack name to its directory under <infraRoot>/stacks. +func resolveStack(infraRoot, name string) (string, error) { + dir := filepath.Join(infraRoot, "stacks", name) + if isDir(dir) { + return dir, nil + } + avail := listStacks(infraRoot) + return "", fmt.Errorf("stack %q not found under stacks/; available: %s", name, strings.Join(avail, ", ")) +} + +// listStacks returns the sorted names of every directory under <infraRoot>/stacks. 
+func listStacks(infraRoot string) []string { + entries, err := os.ReadDir(filepath.Join(infraRoot, "stacks")) + if err != nil { + return nil + } + var out []string + for _, e := range entries { + if e.IsDir() { + out = append(out, e.Name()) + } + } + sort.Strings(out) + return out +} + +func isFile(p string) bool { fi, err := os.Stat(p); return err == nil && !fi.IsDir() } +func isDir(p string) bool { fi, err := os.Stat(p); return err == nil && fi.IsDir() } diff --git a/cli/stack_test.go b/cli/stack_test.go new file mode 100644 index 00000000..2967dc18 --- /dev/null +++ b/cli/stack_test.go @@ -0,0 +1,52 @@ +package main + +import ( + "os" + "path/filepath" + "testing" +) + +func newInfraTree(t *testing.T, stacks ...string) string { + t.Helper() + root := t.TempDir() + if err := os.WriteFile(filepath.Join(root, "terragrunt.hcl"), []byte("# root"), 0o644); err != nil { + t.Fatal(err) + } + for _, s := range stacks { + if err := os.MkdirAll(filepath.Join(root, "stacks", s), 0o755); err != nil { + t.Fatal(err) + } + } + return root +} + +func TestFindInfraRootWalksUp(t *testing.T) { + root := newInfraTree(t, "vault") + got, err := findInfraRoot(filepath.Join(root, "stacks", "vault")) + if err != nil { + t.Fatalf("findInfraRoot error: %v", err) + } + if got != root { + t.Fatalf("findInfraRoot = %q, want %q", got, root) + } +} + +func TestFindInfraRootErrorsOutsideInfra(t *testing.T) { + if _, err := findInfraRoot(t.TempDir()); err == nil { + t.Fatal("expected error outside an infra checkout") + } +} + +func TestResolveStack(t *testing.T) { + root := newInfraTree(t, "vault", "monitoring") + dir, err := resolveStack(root, "vault") + if err != nil { + t.Fatalf("resolveStack error: %v", err) + } + if want := filepath.Join(root, "stacks", "vault"); dir != want { + t.Fatalf("resolveStack = %q, want %q", dir, want) + } + if _, err := resolveStack(root, "nonesuch"); err == nil { + t.Fatal("expected error for unknown stack") + } +} diff --git a/cli/telemetry.go 
b/cli/telemetry.go new file mode 100644 index 00000000..b0bb625a --- /dev/null +++ b/cli/telemetry.go @@ -0,0 +1,62 @@ +package main + +import ( + "bytes" + "encoding/json" + "net/http" + "os" + "strconv" + "strings" + "time" +) + +// usageJob is the Loki stream job label for homelab usage telemetry. +const usageJob = "homelab-usage" + +// emitUsage best-effort records one verb invocation to Loki for cross-user +// usage analytics. Labels are low-cardinality (job/user/verb); the line carries +// only a coarse exit flag (0 or 1) + CLI version. NEVER args, paths, flags, or secrets. It must +// never affect the command: all errors are swallowed and a tight timeout bounds +// the cost. Opt out with HOMELAB_TELEMETRY=0 (also off/false/no). +func emitUsage(verb string, runErr error) { + switch os.Getenv("HOMELAB_TELEMETRY") { + case "0", "off", "false", "no": + return + } + if verb == "" || strings.HasPrefix(verb, "usage") { + return // don't self-record the analytics reader + } + exit := 0 + if runErr != nil { + exit = 1 + } + body, err := json.Marshal(lokiPush{Streams: []lokiStream{{ + Stream: map[string]string{"job": usageJob, "user": currentUser(), "verb": verb}, + Values: [][2]string{{ + strconv.FormatInt(time.Now().UnixNano(), 10), + "exit=" + strconv.Itoa(exit) + " ver=" + version, + }}, + }}}) + if err != nil { + return + } + req, err := http.NewRequest("POST", "https://"+lokiHost+"/loki/api/v1/push", bytes.NewReader(body)) + if err != nil { + return + } + req.Header.Set("Content-Type", "application/json") + resp, err := clientDialingIP(internalLBIP, 800*time.Millisecond).Do(req) + if err != nil { + return + } + resp.Body.Close() +} + +type lokiPush struct { + Streams []lokiStream `json:"streams"` +} + +type lokiStream struct { + Stream map[string]string `json:"stream"` + Values [][2]string `json:"values"` +} diff --git a/cli/update_viktorbarzin_me.go b/cli/update_viktorbarzin_me.go index 1a693a25..c2c1d3f4 100644 --- a/cli/update_viktorbarzin_me.go +++ b/cli/update_viktorbarzin_me.go @@ -103,6 
+103,6 @@ func notifyForIPChange(oldIP, newIP net.IP) error { if err != nil { return errors.Wrapf(err, "Error reading response") } - glog.Infof("Response:", string(responseBody)) + glog.Infof("Response: %s", string(responseBody)) return nil } diff --git a/cli/usage_test.go b/cli/usage_test.go new file mode 100644 index 00000000..052e080c --- /dev/null +++ b/cli/usage_test.go @@ -0,0 +1,18 @@ +package main + +import ( + "strings" + "testing" +) + +func TestUsageQuery(t *testing.T) { + got := usageQuery("30d", "") + want := `sum by (verb) (count_over_time({job="homelab-usage"}[30d]))` + if got != want { + t.Errorf("usageQuery(30d,\"\") = %q, want %q", got, want) + } + withUser := usageQuery("7d", "emo") + if !strings.Contains(withUser, `user="emo"`) || !strings.Contains(withUser, "[7d]") { + t.Errorf("usageQuery with user missing filter/range: %q", withUser) + } +} diff --git a/cli/woodpecker.go b/cli/woodpecker.go new file mode 100644 index 00000000..b3a48c20 --- /dev/null +++ b/cli/woodpecker.go @@ -0,0 +1,191 @@ +package main + +import ( + "context" + "encoding/json" + "fmt" + "io" + "net" + "net/http" + "os" + "os/exec" + "strings" + "time" +) + +// Woodpecker is reached at ci.viktorbarzin.me but routed via the internal Traefik +// LB (mirrors the proven `curl --resolve ci.viktorbarzin.me:443:10.0.20.203`): +// we dial the LB IP while keeping SNI/Host = the hostname so the cert verifies. +const ( + wpHost = "ci.viktorbarzin.me" + wpLBIP = "10.0.20.203" +) + +type wpClient struct { + base string + token string + http *http.Client +} + +// wpToken reads WOODPECKER_TOKEN (or WP_TOKEN), else the canonical Vault path. 
+func wpToken() string { + if t := firstEnv("WOODPECKER_TOKEN", "WP_TOKEN"); t != "" { + return t + } + out, err := exec.Command("vault", "kv", "get", "-field=woodpecker_api_token", "secret/ci/global").Output() + if err != nil { + return "" + } + return strings.TrimSpace(string(out)) +} + +func newWPClient() (*wpClient, error) { + tok := wpToken() + if tok == "" { + return nil, fmt.Errorf("no woodpecker token — set WOODPECKER_TOKEN or `vault login` (reads secret/ci/global)") + } + ip := firstEnv("HOMELAB_WP_IP") + if ip == "" { + ip = wpLBIP + } + dialer := &net.Dialer{Timeout: 8 * time.Second} + tr := &http.Transport{ + DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) { + if strings.HasPrefix(addr, wpHost+":") { + addr = ip + addr[strings.LastIndex(addr, ":"):] + } + return dialer.DialContext(ctx, network, addr) + }, + } + return &wpClient{base: "https://" + wpHost, token: tok, http: &http.Client{Timeout: 20 * time.Second, Transport: tr}}, nil +} + +// getJSON GETs path into v, retrying the transient empty/5xx responses the +// Woodpecker API intermittently returns under load. 
+func (c *wpClient) getJSON(path string, v interface{}) error { + var lastErr error + for attempt := 0; attempt < 5; attempt++ { + if attempt > 0 { + time.Sleep(2 * time.Second) + } + req, _ := http.NewRequest("GET", c.base+path, nil) + req.Header.Set("Authorization", "Bearer "+c.token) + resp, err := c.http.Do(req) + if err != nil { + lastErr = err + continue + } + body, _ := io.ReadAll(resp.Body) + resp.Body.Close() + if resp.StatusCode >= 500 || len(strings.TrimSpace(string(body))) == 0 { + lastErr = fmt.Errorf("woodpecker GET %s -> %d (empty body or 5xx)", path, resp.StatusCode) + continue + } + if resp.StatusCode >= 300 { + return fmt.Errorf("woodpecker GET %s -> %d: %s", path, resp.StatusCode, strings.TrimSpace(string(body))) + } + return json.Unmarshal(body, v) + } + return lastErr +} + +type wpPipeline struct { + Number int `json:"number"` + Status string `json:"status"` + Event string `json:"event"` + Commit string `json:"commit"` + Message string `json:"message"` +} + +func (c *wpClient) recentPipelines(repoID, n int) ([]wpPipeline, error) { + var ps []wpPipeline + err := c.getJSON(fmt.Sprintf("/api/repos/%d/pipelines?per_page=%d", repoID, n), &ps) + return ps, err +} + +// findPipeline returns the pipeline for commit (prefix match), or the latest when +// commit is empty. 
+func (c *wpClient) findPipeline(repoID int, commit string) (wpPipeline, error) { + ps, err := c.recentPipelines(repoID, 25) + if err != nil { + return wpPipeline{}, err + } + if len(ps) == 0 { + return wpPipeline{}, fmt.Errorf("no pipelines for repo %d", repoID) + } + if commit == "" { + return ps[0], nil + } + for _, p := range ps { + if strings.HasPrefix(p.Commit, commit) { + return p, nil + } + } + return wpPipeline{}, fmt.Errorf("no pipeline for commit %s in the last %d", commit[:min(8, len(commit))], len(ps)) +} + +func (c *wpClient) repoID() (int, error) { + owner, repo, err := repoOwnerName() + if err != nil { + return 0, err + } + var r struct { + ID int `json:"id"` + } + if err := c.getJSON("/api/repos/lookup/"+owner+"/"+repo, &r); err != nil { + return 0, err + } + if r.ID == 0 { + return 0, fmt.Errorf("repo %s/%s not registered in woodpecker", owner, repo) + } + return r.ID, nil +} + +// repoOwnerName derives <owner>/<repo> from the cwd git remote. +func repoOwnerName() (string, string, error) { + cwd, _ := os.Getwd() + root, err := gitRepoRoot(cwd) + if err != nil { + return "", "", fmt.Errorf("not in a git repository: %w", err) + } + remote := preferRemote(remotesOrEmpty(root)) + url, err := gitOutput(root, "remote", "get-url", remote) + if err != nil { + return "", "", err + } + return parseOwnerRepo(url) +} + +// parseOwnerRepo extracts owner/repo from an https or ssh git remote URL. 
+func parseOwnerRepo(url string) (string, string, error) { + u := strings.TrimSuffix(strings.TrimSpace(url), ".git") + u = strings.TrimSuffix(u, "/") + if i := strings.Index(u, "://"); i >= 0 { + u = u[i+3:] + } + u = strings.ReplaceAll(u, ":", "/") // git@host:owner/repo -> git@host/owner/repo + parts := strings.Split(u, "/") + if len(parts) < 2 || parts[len(parts)-1] == "" || parts[len(parts)-2] == "" { + return "", "", fmt.Errorf("cannot parse owner/repo from remote %q", url) + } + return parts[len(parts)-2], parts[len(parts)-1], nil +} + +func isTerminalStatus(s string) bool { + switch s { + case "success", "failure", "error", "killed", "declined", "blocked": + return true + } + return false +} + +func isFailureStatus(s string) bool { + return s == "failure" || s == "error" || s == "killed" || s == "declined" +} + +func min(a, b int) int { + if a < b { + return a + } + return b +} diff --git a/cli/woodpecker_test.go b/cli/woodpecker_test.go new file mode 100644 index 00000000..72c73c69 --- /dev/null +++ b/cli/woodpecker_test.go @@ -0,0 +1,40 @@ +package main + +import "testing" + +func TestParseOwnerRepo(t *testing.T) { + cases := []struct{ in, owner, repo string }{ + {"https://forgejo.viktorbarzin.me/viktor/infra.git", "viktor", "infra"}, + {"https://forgejo.viktorbarzin.me/viktor/infra", "viktor", "infra"}, + {"git@github.com:ViktorBarzin/infra.git", "ViktorBarzin", "infra"}, + {"https://github.com/ViktorBarzin/tripit/", "ViktorBarzin", "tripit"}, + } + for _, c := range cases { + o, r, err := parseOwnerRepo(c.in) + if err != nil || o != c.owner || r != c.repo { + t.Errorf("parseOwnerRepo(%q) = (%q, %q, %v), want (%q, %q)", c.in, o, r, err, c.owner, c.repo) + } + } + if _, _, err := parseOwnerRepo("nonsense"); err == nil { + t.Error("expected error for unparseable remote") + } +} + +func TestStatusClassification(t *testing.T) { + for _, s := range []string{"success", "failure", "error", "killed"} { + if !isTerminalStatus(s) { + t.Errorf("%q should be 
terminal", s) + } + } + for _, s := range []string{"running", "pending"} { + if isTerminalStatus(s) { + t.Errorf("%q should not be terminal", s) + } + } + if !isFailureStatus("failure") || !isFailureStatus("error") { + t.Error("failure/error should classify as failure") + } + if isFailureStatus("success") { + t.Error("success must not classify as failure") + } +} diff --git a/docs/adr/0001-android-emulator-in-cluster.md b/docs/adr/0001-android-emulator-in-cluster.md new file mode 100644 index 00000000..aec44e9d --- /dev/null +++ b/docs/adr/0001-android-emulator-in-cluster.md @@ -0,0 +1,42 @@ +--- +status: accepted +--- + +# The Android testing environment is a privileged KVM emulator pod in-cluster + +Viktor's apps are growing Android clients (first: tripit's Capacitor shell — +see tripit ADR-0013/0014), and agents need a native Android instance to test +changes against before shipping. All K8s nodes already run with CPU type +`host`, so `/dev/kvm` works inside the cluster. + +Decision (2026-06-11): one shared **Android 16 (API 36) Google-emulator +instance** runs as a privileged pod in namespace `android-emulator` +(stack `stacks/android-emulator`), with `/dev/kvm` via hostPath, adb exposed +LAN-only on the shared MetalLB IP (10.0.20.200:5555), and a noVNC screen view +at android-emulator.viktorbarzin.lan. The SDK/system-image/AVD live on a PVC; +the image is a slim manually-built shell. + +## Considered options + +- **devvm-local docker emulator** — rejected as the durable home: shared + 24GB workstation, ~13GB free disk, per-machine, not shared across agents. +- **Dedicated Proxmox VM** — rejected: burns scarce PVE host headroom 24/7 + and adds a whole VM lifecycle for one emulator. +- **redroid (container-native Android)** — rejected: requires binder kernel + modules on every node (documented binderfs incompatibilities), max + Android 15; most invasive for the least version coverage. 
+- **budtmo/docker-android** — rejected: turnkey but capped at Android 14; + the native features driving the Android work (Live Updates, background + GPS) are Android 16 behaviors, matching the real target device. +- **/dev/kvm device plugin instead of privileged** — deferred: a new + cluster component to avoid one namespace-scoped exclude-list entry; the + exclude pattern (kured/woodpecker/frigate/changedetection) already exists. + +## Consequences + +- `android-emulator` joins the Kyverno `security_policy_exclude_namespaces` + list (privileged allowed; registry policy also bypassed in-namespace). +- adb is unauthenticated by design — the LB IP must remain LAN-only. +- Single shared instance: concurrent agent sessions share Android state; + long destructive work should presence-claim `service:android-emulator`. +- Rendering is swiftshader (CPU) — the contended T4 stays out of the path. diff --git a/docs/adr/0002-all-image-builds-off-infra-gha-ghcr.md b/docs/adr/0002-all-image-builds-off-infra-gha-ghcr.md new file mode 100644 index 00000000..24f25a6c --- /dev/null +++ b/docs/adr/0002-all-image-builds-off-infra-gha-ghcr.md @@ -0,0 +1,24 @@ +--- +status: accepted +date: 2026-06-12 +--- + +# All owned images build off-infra on GitHub Actions and live on ghcr.io + +In-cluster Woodpecker buildkit builds repeatedly hurt the homelab: registry-push load OOMKilled Forgejo (2026-06-09), buildkit→Forgejo pushes ride a flaky hairpin, build IO lands on the shared sdc HDD, and the Forgejo registry PVC sat at its 50Gi ceiling with retention stuck in DRY_RUN. We decided every owned image is built by GitHub Actions and hosted on ghcr.io, extending the tripit pilot (2026-06-09) to the whole fleet: Forgejo stays the canonical git host, a one-way push-mirror feeds a GitHub mirror, and the mirror's workflow builds, pushes, then POSTs Woodpecker's API to deploy. 
The Forgejo container registry is decommissioned as a build target — one manual cleanup pass keeps a last-known-good tag per Service, after which nothing pushes to it. + +## Considered options + +- **GHA builds pushing back into the Forgejo registry** — keeps images home and the pull path unchanged, but keeps the exact failure mode that motivated the move (Forgejo OOM under blob-push load), keeps the PVC growth, and keeps the circular dependency where the images needed to repair the cluster live inside the cluster. Rejected. +- **Per-repo in-cluster fallback builds** (the old `build-fallback.yml` pattern) — rejected in favour of a clean cut: a GitHub outage pauses image builds (running workloads are unaffected), and existing fallback files are deleted. The hedge against ghcr's "currently free" private storage ever being enforced is the visibility split (public images are permanently free) plus re-creating fallbacks if that day comes. +- **Paid builders (Docker Build Cloud, Depot)** — solve a multi-arch/persistent-cache problem this fleet doesn't have (everything is linux/amd64). Rejected. + +## Consequences + +- DR improves: images survive homelab loss, so a dead cluster can pull everything it needs to come back — the same doctrine that keeps the monorepo on GitHub ("Forgejo dies with the cluster"). +- Private ghcr pulls bypass the registry VM's pull-through cache (it can't authenticate), so cold-node pulls of private images depend on GitHub availability; public images cache normally. +- Visibility is decided per repo: public = generic tooling that passes a gitleaks/PII history scan; private = personal, financial, or legally-gray domains. A failed scan means the repo stays private — canonical history is never rewritten for publication. For interpreted languages repo visibility ≈ image visibility (the image ships the source). 
+- Only private-repo builds consume GitHub free-plan minutes (~12 builders, well under the 2,000/mo free tier; usage is reviewed after rollout wave 2 before considering Pro). +- Woodpecker becomes deploy-only; its agents never build. The Kyverno-synced `registry-credentials` stays (Forgejo git + frozen last-known-good images); a cluster-wide Kyverno-synced `ghcr-credentials` joins it. +- Builders with no live consumer (terminal-lobby, webhook-handler, hmrc-sync, trading-bot, travel-agent, trip-planner) are decommissioned rather than migrated; travel_blog is decommissioned outright (service + CI). Any revival adopts this ADR's pattern. +- Workflows build single-manifest images (`provenance: false`, linux/amd64 only) so registry retention never faces the orphaned-index-children failure class that broke Forgejo's cleanup. diff --git a/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md b/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md new file mode 100644 index 00000000..9e0e2192 --- /dev/null +++ b/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md @@ -0,0 +1,30 @@ +# Keep Forgejo as the canonical forge; complete the one-way GitHub mirror instead of swapping to GitHub + +Status: accepted (extends ADR-0002) + +## Context + +Repo trees kept diverging between the Forgejo **Canonical repo** (`viktor/<name>`) and its **GitHub mirror**. A 2026-06-15 audit found the cause: an *incomplete rollout* of the Forgejo→GitHub push-mirror, not anything inherent to Forgejo. 14 repos carry **both** remotes and are hand-pushed to each (`push_mirrors = 0` on Forgejo — e.g. `infra`, `finance`, `Website`), so a human forgets one side and the trees drift; the ADR-0002-onboarded repos have a working one-way mirror (`push_mirrors = 1` — e.g. `tripit`, `recruiter-responder`) and never diverge. 
`infra/CONTEXT.md` already says Forgejo is the only place commits land and the GitHub mirror must never be a second writable remote — practice had simply drifted from the documented model. + +The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling reframed it: the pain (divergence) is a "two writable remotes" problem, and the stated preference is self-hosted-primary with the remote as backup. + +## Decision + +Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`: + +- Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.** +- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`. +- `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge. +- Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror." + +## Considered options + +- **Swap to GitHub (retire Forgejo).** Rejected: takes on a hard WAN dependency for *all* git ops — including `infra`, the repo you use to *recover* from outages — plus git-crypt secrets on GitHub as primary, a Woodpecker forge migration (WP authenticates against and watches Forgejo), and GitHub private-repo CI-minute/size limits. 
All to fix a problem that is actually an incomplete mirror, not Forgejo's existence. Contradicts the self-hosted-primary preference. +- **GitHub canonical, Forgejo demoted to a DR pull-mirror.** Rejected for the same WAN-dependency and forge-migration cost; unnecessary once the real cause is understood. + +## Consequences + +- Divergence becomes structurally impossible — one push target per repo. +- Forgejo stays load-bearing (canonical git + the Woodpecker forge), so every cost of the swap is avoided. +- The GitHub-limits worry is neutralized: private code lives on Forgejo (unlimited, self-hosted); GitHub holds mirrors for CI + backup. (GitHub Free has unlimited private repos anyway; the real limits are GHA minutes and ~1 GB repo size — `travel_blog` at 1.4 GB is why it never went to GHA.) +- One-time remediation is required and carries a data-loss footgun: the Forgejo→GitHub mirror **force-overwrites** GitHub, so for each currently-diverged repo, any GitHub-only commits must be merged into Forgejo **before** the mirror is enabled, or they are lost. Scope: the 14 dual-push repos + the `infra` reconciliation; all other repos are already single-remote and non-diverging. diff --git a/docs/adr/0004-homelab-unified-cli.md b/docs/adr/0004-homelab-unified-cli.md new file mode 100644 index 00000000..27cce02a --- /dev/null +++ b/docs/adr/0004-homelab-unified-cli.md @@ -0,0 +1,30 @@ +# homelab: a unified infra-ops CLI grown in place from infra/cli + +Agents re-derive the same operational command boilerplate every session — mining +51,116 bash commands across 2,225 past sessions showed dense, repeated patterns +(the infra inner-loop alone is ~29%). We are building `homelab`, one CLI encoding +the deterministic, repeated **actions** (not judgment) agents run — composable in +bash, JSON-capable, and discovered progressively via `homelab manifest`. 
It is +grown **in place** in `cli/` (the existing `infra-cli`), absorbing new verb-groups +alongside the preserved legacy webhook use-cases. Versioned with a `cli/VERSION` +file (the infra repo deploys continuously and does not cut semver tags). + +## Considered options + +- **Its own top-level repo** (the original plan) — rejected in favour of keeping + it where the Terraform/Terragrunt and `scripts/tg` it drives already live; the + Go source isn't git-crypt-encrypted and a provision-time build is unaffected by + GitOps continuous-deploy. +- **A fresh CLI ignoring infra-cli** — rejected: strands the VPN/DNS/email + webhook use-cases. +- **Raw kubectl/tg/ssh + skills + MCP only** — kept for everything outside the + recurring action surface (methodology skills; third-party/owned MCP such as + phpIPAM, which homelab does NOT duplicate). + +## Consequences + +- The binary is dual-purpose: the agent-facing `homelab` verb surface AND the + in-cluster `infra-cli` webhook image. `main()` front-dispatches homelab verbs + and falls through to the legacy `-use-case` path verbatim. +- Distribution: built from source to `/usr/local/bin/homelab` during devvm + provisioning (`t3-dispatch` precedent), refreshed by `t3-autoupdate`. diff --git a/docs/adr/0005-homelab-v01-scope.md b/docs/adr/0005-homelab-v01-scope.md new file mode 100644 index 00000000..c1da7a95 --- /dev/null +++ b/docs/adr/0005-homelab-v01-scope.md @@ -0,0 +1,23 @@ +# homelab v0.1 scope: the infra inner-loop; everything allowed, tiers recorded + +v0.1 ships only the highest-volume surface — the infra inner-loop: `work` +(worktree lifecycle), `tf` (terragrunt via `scripts/tg` + fmt/validate/ +force-unlock), and `claim`/`release` (presence) — because it is ~29% of all mined +commands and where agents lose the most time and leak the most presence claims. + +v0.1 enforces **no** homelab-level permission gating: everything is allowed, +relying on existing gates (harness permission mode, presence claims, plan +approval). 
But every verb records a `read|write` tier (visible in `manifest`), so +a PreToolUse classifier hook (auto-allow reads / prompt writes) can be added +later with zero restructuring. + +## Considered options + +- **Reads-first vertical slice** (top read verb per domain) — lower risk, broad + value, but defers the toil that motivated the project. +- **One domain deep (k8s)** — cleanest template, narrow day-one value. + +We chose the highest-volume-but-write-heavy infra loop deliberately, accepting +the extra complexity (worktree lifecycle, git-crypt flag injection, presence +coupling, branch-protection PR fallback) for the biggest immediate toil +reduction. k8s/node/secret/net/ci verb-groups are deferred to later versions. diff --git a/docs/adr/0006-homelab-work-and-tf.md b/docs/adr/0006-homelab-work-and-tf.md new file mode 100644 index 00000000..fcdddc30 --- /dev/null +++ b/docs/adr/0006-homelab-work-and-tf.md @@ -0,0 +1,29 @@ +# homelab work/tf behaviour: native worktree entry, gated auto-land, presence-coupled apply + +Four behaviours of the infra-loop verbs are surprising enough to record: + +1. **`work` owns worktree create/land/clean, but session *entry* delegates to the + native harness worktree tool.** A CLI is a child process and cannot change the + agent's working directory; `EnterWorktree` can. So `homelab work start <topic>` + creates the worktree + branch off `<remote>/master` (git-crypt-aware) and + prints the path — the agent enters it with native `EnterWorktree({path})`. + +2. **`work land` is auto-land, but gated on verification.** It merges master in → + runs verification → pushes `HEAD:master` (fetch+merge+retry on + non-fast-forward) → falls back to pushing the feature branch for a PR when the + direct push is rejected (branch protection). 
It **refuses to push when it + cannot verify** (no `--verify-cmd` and no auto-detected suite) unless + `--no-verify` is passed — added after an accidental smoke-test land pushed + unverified WIP to master (benign: the infra CI applied 0 stacks because the + diff was `cli/`-only, but an unverified land must be deliberate, not default). + +3. **`tf apply` is first-class despite GitOps, and mandatorily presence-coupled.** + Local applies are out-of-band (CI applies canonically on push) but happen + constantly (~763× in the corpus). `tf apply <stack>` auto-claims `stack:<name>`, + delegates to `scripts/tg apply --non-interactive`, and **always releases on + exit** (normal, error, or signal via `sync.Once` + handler) — fixing the + documented ~200-claim leak — and prints an out-of-band reminder. + +4. **Known v0.1 limitation:** `work land` does not yet block on CI to green; that + arrives with the ci/deploy watch verb-group. It prints a reminder to follow + the pipeline manually. diff --git a/docs/adr/0007-homelab-k8s-verbs.md b/docs/adr/0007-homelab-k8s-verbs.md new file mode 100644 index 00000000..422b3431 --- /dev/null +++ b/docs/adr/0007-homelab-k8s-verbs.md @@ -0,0 +1,30 @@ +# homelab k8s verb-group: app→pod resolver, read/write split, config-mutation stays raw + +v0.2 adds the Kubernetes verb-group — the biggest remaining surface by far +(mining the post-v0.1 corpus: 11,291 `kubectl` commands across 243 sessions, more +than every other domain combined). + +It is built on an **app→namespace→pod resolver**: most namespaces hold exactly +one app, so `<app>` defaults to the namespace, and the target defaults to +`deploy/<app>` (kubectl resolves a pod from the Deployment). `-n`/`--pod`/`-c`/ +`-l`/`--tty` override; multi-pod namespaces (`dbaas`, `monitoring`) need +specificity. The CLI uses the ambient kubeconfig — no per-call auth flags. 
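The resolver defaults described above can be sketched roughly as follows. This is a minimal illustration, not the CLI's actual code: the `Overrides` type and the `Resolve` function are hypothetical names, and only the `-n` and `--pod` overrides are modelled.

```go
package main

import "fmt"

// Overrides mirrors the -n and --pod flags from the ADR text.
// The type and field names are hypothetical, not the CLI's real ones.
type Overrides struct {
	Namespace string // -n: namespace differs from the app name
	Pod       string // --pod: address one pod instead of the Deployment
}

// Resolve applies the documented defaults: the namespace defaults to the
// app name and the target defaults to deploy/<app>, so kubectl picks a pod
// from the Deployment. Explicit overrides win over both defaults.
func Resolve(app string, o Overrides) (namespace, target string) {
	namespace = app
	if o.Namespace != "" {
		namespace = o.Namespace
	}
	target = "deploy/" + app
	if o.Pod != "" {
		target = "pod/" + o.Pod
	}
	return namespace, target
}

func main() {
	// Single-app namespace: everything is derived from the app name.
	ns, tgt := Resolve("myapp", Overrides{})
	fmt.Println(ns, tgt)

	// Multi-pod namespace such as dbaas: the caller has to be specific.
	ns, tgt = Resolve("mysql", Overrides{Namespace: "dbaas", Pod: "mysql-standalone-0"})
	fmt.Println(ns, tgt)
}
```

Keeping the defaults in one pure function is what makes this kind of logic unit-testable without touching a live cluster.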
+ +Verbs: read — `status`, `get`, `logs`, `describe`, `debug` (one-shot triage), +`pf`, `rollout-status`; write/operational — `db`, `exec`, `restart`, `rm-pod`. + +## Decisions worth recording + +- **Config-mutation verbs are deliberately NOT exposed** (`apply`/`edit`/`patch`/ + `scale`/`create`). They stay raw `kubectl`, by design, per the repo's + Terraform-only policy — the corpus confirms they're low-frequency, and a + friendly verb would normalise a policy violation. +- **`rm-pod` is restricted to pods/jobs only** — deleting Deployments/STS/PVCs is + config mutation and forbidden; the verb cannot target them. +- **`db` encodes the dbaas exec pattern** (the single highest-value k8s + sub-pattern, ~886 dbaas ops): PG via `pg-cluster-rw -c postgres`, + `psql -U postgres -d <app>`; MySQL via `mysql-standalone-0` with a + `bash -c 'mysql -p"$MYSQL_ROOT_PASSWORD" …'` wrapper so the password comes from + the pod env and never appears on the command line. +- Read verbs were smoke-tested against the live cluster; write verbs are + unit-tested (resolver, db-plan, shell-quoting) but not fired at live state. diff --git a/docs/adr/0008-homelab-memory-verbs.md b/docs/adr/0008-homelab-memory-verbs.md new file mode 100644 index 00000000..60f13850 --- /dev/null +++ b/docs/adr/0008-homelab-memory-verbs.md @@ -0,0 +1,30 @@ +# homelab memory verb-group: direct HTTP client to claude-memory; MCP deprecation path + +v0.3 adds the memory verb-group so agents can search and navigate memory from the +CLI. `claude-memory` is a FastAPI service (Postgres-backed, `Bearer`-auth, +ingress `auth = "none"` so programmatic clients work) — the **MCP is just one +frontend over it**. `homelab memory` is a thin HTTP client over the same API, +using the env the hooks already set (`CLAUDE_MEMORY_API_URL` + +`CLAUDE_MEMORY_API_KEY`; defaults to the ingress). 
Because it talks to the HTTP +API directly, it **works even when the MCP frontend is down** — the recurring +MCP-disconnect problem that motivated claude-memory HA (and that took the MCP +offline for the entire session this was built in). + +Verbs: `recall` (server-side semantic ranking), `list`, `categories`, `tags`, +`stats`, `secret` (read); `store`, `update`, `delete` (write). Validated against +the live API including a store→recall→delete round-trip — full data-plane parity +with the MCP. + +## Deprecation path (deliberate follow-up — NOT done in v0.3) + +The MCP is more than tools: the **per-prompt auto-recall hook** and the +**auto-learn hook** run on every prompt for every agent. Deprecating it safely is +a separate, sequenced change: + +1. Rewire the auto-recall hook to `homelab memory recall` and the auto-learn hook + to `homelab memory store`. +2. Update the CLAUDE.md memory policy to point at the CLI. +3. Uninstall the MCP. + +Done CLI-first (verbs proven before touching the every-prompt path) so a +regression can't silently break auto-recall/auto-learn fleet-wide. diff --git a/docs/adr/0009-homelab-ci-deploy-verbs.md b/docs/adr/0009-homelab-ci-deploy-verbs.md new file mode 100644 index 00000000..51399997 --- /dev/null +++ b/docs/adr/0009-homelab-ci-deploy-verbs.md @@ -0,0 +1,29 @@ +# homelab ci/deploy verbs: API-based watch, internal-LB dialer, work-land integration + +v0.4 adds `ci`/`deploy` — the biggest *reasoning* sink in agent sessions (watching +a build/deploy to completion), proven during the session that built it (hours +spent hand-rolling Woodpecker API polling, DB-schema reverse-engineering, and +retrigger logic for a single CI incident). + +## Decisions + +- **API, not DB.** The verbs query the Woodpecker REST API (version-stable), + not its Postgres schema (which drifts across upgrades — column renames bit us + mid-incident). 
Reached via the internal Traefik LB by dialing `10.0.20.203` + while keeping SNI/Host = `ci.viktorbarzin.me` so the cert verifies (the Go + equivalent of the house `curl --resolve` pattern). Token from + `WOODPECKER_TOKEN` or Vault `secret/ci/global`; repo id resolved from the cwd + git remote via `/api/repos/lookup/<owner>/<repo>`. +- **Retries are mandatory.** The Woodpecker API intermittently returns empty/5xx + under load (it flapped through the whole build session); `getJSON` retries + empties with backoff so `ci watch` is reliable exactly when it's needed. +- **`work land` now waits for CI.** After pushing, `work land` calls `ci watch` + on the landed commit and fails if the pipeline does — closing the gap ADR-0005 + deferred. `--no-ci-watch` opts out. +- **`deploy wait` encodes the "rollout status lies" rule:** it first waits for + the deployment image to reference the expected sha, *then* blocks on rollout + status (kubectl-based; reuses the k8s helpers). +- **`ci logs` deferred to v0.4.1.** Woodpecker's per-pipeline detail/log + endpoints were the least reliable this session (often empty); `status`/`watch` + rely on the list endpoint that works. A DB-backed `ci logs` is a possible + follow-up if the API path stays flaky. diff --git a/docs/adr/0010-homelab-net-obs-verbs.md b/docs/adr/0010-homelab-net-obs-verbs.md new file mode 100644 index 00000000..29a94a46 --- /dev/null +++ b/docs/adr/0010-homelab-net-obs-verbs.md @@ -0,0 +1,37 @@ +# homelab net/dns/metrics/logs verbs: endpoint resolution as the unit of value + +v0.5 adds `net`/`dns`/`metrics`/`logs`. These were chosen against an explicit +test the user posed mid-build: *does the verb save reasoning, or only typing?* A +wrapper over a command already known fluently (plain `ssh`, `vault kv get`) saves +keystrokes but not thought. These four save thought — the reasoning they encode +is **which endpoint, reached how, with what auth/URL shape** — re-derived every +time otherwise. 
(That same test deprioritized `node ssh` aliasing and `secret +get`, which are thin wrappers; see the session discussion.) + +## Decisions + +- **Internal ingresses, reached via the LB.** Everything routes through the + Traefik LB by dialing `10.0.20.203` with the URL host preserved as SNI — the + Go form of the house `curl --resolve host:443:10.0.20.203` pattern + (`probe.go: clientDialingIP`). Verified live before building: Prometheus + (`prometheus-query.viktorbarzin.lan`) and Loki (`loki.viktorbarzin.lan`) both + answer JSON over the LB with **no auth gate and no port-forward** — so these + stay clean HTTP clients, not kubectl wrappers. +- **`net check` is two-legged on purpose.** It resolves the host via public DNS + (→ Cloudflare) AND dials the internal LB, reporting both — because the useful + question is *where* a break is (CF edge vs the app vs the LB path), which a + single curl can't answer. The external leg forces public resolution (the devvm + resolver is split-horizon and would otherwise hit the LB for both). +- **`metrics alerts` uses the `ALERTS` series, not `/api/v1/alerts`.** + `prometheus-query.*` is a query-only frontend (404 on `/api/v1/alerts`), and + Alertmanager has no LB ingress (the alert-digest reads it in-cluster). Firing + alerts are exposed as the synthetic `ALERTS{alertstate="firing"}` time series, + queryable through the working endpoint — so no new dependency. +- **Deliberately NOT built:** in-cluster-only endpoints (Alertmanager v2, + raw `*.svc` services) that would force port-forward/`kubectl run`. The + reasoning-savings there don't beat the added moving parts; kept out of scope. +- **No `node`/`secret` group.** Same test: their high-volume parts are + command-wrappers (low savings); only compound node ops (serial console, VM + wait, fan-out) would qualify, and those are lower-frequency. Left unbuilt + unless a concrete pain surfaces — the high-value deterministic surface + (tf/work/ci/k8s/memory + these probes) is now covered. 
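The "dial the LB, keep the URL host as SNI" pattern from the first decision can be sketched in Go roughly as below. This is an illustrative sketch, not the contents of `probe.go`: `lbAddr` and `newLBClient` are hypothetical names. It relies on the standard `net/http` behaviour that, when no `ServerName` is configured, the TLS server name and the `Host` header are taken from the request URL, so only the TCP destination needs to change.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// lbAddr swaps the host part of "host:port" for the load-balancer IP and
// keeps the port. Hypothetical helper, named for illustration only.
func lbAddr(addr, lbIP string) (string, error) {
	_, port, err := net.SplitHostPort(addr)
	if err != nil {
		return "", err
	}
	return net.JoinHostPort(lbIP, port), nil
}

// newLBClient returns an HTTP client whose TCP connections always go to
// lbIP. The request URL is left untouched, so the certificate is still
// verified against the URL host, much like curl --resolve.
func newLBClient(lbIP string, timeout time.Duration) *http.Client {
	dialer := &net.Dialer{Timeout: timeout}
	transport := &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			target, err := lbAddr(addr, lbIP)
			if err != nil {
				return nil, err
			}
			return dialer.DialContext(ctx, network, target)
		},
	}
	return &http.Client{Transport: transport, Timeout: timeout}
}

func main() {
	// Only the address rewrite is exercised here; no network call is made.
	addr, err := lbAddr("loki.viktorbarzin.lan:443", "10.0.20.203")
	if err != nil {
		panic(err)
	}
	fmt.Println(addr) // prints "10.0.20.203:443"

	_ = newLBClient("10.0.20.203", 5*time.Second)
}
```

A caller would then issue ordinary requests against the real hostname (for example `client.Get("https://loki.viktorbarzin.lan/...")`), and every connection would land on the LB regardless of what the local resolver returns.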
diff --git a/docs/adr/0011-homelab-usage-telemetry.md b/docs/adr/0011-homelab-usage-telemetry.md new file mode 100644 index 00000000..c383211b --- /dev/null +++ b/docs/adr/0011-homelab-usage-telemetry.md @@ -0,0 +1,34 @@ +# homelab usage telemetry: evidence-driven verb prioritization, privacy by construction + +v0.6 adds `usage top` plus a fire-and-forget emit on every dispatched verb. It +exists to answer the question that drove the whole CLI — *which verbs are worth +adding next* — with data instead of one maintainer's habits (the earlier mining +covered a single user's ~51k commands, so the surface is shaped to that user). + +## Decisions + +- **Emit on dispatch, in `dispatch()`.** The longest-prefix match already knows + the verb path; after `Run` returns we emit `{verb, exit}`. Discovery verbs + don't go through `dispatch()` (`manifest`/`version`/`help` are handled in + `dispatchTop`), so they don't self-record; `usage *` is skipped explicitly so + the analytics reader doesn't pollute its own data. +- **Payload is deliberately minimal: verb path + exit code only.** Labels + `{job=homelab-usage, user, verb}` (all low-cardinality) + line `exit=N ver=X`. + **No args, paths, flags, hostnames, or secrets** ever leave the process — the + emit sees only the matched verb name, not the arguments. This is what makes + cross-user aggregation safe. +- **Shared Loki sink → cross-user analytics WITHOUT reading homes.** Each user's + CLI writes its own invocations (attributed to its OS user) to the shared Loki + push API via the Traefik LB (verified: HTTP 204, no auth). `usage top` reads + back with a LogQL metric query. This is the privacy-preserving resolution to + "what does everyone (e.g. 
another user) use" — it never touches anyone's + `~/.claude`, which the org per-user policy bars (see the per-user red-line in + managed-settings; reading another user's home is off-limits even for an owner + in-session — a fresh session under changed MDM policy is the only legitimate + path, and even then this telemetry is the better answer). +- **Best-effort, never affects the command.** All errors swallowed; an 800ms + client timeout bounds the cost; opt-out via `HOMELAB_TELEMETRY=0`. Telemetry + must never slow or break the tool it measures. +- **Loki, not a new datastore.** Zero new infra, and it dogfoods the v0.5 `logs` + path (same host, same LB dial). Presence MySQL was the alternative (queryable + SQL) but would add a write dependency and creds; Loki needs neither. diff --git a/docs/adr/0012-homelab-ha-verbs.md b/docs/adr/0012-homelab-ha-verbs.md new file mode 100644 index 00000000..379f8ee5 --- /dev/null +++ b/docs/adr/0012-homelab-ha-verbs.md @@ -0,0 +1,54 @@ +# homelab Home Assistant verbs: token resolution + host SSH, not entity control + +v0.7 adds `ha token` and `ha ssh`. They were chosen by mining a heavy HA +operator's sessions: across ~1,900 shell commands the single most-repeated line +(420×) was a hand-rolled `kubectl … | base64 -d | python -c '…token'` pipeline, +and a bespoke `ssh -o StrictHostKeyChecking=no -o …` invocation was redefined as +a shell function ~30× — both re-derived from scratch every session. The existing +`home-assistant-sofia.py` already covers the *API*, but it goes unused from an +arbitrary cwd (it needs `HOME_ASSISTANT_SOFIA_TOKEN` set and is referenced by a +cwd-relative path), so agents bypassed it. A global verb on `$PATH` closes that +gap for every user in every directory. + +## Decisions + +- **Only the two gaps the `ha` MCP can't fill.** The `ha` MCP server already + does entity state and control (`get_state`, `call_service`, history, logs). 
+ Per the CLI's founding rule — *MCP-encoded actions are out of scope* (ADR-0004) + — we do **not** reimplement `on`/`off`/`list`/`state`. We add only token + *resolution* and host *SSH*, neither of which an API-only MCP can provide. The + value is endpoint/secret/host resolution, exactly like `net`/`dns` (ADR-0010). +- **`ha token` resolves live from the cluster, not from an env var.** It reads + the dedicated k8s Secret `openclaw/ha-tokens` (one key per instance: `sofia` / + `london`) via the ambient kubeconfig. This is robust to env drift — the precise + failure that made agents re-derive the pipeline. Read-tier, prints the bare + token to stdout so it composes in `$(…)`, mirroring `memory secret`. +- **The token is split into its own least-privilege secret** (`stacks/openclaw/ha_tokens.tf`). + It was originally read from `openclaw-secrets` → `skill_secrets` (a JSON blob + also holding `slack_webhook` + `uptime_kuma_password`), which only cluster + admins can read — so the verb hung/failed for the non-admin operator it was + built for (emo = `emil.barzin@gmail.com`, group `Home Server Admins`, whose + OIDC identity is barred from secrets in `openclaw`). `ha-tokens` carries only + the HA tokens, with a Role+RoleBinding granting `get` on *just that secret* to + the `Home Server Admins` group (k8s RBAC can't scope to a JSON sub-key, hence + the separate object). openclaw's own deployment keeps reading `openclaw-secrets` + — this is purely additive. +- **`ha ssh` is deterministic and per-user.** Flags are fixed for unattended + use: `-F /dev/null` (ignore user ssh-config), `StrictHostKeyChecking=no` + + `UserKnownHostsFile=/dev/null` (no host-key prompt/record — agents have no + TTY), `BatchMode=yes` + `ConnectTimeout=10` (fail fast, never hang). The key + is the **invoking user's** `~/.ssh/id_ed25519`, so the verb isn't tied to + whoever first wrote the workflow; that user's key must be enrolled on the HA + host. Write-tier (runs an arbitrary remote command). 
+- **sofia is the default; london is structural.** The devvm sits on the Sofia + LAN, so `vbarzin@192.168.1.8` is reachable and is the default instance. london + (`hassio@192.168.8.103`) is in the instance map so `ha token --instance london` + works (a pure secret read), but `ha ssh --instance london` generally won't + connect from here — london is remote. We model it correctly rather than + pretend it's reachable. +- **Scope held at two verbs.** `ha api` (an authenticated curl passthrough for + the endpoints the MCP/script don't cover — `/api/template`, `/reload`, + `check_config`, `/error_log`) was deferred: once `ha token` exists, raw curl is + already unblocked, and a generic passthrough overlaps the MCP. Re-measure via + `usage top` (ADR-0011); add targeted sugar verbs only if those endpoints are + still hand-rolled often. diff --git a/docs/adr/0013-homelab-browser-verbs.md b/docs/adr/0013-homelab-browser-verbs.md new file mode 100644 index 00000000..bba4e8e7 --- /dev/null +++ b/docs/adr/0013-homelab-browser-verbs.md @@ -0,0 +1,75 @@ +# homelab browser verbs: headful (anti-bot) web automation via cluster Chrome + +v0.8 adds `browser run`, `browser open`, and `browser --help`. They package a +capability that already existed but was undiscoverable: driving the cluster's +**headful** Chrome (`chrome-service` — real Chrome under Xvfb, CDP on +`svc/chrome-service:9222`) from the devvm, for sites that detect and block +headless automation. + +## Motivating incident (2026-06-22) + +Logging a washing-machine repair on the Stirling Ackroyd **Fixflo** tenant +portal: the headless `@playwright/mcp` browser loaded the site and filled the +entire multi-step form, but the **final submit silently failed** — Fixflo's +pre-submit `POST /IssuePreCreationCheck` returned `net::ERR_FILE_NOT_FOUND`, the +spinner hung, no issue was created. Root cause = headless-Chrome detection. 
The +fix was to drive the headful `chrome-service` over `connect_over_cdp` — it +submitted first try (Fixflo ref IS22657587). That capability was documented +(`docs/architecture/chrome-service.md`) but **not packaged or discoverable**, so +it took ~40 min, three redundant full form re-runs, and a user hint. The agent +also misread `ERR_FILE_NOT_FOUND` as "network egress" and retried blind instead +of inspecting the network panel. + +## Decisions + +- **Mechanics in `homelab`, not a `~/.claude` skill.** A standalone skill was + rejected: the CLI is run every session (so the verb is *discoverable*), is + versioned, multi-user, and test-covered. A private, untested skill is none of + those. The command owns only the deterministic *mechanics* (port-forward, + stealth injection, lifecycle) — the agent supplies the Playwright script, so + *judgment* stays out of the CLI (the founding rule, ADR-0004/0005). +- **The failure was judgment, not setup friction**, so the CLI is paired with a + one-line pointer in always-in-context `~/code/CLAUDE.md` and a diagnostic + payload in `browser --help`: the *when-to-use* signature (a site loads but a + gated action fails/hangs, or one request 500s/aborts while siblings 200 → + suspect headless detection) and an error-code cheat-sheet (`ERR_FILE_NOT_FOUND` + = request resolved/intercepted by the automation layer, **not** egress; + egress failures are `ERR_CONNECTION_REFUSED`/`_TIMED_OUT`/`_NAME_NOT_RESOLVED` + and would break the page load too). A command the agent doesn't think to run is + useless; the cheat-sheet is the actual fix for the misdiagnosis. +- **Reach the pod via `kubectl port-forward`, then `connect_over_cdp` to + localhost.** port-forward tunnels API-server→pod, so it **bypasses the `:9222` + NetworkPolicy** that gates in-cluster callers — the devvm needs no namespace + label. Readiness is asserted against `/json/version`: the endpoint must report + a real `Chrome/…`, never `HeadlessChrome` (the whole point). 
The forward is + **always** torn down (process-group kill + signal handler), on success and on + error — an acceptance requirement. +- **Default to a fresh incognito context; `--shared-context` opts into the warmed + profile.** chrome-service is a single shared browser with a persistent profile. + A fresh, always-closed context is safe for concurrent callers (tripit's fare + scrape connects per-quote) and is what production already does. The warmed + persistent profile (cookies from a manual noVNC login) is opt-in for flows that + need a pre-logged-in session. +- **Pin the node CDP client to `playwright-core@1.48.2`** to match the + chrome-service image minor (`mcr.microsoft.com/playwright:v1.48.0-noble`, + Chromium 130). `connect_over_cdp` speaks the browser's CDP, and protocol + changes between Playwright minors — the devvm's ambient Python Playwright was + 1.58, a 10-minor skew. The pin makes behaviour deterministic across the fleet + regardless of local drift. `playwright-core` (not `playwright`) because no + browser binary is needed — we connect to the remote one. +- **Self-provision the client lazily, no per-user setup.** The pinned client is + installed once into `~/.cache/homelab/browser-client/` (idempotent, version- + guarded) on first use, alongside the embedded runner + stealth files. node is + already fleet-wide; this avoids coupling the feature to a provisioner change + and keeps it self-contained and self-healing. The client runs on the devvm, so + `setInputFiles` streams local files to the remote browser over CDP — no + `chmod`/staging-dir workaround on the CDP path. +- **Vendor `stealth.js`, guard against drift.** The CLI embeds a byte-for-byte + copy of `stacks/chrome-service/files/stealth.js` (the source of truth the + in-cluster callers use) via `go:embed`; a unit test fails if the copy drifts. + `go:embed` can't reach outside the package dir, hence the vendored copy rather + than a path reference. 
+- **Scope held at two action verbs + help.** `run` (arbitrary script — the + workhorse) and `open` (navigate + title/text/screenshot — a quick check) cover + the surface. Both are write-tier; the bare `browser`/`--help` is read. Re-measure + via `usage top` (ADR-0011) before adding more. diff --git a/docs/adr/0014-service-identity-and-east-west-observability.md b/docs/adr/0014-service-identity-and-east-west-observability.md new file mode 100644 index 00000000..5eb1c83a --- /dev/null +++ b/docs/adr/0014-service-identity-and-east-west-observability.md @@ -0,0 +1,29 @@ +--- +status: accepted +date: 2026-06-24 +--- + +# Service identity is namespace + label; east-west observability via Calico Goldmane; no service mesh + +As the Service count grows we want an audit-grade record of which Service talks to which — the "service mesh evaluation" `docs/plans/2026-04-20-infra-audit-design.md` flagged as never done ("worth a design doc even if the answer is no, too much complexity for the gain"). We evaluated the full design space against two constraints: the trust model is a single-tenant cluster needing **attribution-grade** forensics (reconstruct events in a cluster we trust), not cryptographic non-repudiation against a hostile pod; and we are acutely **etcd-constrained** (we removed VPA/Goldilocks for exactly this, and carry open beads `code-oflt`/`code-at4f` on etcd starvation). Decision: **service identity = the workload's namespace** (primary; Goldmane stamps it natively and "one Service ≈ one namespace" holds for ~87 of our namespaces), refined by an explicit `service-identity` label only in the few genuinely multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`). **East-west observability = Calico 3.30 Goldmane + Whisker** (already in our Calico v3.30.7, currently `enabled = false` in `stacks/calico/main.tf`), with Goldmane's emitter shipping flows to **Loki** for a durable trail. 
**Enforcement reuses the existing Wave 1 observe-then-enforce egress track**, now selecting on namespace/label and fed by Goldmane's allow/deny + policy-trace flows. We explicitly **reject** a service mesh, mTLS, SPIFFE/SPIRE, and dedicated per-Service ServiceAccounts for now. + +## Considered options + +- **Dedicated per-Service ServiceAccount as the identity primitive** — initially chosen, then reversed. 56% of pods (257/458) run as `default`, so it is a ~116-stack rollout; and Goldmane (the chosen flow source) carries pod/namespace/workload **labels but no ServiceAccount field**, so SAs would not even reach the audit trail without a custom pod→SA mapping. The cheaper, etcd-inert path (namespace is free; a handful of static labels) delivers the same attribution. Deferred until identity-aware NetworkPolicy needs a principal finer than namespace/label, or mTLS is adopted. +- **Service mesh (Istio / Linkerd / Cilium-mesh) + mTLS + SPIFFE/SPIRE** — the only thing that makes the trail cryptographically non-repudiable against a hostile pod. The trust model does not justify it, east-west stays single-tenant plaintext, and it is precisely the "too much complexity for the gain" the audit doc predicted. Rejected. +- **Microsoft Retina (CNI-agnostic eBPF)** — more capable (DNS, drops, Hubble UI) and GA, runs on Calico without a CNI change. But identity-rich mode writes **one `RetinaEndpoint` CRD per pod to etcd** (continuous, pod-proportional churn — the exact axis we guard), and it is metrics-first, not log-first (no per-flow Loki records without custom glue). Rejected for this use case; noted as the fallback if DNS/drop-level detail is ever needed. +- **Cilium Hubble** — reads Cilium's eBPF datapath maps; unusable on Calico without migrating the CNI. A CNI migration is not justified. Rejected. +- **Kiali** — builds its graph entirely from an Istio mesh's Prometheus telemetry; no mesh, no graph. Rejected. 
+- **Custom Grafana Alloy enrichment exporter over raw iptables-`LOG` flow lines** — Alloy has no IP→identity dictionary-lookup primitive (`loki.process` lacks a lookup stage; `k8sattributes` can't do per-line/dual-IP association), so this is a multi-day custom build that also has to beat pod-IP churn. Goldmane delivers identity-stamped flows natively and obviates it. Rejected. +- **Kyverno generate+mutate to provision/assign identity** — rejected on etcd grounds: background scans + PolicyReports + UpdateRequests are continuous writes, the VPA-class cost we shed. Identity stays static. + +## Consequences + +- **No etcd cost from the flow plane.** Goldmane streams flows from Felix (the existing `calico-node` DaemonSet) over gRPC into a ~60-minute in-memory ring buffer — nothing written to etcd or the K8s API. Steady-state cost is two Deployments (`goldmane`, `whisker`) + RAM/CPU on the goldmane pod. +- **The ring buffer is not a trail.** Durable, queryable history depends on the emitter→Loki path (reuse the 90-day security-stream retention); on a Goldmane restart the in-memory window is lost. +- **Goldmane is tech-preview** in OSS Calico 3.30 — the main risk. Enabling it is a reversible toggle in `stacks/calico/main.tf`, but the toggle interacts with the operator-managed Installation CR (only namespaces are TF-adopted today; verify how Goldmane/Whisker are enabled before applying). +- **Attribution is namespace-grained for free** across ~87 single-Service namespaces. Multi-Service namespaces (`monitoring`, `kube-system`, `dbaas`) need a `service-identity` label to disambiguate; most are platform/infra and already on the Wave 1 enforcement exclude list. +- **The trail is attribution-grade, not cryptographic.** It reliably reconstructs events in a trusted cluster but cannot prove identity against a pod that spoofs its source — an accepted limit of the trust model. This ADR does not change the east-west encryption posture (still plaintext, no mTLS). 
+- **Enforcement gains a better data source.** Goldmane's allow/deny + policy-trace flows build the Wave 1 empirical egress allowlist faster than the current iptables-`LOG`→journald→Loki path, and policies select on namespace/label with no SA dependency. +- **New ubiquitous language** recorded in `CONTEXT.md`: **Service identity** and **Goldmane / Whisker**. +- **Revisit triggers:** adopt dedicated per-Service SAs if identity-aware NetworkPolicy needs a principal finer than namespace/label, or if mTLS is ever required; reconsider Retina if DNS/drop-level flow detail becomes necessary. diff --git a/docs/architecture/authentication.md b/docs/architecture/authentication.md index 6806cd35..9decc8dc 100644 --- a/docs/architecture/authentication.md +++ b/docs/architecture/authentication.md @@ -40,10 +40,10 @@ graph TB | Component | Version | Location | Purpose | |-----------|---------|----------|---------| -| Authentik Server | 2026.2.2 | `stacks/authentik/` | Core IdP application servers (2 replicas) | +| Authentik Server | 2026.2.2 | `stacks/authentik/` | Core IdP application servers (3 replicas) | | Authentik Worker | 2026.2.2 | `stacks/authentik/` | Background task processors (2 replicas) | | PgBouncer | Latest | `stacks/authentik/` | PostgreSQL connection pooler (3 replicas) | -| Embedded Outpost | - | Built into Authentik | Forward auth endpoint for Traefik | +| Embedded Outpost | - | Standalone deployment, managed by Authentik | Forward auth endpoint for Traefik (2 replicas, PG-backed sessions) | | Traefik ForwardAuth | - | `modules/kubernetes/ingress_factory/` | Middleware attached when `auth = "required"` or `"public"` | | Vault OIDC Method | - | `stacks/vault/` | Human SSO authentication to Vault | | Vault K8s Auth | - | `stacks/vault/` | Service account JWT authentication | @@ -64,15 +64,36 @@ Services pick an auth tier via the `auth` enum on the `ingress_factory` module ( When `auth = "required"`, an unauthenticated request flows: 1. 
Request hits Traefik ingress -2. ForwardAuth middleware calls Authentik embedded outpost -3. Authentik checks for valid session cookie +2. ForwardAuth middleware calls the `auth-proxy` nginx (basicAuth fallback when Authentik is down), which proxies to the Authentik embedded outpost over a keepalive connection pool +3. Authentik checks for valid session cookie (domain-level `authentik_proxy_*` cookie on `.viktorbarzin.me`, 4-week validity — one cookie covers all forward-auth apps) 4. If missing/invalid, redirects to Authentik login page (authentik.viktorbarzin.me) -5. User authenticates via social provider (Google/GitHub/Facebook) +5. User authenticates on a **single screen**: username + password together (the identification stage embeds the password stage), or a social provider button (Google/GitHub/Facebook), then MFA validation 6. Authentik creates session, sets cookie, redirects back to original URL 7. Subsequent requests include session cookie, pass auth check, reach backend Authentik adds authentication headers (user, email, groups) to forwarded requests. These headers are stripped before reaching the backend to prevent confusion. +### First-time signin performance (2026-06-10) + +Signin latency is dominated by screen count and round trips, not server time +(DB avg 1.6ms). Standing decisions: + +- **Single-screen login**: the identification stage carries `password_stage`, + so username+password is one round trip. The separate password-stage binding + was removed from `default-authentication-flow` (required by authentik when + embedding). Pinned in TF: `authentik_stage_identification.default_identification`. +- **Implicit consent everywhere**: all OIDC providers are first-party, so none + use the explicit-consent flow (it re-prompted every 4 weeks per app). 
+- **Live tuning via `server.env`/`worker.env`** (the `authentik.*` Helm values + are inert due to `existingSecret`): 3 gunicorn workers, 30m flow-plan cache, + 15m policy cache, 60s persistent DB connections. +- **Static assets cached immutable**: `/static` ingress carve-out adds + `Cache-Control: public, max-age=31536000, immutable` (assets are + version-fingerprinted; authentik itself sends no max-age). +- **Outpost**: 2 replicas, `log_level=info` (was 1 replica at `trace`). +- **auth-proxy nginx**: upstream `keepalive 32` + HTTP/1.1 — no per-request + TCP setup on the forward-auth subrequest path. + **Anti-exposure guard**: every `auth = "app"` or `auth = "none"` line MUST have a preceding `# auth = "<tier>": <reason>` comment documenting what gates the backend (for `"app"`) or why the endpoint is intentionally public (for `"none"`). The convention is enforced by `scripts/check-ingress-auth-comments.py`, which `scripts/tg` runs on every `plan/apply/destroy/refresh` and blocks the terragrunt invocation if violated. Stack-scoped — each stack documents itself. ### Social Login & Invitation Flow diff --git a/docs/architecture/automated-upgrades.md b/docs/architecture/automated-upgrades.md index 5d8b1c9e..c0200d84 100644 --- a/docs/architecture/automated-upgrades.md +++ b/docs/architecture/automated-upgrades.md @@ -4,7 +4,7 @@ This doc covers three independent automation paths: 1. **Service-level upgrades** — Container image bumps for OSS apps (DIUN → n8n → claude-agent → Terraform). Most of this doc. 2. **OS-level upgrades on K8s nodes** — `unattended-upgrades` + `kured` with sentinel-gate + Prometheus halt-on-alert. See "K8s Node OS Upgrades" section and the runbook at `docs/runbooks/k8s-node-auto-upgrades.md`. -3. **K8s component version upgrades** (kubeadm/kubelet/kubectl) — weekly detection CronJob → chain of phase Jobs (preflight → master → worker × 4 → postflight). See "K8s Version Upgrades" section and the runbook at `docs/runbooks/k8s-version-upgrade.md`. 
+3. **K8s component version upgrades** (kubeadm/kubelet/kubectl) — daily detection CronJob → chain of phase Jobs (preflight → master → one worker Job per worker, enumerated live → postflight). See "K8s Version Upgrades" section and the runbook at `docs/runbooks/k8s-version-upgrade.md`. ## Overview @@ -252,7 +252,7 @@ kubeadm/kubelet/kubectl bumps (patch + minor) on all 5 K8s VMs. ### Architecture ``` -k8s-version-check CronJob (Sun 12:00 UTC, k8s-upgrade ns) +k8s-version-check CronJob (23:00 UTC nightly, k8s-upgrade ns) │ probe apt-cache madison kubeadm (master) → latest available patch │ probe HEAD https://pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor? │ push k8s_upgrade_available metric to Pushgateway @@ -262,20 +262,26 @@ envsubst on /template/job-template.yaml | kubectl apply -f - │ spawns Job 0 = k8s-upgrade-preflight-<target_version> ▼ -Job 0 — preflight (pinned: k8s-node1) -Job 1 — master upgrade (pinned: k8s-node1) drains k8s-master -Job 2 — worker (pinned: k8s-node1) drains k8s-node4 -Job 3 — worker (pinned: k8s-node1) drains k8s-node3 -Job 4 — worker (pinned: k8s-node1) drains k8s-node2 -Job 5 — worker (pinned: k8s-master) drains k8s-node1 ← control-plane toleration -Job 6 — postflight (no pinning) +Job 0 — preflight (pinned: first worker) +Job 1 — master upgrade (pinned: first worker) drains k8s-master +Job 2..N — worker (pinned: k8s-master) drains each worker still off-target + ← control-plane toleration; one Job + per worker, enumerated live from + `kubectl get nodes` (covers node5/6 + + any future node automatically) +Job N+1 — postflight (no pinning) ``` Each Job runs `scripts/upgrade-step.sh`, which dispatches on `$PHASE` and ends by spawning the next Job (`envsubst < /template/job-template.yaml | kubectl apply -f -`). Job names are deterministic (`k8s-upgrade-<phase>-<target_version>[-<node>]`) -so `apply` reconciles to a single Job per run — re-running a failed Job -won't duplicate downstream Jobs. 
+so `apply` reconciles to a single Job per run — re-running won't duplicate +downstream Jobs. The detection CronJob and `spawn_next` additionally delete + +re-spawn a terminally-**Failed** Job of the same name (rather than skipping it +on existence), so a transient preflight gate self-heals on the next cycle +instead of wedging the pipeline until the dead Job's 7d TTL expires +(retry-on-failure, added 2026-06-17 after a spurious critical alert stalled +1.34.9 for 5 days). ### Self-preemption history (the reason for the Job-chain rewrite) @@ -304,11 +310,16 @@ each Job's pod and its drain target are always different nodes. ConfigMap, and a `template` ConfigMap into each Job pod. - **Per-node script**: `infra/scripts/update_k8s.sh`. Caller passes `--role master|worker --release X.Y.Z`. Piped via SSH into each node by - upgrade-step.sh. -- **Three Upgrade Gates alerts**: + upgrade-step.sh. The master path runs `kubeadm upgrade apply` with + `--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins + --skip-phases=addon/coredns` so kubeadm never touches CoreDNS (custom Corefile + + separately-tracked image; CoreDNS is pinned off Keel via `keel.sh/policy=never`). + See the runbook's "CoreDNS is NOT upgraded by kubeadm here". +- **Four Upgrade Gates alerts**: - `K8sVersionSkew` — kubelet/apiserver `gitVersion` count >1 for 30m. Catches a half-done rollout. - `EtcdPreUpgradeSnapshotMissing` — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight failing silently. - `K8sUpgradeStalled` — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a chain Job dying without spawning its successor. + - `K8sUpgradeChainJobFailed` — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). 
Catches a phase Job that terminally failed **before `in_flight` was set** (the preflight gates exit pre-metric) — invisible to the two `in_flight`-based alerts above; this was the blind spot behind the 5-day 1.34.9 preflight wedge. Reason-scoped so a retry-success doesn't false-positive (and so it doesn't needlessly block kured). The `unless k8s_upgrade_blocked == 1` clause (2026-06-21) excludes a deliberate compat-gate refusal (owned by `K8sUpgradeBlocked`) so a block doesn't double-fire as a wedge. - **Pushgateway metrics**: - `k8s_upgrade_in_flight` (set in preflight, cleared in postflight) - `k8s_upgrade_snapshot_taken` (set after etcd snapshot Job completes with ≥1 KiB) @@ -334,7 +345,7 @@ The cluster has a single control plane (no HA). A failed `kubeadm upgrade apply` - **Mandatory etcd snapshot before every run** (even patch). Recovery point if master breaks. - **Halt-on-alert before every drain**. Reuses the same Prometheus ignore-list regex kured uses — any unrelated cluster-health alert blocks. Three gate alerts catch upgrade-specific half-states (version skew, missing snapshot, stalled chain). -- **Job pinning eliminates self-preemption**. Each Job's pod runs on a node that is NOT its drain target. k8s-node1 hosts every Job except the one that drains it (which runs on k8s-master with a control-plane toleration). +- **Job pinning eliminates self-preemption**. Each Job's pod runs on a node that is NOT its drain target: the master-drain Job runs on the first worker; every worker-drain Job runs on k8s-master (already upgraded, control-plane toleration). The worker set is enumerated live from `kubectl get nodes`, so new nodes are covered with no script change; SSH targets are node InternalIPs (no DNS dependency). - **Sequential workers with 10-min inter-node soak**. Same risk-bounding as the 24h OS-reboot soak, but tightened because kubelet failures surface within minutes — not hours. - **Master upgrade goes first, workers last**. 
If master breaks, the cluster is already degraded so further worker upgrades would just delay recovery. By upgrading master first, we either succeed (workers can roll afterward) or fail loud (operator triages before any worker is touched). - **No auto-rollback**. kubeadm doesn't support clean downgrade; the snapshot + manual apt rollback in the runbook is the recovery path. diff --git a/docs/architecture/backup-dr.md b/docs/architecture/backup-dr.md index 60c1c77d..c4d0092f 100644 --- a/docs/architecture/backup-dr.md +++ b/docs/architecture/backup-dr.md @@ -77,6 +77,8 @@ The **bypass list** (leg 2) is just `/srv/nfs/immich/` — too big for sda (1.5 - `Synology/Backup/Viki/nfs/` — immich only (post-2026-05-26) - `Synology/Backup/Viki/nfs-ssd/` — **immich-ML only (2026-06-01)**; ollama/llamacpp dropped (re-pullable models, live-only on the SSD) +**VM image backups (added 2026-06-09)**: the hand-managed Linux VMs (those NOT in Terraform — see `compute.md`) were historically **not imaged at all** — only their *contents* reached backup if they happened to host a PVC/NFS path. `vzdump-vms` now takes a daily live `vzdump --mode snapshot` of each configured VMID → `/mnt/backup/vzdump/` (Copy 2), carried offsite by the monthly offsite-sync full pass (Copy 3). **Currently enabled for VMID 102 (devvm)** — the shared workstation, whose per-user home dirs + local-only git repos are otherwise irreplaceable. Extend via `VZDUMP_VMIDS` in the unit. See "VM Image Backups (vzdump)" under How It Works. 
+ ## Architecture Diagram ### Data Routing — where each path goes (post-2026-05-26) @@ -208,13 +210,14 @@ graph LR T0000["00:00 LVM thin snapshots<br/>(lvm-pvc-snapshot)<br/>sdc PVCs CoW"] T0015["00:15 PostgreSQL per-DB dumps<br/>(CronJob)"] T0045["00:45 MySQL per-DB dumps<br/>(CronJob)"] + T0100["01:00 vzdump-vms<br/>live image of hand-managed VMs<br/>(devvm) → sda /mnt/backup/vzdump/"] T0200["02:00 nfs-mirror (daily)<br/>sdc /srv/nfs/* → sda /mnt/backup/<svc>/<br/>~10-20 min steady state"] T0500["05:00 daily-backup<br/>mount LVM snapshots ro<br/>rsync PVC files → /mnt/backup/pvc-data/<br/>+ sqlite + pfsense + pve-config"] T0600["06:00 offsite-sync-backup<br/>Step 1: sda → Synology /Viki/pve-backup/<br/>Step 2: sdc/immich + nfs-ssd → /Viki/nfs[-ssd]/"] T1200["12:00 LVM thin snapshots (midday)<br/>second daily snapshot"] end - T0000 --> T0015 --> T0045 --> T0200 --> T0500 --> T0600 --> T1200 + T0000 --> T0015 --> T0045 --> T0100 --> T0200 --> T0500 --> T0600 --> T1200 INO -.->|change events feed Step 2| T0600 style Nightly fill:#ffe0b2 @@ -322,6 +325,7 @@ graph LR | NFS Change Tracker | Continuous (inotifywait) | PVE host: `nfs-change-tracker.service` | Logs changed NFS file paths to `/mnt/backup/.nfs-changes.log` | | pfSense Backup | Daily 05:00 + daily-backup | PVE host: SSH + API | config.xml + full filesystem tar | | Offsite Sync | Daily 06:00 (after daily-backup) | PVE host: `offsite-sync-backup` | Two-step: sda→pve-backup + NFS→nfs/nfs-ssd via inotify | +| VM Image Backup (vzdump) | Daily 01:00, keep 3 | PVE host: `vzdump-vms` | Live `vzdump` of hand-managed VMs (devvm) → `/mnt/backup/vzdump/` | | PostgreSQL Backup (full) | Daily 00:00, 14d retention | CronJob in `dbaas` namespace | pg_dumpall for all databases | | PostgreSQL Backup (per-db) | Daily 00:15, 14d retention | CronJob in `dbaas` namespace | pg_dump -Fc per database → `/backup/per-db/<db>/` | | MySQL Backup (full) | Daily 00:30, 14d retention | CronJob in `dbaas` namespace | mysqldump 
--all-databases | @@ -352,6 +356,20 @@ Native LVM thin snapshots provide crash-consistent point-in-time recovery for 62 **Restore**: `lvm-pvc-snapshot restore <pvc-lv> <snapshot-lv>` — auto-discovers K8s workload, scales down, swaps LVs, scales back up. See `docs/runbooks/restore-lvm-snapshot.md`. +### VM Image Backups (vzdump) + +The hand-managed Linux VMs are **intentionally not in Terraform** (telmate/bpg provider bugs — see `compute.md`) and were historically **not imaged at all**: nothing took a whole-disk backup of the VM itself. For most that is acceptable — k8s nodes are reprovisioned from cloud-init and their data lives in PVCs covered above. But **devvm** (the shared multi-user Claude Code workstation, VMID 102) holds irreplaceable state that lives nowhere else: per-user home dirs (`~/.claude`, `~/.t3`, shell history), manually-installed tooling, and **local-only git repos** — the monorepo root at `/home/wizard/code` has no git remote. A lost devvm disk = unrecoverable. + +**Script**: `/usr/local/bin/vzdump-vms` on PVE host (source: `infra/scripts/vzdump-vms.sh`). Deploy: `scp infra/scripts/vzdump-vms.sh root@192.168.1.127:/usr/local/bin/vzdump-vms` + `scp infra/scripts/vzdump-vms.{service,timer} root@192.168.1.127:/etc/systemd/system/`, then `systemctl daemon-reload && systemctl enable --now vzdump-vms.timer`. +**Schedule**: Daily 01:00 via systemd timer — ahead of the other backup jobs so the fresh image is on sda before offsite-sync runs. +**Mode**: `vzdump --mode snapshot` — live, no downtime. devvm has the qemu guest agent enabled (`agent: 1`), so the snapshot is **filesystem-consistent** (fs-freeze) rather than merely crash-consistent. Runs `Nice=10` + `IOSchedulingClass=idle` + `--ionice 7` so it never starves etcd on the contended sdc IO domain. +**Scope**: VMIDs in `VZDUMP_VMIDS` (default `102` = devvm). Add VMIDs there to image other hand-managed VMs. 
+**Retention**: `KEEP=3` newest dumps per VMID on sda (`/mnt/backup/vzdump/`); each devvm image is ~35-50 GB zstd. +**Critical dependency**: `nfs-mirror` MUST keep `--exclude='/vzdump/'`. Its nightly `rsync -rlt --delete /srv/nfs/ → /mnt/backup/` treats any `/mnt/backup` dir with no `/srv/nfs` counterpart as an orphan and deletes it — this silently reaped the first two vzdump images at 02:00 on 2026-06-10 before the exclude was added (same reason `pvc-data`/`pfsense`/`pve-config`/`sqlite-backup` are excluded). +**Offsite**: deliberately **NOT** appended to the incremental offsite manifest — it never deletes, so daily multi-GB images would accumulate unbounded on Synology. Instead the **monthly offsite-sync full pass (days 1-7)** mirrors all of `/mnt/backup` (including `vzdump/`) to Synology with `--delete`, bounded to local retention. So Copy 2 (sda) refreshes **daily**; Copy 3 (Synology) refreshes **monthly**. +**Monitoring**: pushes `vzdump_last_run_timestamp` / `vzdump_last_status` / `vzdump_last_success_timestamp` to Pushgateway job `vzdump-backup`. Alerts `VzdumpBackupStale` (>~50h since last success), `VzdumpBackupNeverRun`, `VzdumpBackupFailing` (status≠0) are defined in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (the 3-2-1 group) — **effective on the next `monitoring` stack apply** (metrics already flow, so the alerts arm immediately once applied). +**Restore**: on the PVE host, `qmrestore /mnt/backup/vzdump/vzdump-qemu-<vmid>-<ts>.vma.zst <vmid>` — restore to a spare VMID first if the original still exists, then swap disks; or use the PVE UI (add `/mnt/backup` as a dir storage with content=backup → Restore). + ### Layer 2: Weekly File-Level Backup (sda Backup Disk) **Backup disk**: sda (1.1TB RAID1 SAS) → VG `backup` → LV `data` → ext4 → mounted at `/mnt/backup` on PVE host. Dedicated backup disk, independent of live storage. 
@@ -527,12 +545,16 @@ The btrfs cleaner thread reclaims async — `df` may lag the snapshot-delete by | `/usr/local/bin/lvm-pvc-snapshot` | PVE host: LVM snapshot creation + restore | | `/usr/local/bin/daily-backup` | PVE host: PVC file copy + auto SQLite backup + pfSense | | `/usr/local/bin/offsite-sync-backup` | PVE host: two-step rsync to Synology (sda + NFS via inotify) | +| `/usr/local/bin/vzdump-vms` | PVE host: daily live `vzdump` image of hand-managed VMs (devvm) → `/mnt/backup/vzdump/` | | `/mnt/backup/` | PVE host: sda mount point (1.1TB backup disk) | +| `/mnt/backup/vzdump/` | PVE host: vzdump VM images (keep 3 per VMID), mirrored offsite monthly | | `/mnt/backup/.nfs-changes.log` | NFS change log from inotifywait, consumed by offsite-sync | | `/etc/systemd/system/nfs-change-tracker.service` | inotifywait watcher for `/srv/nfs` + `/srv/nfs-ssd` | | `/etc/systemd/system/lvm-pvc-snapshot.timer` | Daily 03:00 (LVM snapshots) | | `/etc/systemd/system/daily-backup.timer` | Daily 05:00 (file backup) | | `/etc/systemd/system/offsite-sync-backup.timer` | Daily 06:00 (offsite sync) | +| `/etc/systemd/system/vzdump-vms.timer` | Daily 01:00 (VM image backup) | +| `/etc/systemd/system/vzdump-vms.service` | oneshot: `vzdump-vms` (source `infra/scripts/vzdump-vms.{sh,service,timer}`) | | `/usr/local/bin/nfs-mirror` | PVE host: daily 02:00 mirror of /srv/nfs/* → sda /mnt/backup/<svc>/ (Layer 3a) | | `/etc/systemd/system/nfs-mirror.timer` | Daily 02:00 (NFS local mirror to sda) | | `stacks/dbaas/` | Terraform: PostgreSQL/MySQL backup CronJobs | @@ -911,6 +933,9 @@ the 2026-04-22 backup_offsite_sync FAIL (node3 kubelet hiccup at | Uptime Kuma | ✓ | ✓ | — | ✓ | proxmox-lvm | | **Other apps not enumerated above** | ✓¹ | ✓¹ | varies | ✓ | proxmox-lvm / proxmox-lvm-encrypted | | **Postiz** (bundled bitnami PG on local-path) | — | — | ✓ daily pg_dump → NFS | ✓ | local-path + NFS | +| **Hand-managed VMs (not in Terraform)** | +| devvm (workstation, VMID 102) | — | — | ✓ daily 
vzdump image | ✓ monthly | local-lvm (sdc) | +| Other hand-managed VMs (HA 103, registry 220, k8s nodes) | — | — | — gap² | — | local-lvm — see note² | | **Media (NFS)** | | Immich (~800GB) | — | — | — | ✓ | NFS | | Audiobookshelf | — | — | — | ✓ | NFS | @@ -924,6 +949,8 @@ the 2026-04-22 backup_offsite_sync FAIL (node3 kubelet hiccup at **Note**: All proxmox-lvm and proxmox-lvm-encrypted PVCs get LVM snapshots (except `dbaas` and `monitoring` namespaces, excluded for write-amplification reasons) + file-level backup. NFS-backed media syncs directly to Synology `nfs/` and `nfs-ssd/` via inotify change tracking. +² **Hand-managed VMs** — only **devvm (102)** is imaged today (`vzdump-vms`, `VZDUMP_VMIDS=102`). The k8s nodes are deliberately uncovered (reprovisioned from cloud-init; their data lives in the PVCs already backed up above). **home-assistant (103) and docker-registry (220) are a documented gap** — add their VMIDs to `VZDUMP_VMIDS` to image them (registry content is also re-pullable from upstreams; HA has its own add-on backups). pfSense (101) is covered separately by `daily-backup` (config.xml + weekly tar). + ¹ **"Other apps not enumerated above"** — the table only enumerates services worth calling out. The default backup posture for any service using `proxmox-lvm` or `proxmox-lvm-encrypted` (outside `dbaas`/`monitoring`) is **automatic** Layer 1 (LVM thin snapshots, 7d retention) + Layer 2 (file backup, 4 weekly versions on sda) + Layer 3 (offsite to Synology). Auto-discovery is by LV name pattern (`vm-*-pvc-*`), so adding a new service to the cluster gets it covered without any explicit registration. Run `ssh root@192.168.1.127 lvs --noheadings -o lv_name pve | grep '^vm-.*-pvc-' | grep -v _snap_ | wc -l` to see the live count. 
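
The auto-discovery rule in note ¹ can be sketched in Python as a pure filter over LV names. This is a minimal illustration, assuming the names arrive one per line in the style of `lvs --noheadings -o lv_name pve`; the function name `count_pvc_lvs` and the sample names are hypothetical, and only the two patterns (`^vm-.*-pvc-` and `_snap_`) are taken from the command above.

```python
import re

# Same two patterns as the grep pipeline in note 1:
# keep PVC-backed volumes, drop their snapshots.
PVC_LV = re.compile(r"^vm-.*-pvc-")
SNAPSHOT_MARKER = "_snap_"


def count_pvc_lvs(lvs_output: str) -> int:
    """Count PVC-backed LVs in a newline-separated list of LV names."""
    count = 0
    for raw in lvs_output.splitlines():
        name = raw.strip()  # tolerate surrounding whitespace and blank lines
        if PVC_LV.search(name) and SNAPSHOT_MARKER not in name:
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical sample: one PVC volume, its snapshot, and two unrelated LVs.
    sample = "\n".join([
        "vm-9999-pvc-aaaa",
        "vm-9999-pvc-aaaa_snap_20260610",
        "vm-102-disk-0",
        "data",
    ])
    print(count_pvc_lvs(sample))
```

Keeping the filter separate from the `ssh`/`lvs` call makes the pattern easy to check against a saved listing before trusting the live count.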
**Known gaps** — services with PVCs not on the proxmox-lvm path lose Layer 1+2: diff --git a/docs/architecture/chrome-service.md b/docs/architecture/chrome-service.md index b70fe185..6f9c1ee4 100644 --- a/docs/architecture/chrome-service.md +++ b/docs/architecture/chrome-service.md @@ -10,9 +10,14 @@ serves two distinct populations: `chromium.connect_over_cdp("http://chrome-service.chrome-service.svc:9222")` to drive a real browser when upstream anti-bot trips a headless one (`disable-devtool.js` redirect-to-google trap, `navigator.webdriver` - checks, console-clear timing tricks). The only currently-active - in-cluster caller is the `chrome-service-snapshot-harvester` CronJob; - the `stacks/f1-stream/files/backend/playback_verifier.py` + + checks, console-clear timing tricks). Currently-active in-cluster + callers: the `chrome-service-snapshot-harvester` CronJob, and + **tripit's `PlaywrightFareProvider`** (since 2026-06-11, tripit issue + #18 / ADR-0007) — the flight-fare scrape connects per quote, opens a + fresh incognito context, scrapes Google Flights, and closes the + context; rate-limited to one attempt per 30s with a 6h fare cache, so + browser load is negligible. The + `stacks/f1-stream/files/backend/playback_verifier.py` + `chrome_browser.py` tree is a vestigial design — the deployed f1-stream image (built from `github.com/ViktorBarzin/f1-stream`) does not use this code path. @@ -107,17 +112,32 @@ External caller (dev box): @playwright/mcp --isolated --storage-state ~/.cache/...storage-state.json ``` +## Browser binary — real Google Chrome (for proprietary codecs) + +The chrome-service container runs **real Google Chrome**, not the bundled +Chromium, via the infra-owned image `ghcr.io/viktorbarzin/chrome-service-browser` +(`files/chrome/Dockerfile` = `mcr.microsoft.com/playwright:v1.48.0-noble` + +`google-chrome-stable`, built by `.github/workflows/build-chrome-service-browser.yml`). +The launch resolves `CHROMIUM=/opt/google/chrome/chrome`. 
+ +**Why:** the Playwright-bundled Chromium has proprietary codecs **compiled out**, +so H.264/AAC video (Instagram Reels, X, most `.mp4`) fails in the noVNC view with +`MEDIA_ERR_SRC_NOT_SUPPORTED` (the bytes download `200 video/mp4` but there's no +decoder — NOT a GPU issue). Royalty-free codecs (VP9/VP8/AV1 → YouTube) always +worked. Swapping `libffmpeg.so` does NOT help (codecs are compiled out, not just +the lib stripped) and Chrome-for-Testing is also codec-less — only +`google-chrome-stable` carries them. + ## Image pin -Both the server image (`mcr.microsoft.com/playwright:v1.48.0-noble` in -`stacks/chrome-service/main.tf`) and the Python client -(`playwright==1.48.0` in callers' `requirements.txt`) **must match -minor-versions**. Bump in lockstep — Playwright protocol changes between -minors and the client cannot connect to a mismatched server. - -The harvester + snapshot-server sidecar use -`mcr.microsoft.com/playwright/python:v1.48.0-noble` — same playwright -minor, with Python-side bindings pre-installed. +The Playwright base + the Python client (`playwright==1.48.0` in callers' +`requirements.txt`) and the snapshot sidecars +(`mcr.microsoft.com/playwright/python:v1.48.0-noble`) historically had to match +minor-versions. The chrome-service browser is now real Google Chrome (a newer +milestone than the 1.48 Chromium), but the `connect_over_cdp` callers (tripit +fare scrape, `homelab browser`, snapshot-harvester) attach over raw CDP, which is +version-tolerant — verified working against this Chrome. If a future Chrome +milestone breaks a caller, pin Chrome in the Dockerfile or bump the clients. ## Storage @@ -162,7 +182,29 @@ minor, with Python-side bindings pre-installed. `x11vnc` (connected to Xvfb on `localhost:6099`) bridged to `websockify` on port 6080. Service `chrome` maps :80 → :6080 and is exposed via `ingress_factory` at `chrome.viktorbarzin.me`, - Authentik-gated. + Authentik-gated. 
The bare host serves `vnc.html` (image symlinks + `index.html → vnc.html`); add `?autoconnect=true&resize=scale&path=websockify` + to skip the Connect button. The view is **black when no browser window is + open** (idle) — that is normal, not a failed connection. Chrome is launched + with `--window-size=1280,720 --window-position=0,0` to fill the Xvfb screen + (no window manager runs, so without it Chrome opens at its profile-persisted + size and the rest of the framebuffer shows as a black cut-off). + +### noVNC fd-sweep gotcha (stuck "Connecting") + +If the noVNC client hangs on **"Connecting" forever then times out**, the cause +is almost always x11vnc's fd-table sweep: containerd grants pods +`RLIMIT_NOFILE = 2^31`, and x11vnc `fcntl`-sweeps the **entire** fd table on +every client connection, so the RFB handshake never completes (websockify +accepts the WS and logs `connecting to: localhost:5900`, but x11vnc never sends +the `RFB 003.008` banner). Diagnose: `grep "open files" /proc/$(pgrep -n +x11vnc)/limits` (huge = bad) and time the handshake from a sibling container +(`python3 -c "import socket;s=socket.socket();s.connect(('127.0.0.1',5900));print(s.recv(12))"` — +healthy <0.3s, broken hangs). **Fix: cap `ulimit -n 65536` before x11vnc starts** +— done both in `files/novnc/entrypoint.sh` (root) and via the container `command` +wrapper in `main.tf` (so it applies deterministically even though the image is +`:latest`/`IfNotPresent` and won't re-pull a rebuilt entrypoint). Same bug + fix +as the android-emulator stack. - **snapshot-server sidecar** (`mcr.microsoft.com/playwright/python:v1.48.0-noble`) serves `GET /api/snapshot` from `/profile/snapshots/storage-state.json`, bearer-gated by `PW_TOKEN`. Service `chrome-snapshot` maps :8088 → :8088 @@ -175,6 +217,45 @@ minor, with Python-side bindings pre-installed. See `stacks/chrome-service/README.md` for the recipe (label namespace, inject `CHROME_CDP_URL`, vendor `stealth.js`). 
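
The handshake timing check from the noVNC fd-sweep gotcha can be sketched as a small reusable function rather than a one-liner. This is a hedged illustration only: the names `time_rfb_banner` and `looks_healthy` are hypothetical, the default address mirrors the `127.0.0.1:5900` used in the diagnostic above, and the 0.3 s threshold is the "healthy" figure quoted there.

```python
import socket
import time

HEALTHY_SECONDS = 0.3  # "healthy <0.3s" figure from the gotcha section


def time_rfb_banner(host: str = "127.0.0.1", port: int = 5900, timeout: float = 3.0):
    """Return (banner_bytes, elapsed_seconds) for the server's first bytes.

    The timeout applies to both connect and recv, so a server that accepts
    the connection but never sends its banner raises a timeout error
    (an OSError subclass) instead of hanging forever.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # An RFB banner such as b"RFB 003.008\n" is 12 bytes long;
        # recv may return fewer bytes, which is acceptable for a probe.
        banner = sock.recv(12)
    return banner, time.monotonic() - start


def looks_healthy(banner: bytes, elapsed: float) -> bool:
    """True when a banner arrived and the handshake was fast."""
    return banner.startswith(b"RFB ") and elapsed < HEALTHY_SECONDS
```

Run from a sibling container, `looks_healthy(*time_rfb_banner())` returning `False`, or the call raising a timeout, would point at the fd-table sweep described above rather than at websockify.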
+## Driving from OUTSIDE the cluster (`homelab browser`) + +Agents on the devvm reach this browser through the **`homelab browser`** CLI +(`cli/`, ADR-0013) — the packaged, discoverable form of the ad-hoc +`connect_over_cdp` recipe. It is the **escalation path, not the default**: +agents default to the Playwright MCP / headless browser for all routine +automation, and reach for `homelab browser` ONLY when headless is blocked — a +site loads but a gated action (submit/login) silently fails or hangs, the +signature of headless / anti-bot detection. (Same tiered rule lives in +`~/code/CLAUDE.md` and `homelab browser --help`.) + +```text +devvm: homelab browser run flow.js + │ kubectl port-forward svc/chrome-service :9222 (random local port) + ▼ + http://127.0.0.1:<port> ──► chrome-service pod :9222 (CDP) + │ assert /json/version Browser is "Chrome/…", not "HeadlessChrome" + │ node + playwright-core@1.48.2 → connectOverCDP + │ context.addInitScript(stealth.js) ← same vendored file as in-cluster + │ run the user's Playwright script with page/context/browser in scope + └─ port-forward always torn down (success or error) +``` + +Key facts: + +- **port-forward bypasses the `:9222` NetworkPolicy.** It tunnels + API-server→pod, so the devvm needs no `chrome-service.viktorbarzin.me/client` + label — unlike in-cluster callers. +- **Client pinned to the image minor.** The node client is + `playwright-core@1.48.2` (matches `v1.48.0-noble` / Chromium 130), installed + lazily into `~/.cache/homelab/browser-client/`. Bump it in lockstep when the + server image bumps (same rule as the in-cluster Python clients — see "Image + pin" above). +- **Default context is a fresh incognito one** (closed on exit), safe for the + shared browser; `--shared-context` reuses the warmed persistent profile. 
+- **`stealth.js` is vendored** into the CLI (`cli/browser_stealth.js`) as a + byte-identical copy of `files/stealth.js`, guarded by a drift test — so the + CLI's stealth never diverges from the in-cluster callers'. + ## Limits + risks - **Anti-bot vs stealth arms race** — when an upstream beats us (DRM diff --git a/docs/architecture/ci-cd.md b/docs/architecture/ci-cd.md index 8a5990b6..35e041e6 100644 --- a/docs/architecture/ci-cd.md +++ b/docs/architecture/ci-cd.md @@ -2,306 +2,378 @@ ## Overview -The CI/CD pipeline uses a hybrid approach: GitHub Actions for building Docker images (providing free compute for public repos) and Woodpecker CI for deployments (leveraging cluster-internal access). Git pushes trigger GHA builds that produce Docker images with 8-character SHA tags, push to DockerHub, then POST to Woodpecker's API to trigger deployments that update Kubernetes workloads via `kubectl set image`. +**Doctrine (ADR-0002): all image builds and CI compute run OFF-infra.** Every +owned image is built, tested, and linted on **GitHub Actions** (free on public +repos; 2000 free min/mo on private) and pushed to **`ghcr.io/viktorbarzin/<name>`**. +Woodpecker is **deploy-only** — a GHA job POSTs its API with the freshly-built +image tag and Woodpecker runs `kubectl set image` from inside the cluster. +There are **no in-cluster image builds or CI test runs anywhere** — the +in-cluster Woodpecker buildkit and the fallback-build pattern were removed as a +clean cut (ADR-0002, 2026-06-13). The Forgejo container registry is **frozen +and emptied** — break-glass only. + +This breaks the old circular dependency (images needed to repair the cluster +used to be built and stored *inside* it) and keeps build IO + registry pushes +off the homelab spindle. 
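
The hand-off described above (GHA builds and pushes, Woodpecker only runs `kubectl set image`) can be sketched as two small helpers. This is an illustration under stated assumptions, not the pipeline's actual code: the function names are hypothetical, the namespace, deployment, and container arguments are placeholders, and treating the `sha8` tag as the first 8 characters of the commit SHA is an assumption.

```python
REGISTRY = "ghcr.io/viktorbarzin"  # registry prefix named in the doctrine


def image_ref(name: str, commit_sha: str) -> str:
    """Build `ghcr.io/viktorbarzin/<name>:<sha8>` from a commit SHA.

    Assumption: sha8 is simply the first 8 characters of the SHA.
    """
    if len(commit_sha) < 8:
        raise ValueError("need at least 8 characters of the commit SHA")
    return f"{REGISTRY}/{name}:{commit_sha[:8]}"


def set_image_argv(namespace: str, deployment: str, container: str, image: str) -> list[str]:
    """Return the argv for the deploy step; the caller decides how to run it."""
    return [
        "kubectl", "-n", namespace, "set", "image",
        f"deployment/{deployment}", f"{container}={image}",
    ]


if __name__ == "__main__":
    # Hypothetical service name and SHA, for illustration only.
    ref = image_ref("example-app", "0123456789abcdef0123456789abcdef01234567")
    print(" ".join(set_image_argv("example-ns", "example-app", "app", ref)))
```

Building the argv as a list, rather than a shell string, avoids quoting problems if a name ever contains unexpected characters.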
## Architecture Diagram ```mermaid graph LR - A[Git Push] --> B[GitHub Actions] - B --> C[Build Docker Image<br/>linux/amd64, 8-char SHA tag] - C --> D[Push to DockerHub] - D --> E[POST Woodpecker API] - E --> F[Woodpecker Pipeline] - F --> G[Vault K8s Auth<br/>SA JWT] - G --> H[kubectl set image] - H --> I[K8s Deployment] - I --> J[Pull from DockerHub<br/>or Pull-Through Cache] + A[git push Forgejo<br/>viktor/<repo> canonical] --> B[push-mirror sync_on_commit] + B --> C[GitHub mirror<br/>ViktorBarzin/<repo>] + C --> D[GitHub Actions<br/>.github/workflows/build.yml] + D --> E[lint / test] + E --> F[buildx linux/amd64<br/>provenance:false] + F --> G[push ghcr.io/viktorbarzin/<name><br/>:sha8 + :latest] + G --> H[svu tag -> Forgejo canonical] + G --> I[POST Woodpecker deploy repo] + I --> J[.woodpecker/deploy.yml<br/>event: manual] + J --> K[kubectl set image<br/>in-cluster SA cluster-admin] + K --> L[K8s Deployment<br/>pulls from ghcr] - K[Pull-Through Cache<br/>10.0.20.10] -.-> J - L[forgejo.viktorbarzin.me<br/>Private Registry on Forgejo] -.-> J - - style B fill:#2088ff - style F fill:#4c9e47 - style K fill:#f39c12 + style D fill:#2088ff + style J fill:#4c9e47 + style G fill:#f39c12 ``` ## Components -| Component | Version | Location | Purpose | -|-----------|---------|----------|---------| -| GitHub Actions | Cloud | `.github/workflows/build-and-deploy.yml` | Build Docker images, push to DockerHub | -| Woodpecker CI | Self-hosted | `ci.viktorbarzin.me` | Deploy to Kubernetes cluster | -| DockerHub | Cloud | `viktorbarzin/*` | Public image registry | -| Private Registry | Forgejo Packages | `forgejo.viktorbarzin.me/viktor` | Private container images (PAT auth, retention CronJob) — migrated from registry.viktorbarzin.me 2026-05-07 | -| Pull-Through Cache | Custom | `10.0.20.10:5000` (docker.io)<br/>`10.0.20.10:5010` (ghcr.io) | LAN cache for remote registries | -| Kyverno | Cluster | `kyverno` namespace | Auto-sync registry credentials to all namespaces | -| Vault 
| Cluster | `vault.viktorbarzin.me` | K8s auth for Woodpecker pipelines | +| Component | Location | Purpose | +|-----------|----------|---------| +| GitHub Actions | `.github/workflows/build.yml` (per repo) | Build + lint + test + push image; trigger deploy; cut semver tag | +| ghcr.io | `ghcr.io/viktorbarzin/*` | Container registry for ALL owned images (public + private packages) | +| Woodpecker CI | `ci.viktorbarzin.me` | **Deploy-only** — `kubectl set image` in-cluster; plus infra applies + maintenance crons | +| Forgejo | `forgejo.viktorbarzin.me/viktor/<repo>` | **Canonical** git source (push-mirrors to GitHub). Container registry **FROZEN** (break-glass only) | +| Pull-Through Cache | `10.0.20.10:5000/5010/5020/5030/5040` | LAN cache for upstream registries (DockerHub, ghcr, Quay, k8s.gcr, Kyverno) | +| Kyverno | `kyverno` namespace | Syncs `ghcr-credentials` (private-ghcr allowlist) + `registry-credentials` to namespaces | +| Vault | `vault.viktorbarzin.me` | K8s auth for Woodpecker deploy pipelines; CI tokens in `secret/ci/global` + `secret/viktor` | ## How It Works -### Build Flow (GitHub Actions) +### The fleet pattern (every owned app) -1. **Trigger**: Git push to main/master branch -2. **Build**: GHA builds Docker image for `linux/amd64` platform only -3. **Tag**: Image tagged with 8-character commit SHA (e.g., `viktorbarzin/app:a1b2c3d4`) - - `:latest` tags are **never used** to prevent stale pull-through cache issues -4. **Push**: Image pushed to DockerHub public registry -5. **Trigger Deploy**: POST request to Woodpecker API with repo ID and commit SHA +1. **Canonical source = Forgejo** `viktor/<repo>`. A **push-mirror** + (`sync_on_commit`) pushes every commit to the GitHub mirror + `ViktorBarzin/<repo>`. The `.github/workflows/build.yml` is committed on + Forgejo and mirrors over. +2. 
**GHA `build` job** (triggers `on: push: branches: [master]` ONLY — feature + branches mirror but build/deploy nothing, the safety valve): + - lint + test + - `svu` computes the next `vX.Y.Z` from conventional commits and pushes the + tag back to **canonical Forgejo** (GHA secret `FORGEJO_GIT_TOKEN` = + write:repository PAT); `VERSION` is baked into the image + - `docker buildx` `linux/amd64`, **`provenance: false`** (single-manifest — + avoids the orphaned-index-children failure class), push + `ghcr.io/viktorbarzin/<name>:<sha8>` + `:latest` + - `delete-package-versions` keeps the newest ~10 ghcr versions +3. **GHA `deploy` job** POSTs `ci.viktorbarzin.me/api/repos/<id>/pipelines` + (the Woodpecker registration for the **GitHub mirror**, github-forge; GHA + secret `WOODPECKER_TOKEN`) with `IMAGE_TAG` + `IMAGE_NAME`. +4. **`.woodpecker/deploy.yml`** (event: **manual** only, so the raw + Forgejo→GitHub mirror pushes don't fire a tag-less deploy) runs `kubectl set + image deployment/<app> <container>=<image>` in-cluster. The `woodpecker-agent` + SA is `cluster-admin`, so the `bitnami/kubectl` step needs no + kubeconfig/RBAC. The Deployment image is in `lifecycle.ignore_changes` + (`KEEL_IGNORE_IMAGE`) so the SHA tag sticks and `terragrunt apply` doesn't + fight it. CronJobs in owned apps track `:latest` + `imagePullPolicy: Always` + instead of a deploy step. -### Deploy Flow (Woodpecker CI) +**Keel stays enrolled** as a redundant net (finds the deployed SHA already +running → no-op). -1. **Receive Webhook**: Woodpecker API receives deployment trigger from GHA -2. **Authenticate**: Pipeline uses Kubernetes ServiceAccount JWT to authenticate with Vault via K8s auth -3. **Deploy**: `kubectl set image deployment/<name> <container>=viktorbarzin/<app>:<sha>` -4. 
**Notify**: Slack notification on success/failure +**Tooling**: `infra/scripts/offinfra-onboard` + `infra/scripts/offinfra-templates/` +scaffold a repo onto this pattern (mirror, workflow, Woodpecker deploy repo, +old-pipeline removal, default-branch flip). Mirror + workflow commits go via +the Forgejo API over the internal Traefik LB +(`curl --resolve forgejo.viktorbarzin.me:443:10.0.20.203`) since the devvm +can't reach Forgejo's public hairpin. -### Project Migration Status +### ghcr package visibility -**Migrated to GHA (8 projects)**: -- Website -- k8s-portal -- claude-memory-mcp -- apple-health-data -- audiblez-web -- plotting-book -- insta2spotify -- book-search (audiobook-search) +| Visibility | Packages | Pull mechanism | +|------------|----------|----------------| +| **Public** | beadboard, nextcloud-todos, claude-agent-service, claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, chrome-service-novnc, android-emulator | Anonymous | +| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci | `ghcr-credentials` dockerconfigjson | -**Woodpecker-native owned-app builds** (build + push to the Forgejo private -registry + `kubectl set image` rollout, all in one `.woodpecker.yml`; Keel -stays enrolled as a redundant net): `tuya_bridge`, `job-hunter`, `f1-stream`. -`f1-stream` was extracted from this monorepo to `viktor/f1-stream` on -2026-06-05 (Woodpecker repo id 166); the old github source is archived and its -GHA-era Woodpecker repo (id 10) is deactivated. +Private-image pulls use the `ghcr-credentials` dockerconfigjson, cloned by the +kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit +**ALLOWLIST** of private-ghcr namespaces only (NOT cluster-wide; source +`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`). 
Cred = Vault +`secret/viktor/ghcr_pull_token` (a dedicated classic PAT scoped to +`read:packages`, UI-minted 2026-06-15 — no longer the admin `github_pat` alias. +GitHub has no token-mint API, so rotation is manual: re-mint the classic +`read:packages` PAT → `vault kv patch secret/viktor ghcr_pull_token=…` → +targeted apply `module.kyverno.kubernetes_secret.ghcr_credentials` (reads Vault; +avoids the git-crypt `tls-secret-sync` landmine on a locked clone), which +Kyverno then re-syncs to the allowlisted namespaces). -**Woodpecker-only (infra + large apps)**: -- `travel_blog`: 5.7GB content directory exceeds GHA limits -- Infra pipelines: require cluster access (terragrunt apply, certbot, build-cli) +### Migrated apps (issues #13–#27) -### Woodpecker Pipeline Files +f1-stream, job-hunter, tuya_bridge, beadboard, nextcloud-todos, +claude-agent-service, claude-memory-mcp, kms-website, Freedify, +instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`), +fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original +pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website, +k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify, +audiobook-search) now also land on ghcr. 
-Each project contains: -- `.woodpecker/deploy.yml`: kubectl set image + Slack notification -- `.woodpecker/build-fallback.yml`: Legacy full build pipeline (event: deployment, never auto-fires) +### Infra-owned images (issues #29 / #30) -### Woodpecker Repository IDs +Images owned by the infra repo build on GHA workflows **in the infra repo's own +`.github/workflows/`** (the github↔forgejo divergence was deliberately NOT +reconciled — the workflows were added to the GitHub lineage via PR): -Woodpecker API uses numeric IDs (not owner/name): +| Image | Workflow | Destination | +|-------|----------|-------------| +| chrome-service-novnc | `build-chrome-service-novnc.yml` | public `ghcr.io/viktorbarzin/chrome-service-novnc` | +| android-emulator | `build-android-emulator.yml` | public `ghcr.io/viktorbarzin/android-emulator` | +| infra CLI | `build-cli.yml` | DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli` | +| infra-ci | `build-infra-ci.yml` | private `ghcr.io/viktorbarzin/infra-ci` | -| Repo | ID | -|------|------| -| infra | 1 | -| Website | 2 | -| finance | 3 | -| health | 4 | -| travel_blog | 5 | -| webhook-handler | 6 | -| audiblez-web | 9 | -| plotting-book | 43 | -| claude-memory-mcp | 78 | -| infra-onboarding | 79 | +**`infra-ci`** is the image the `.woodpecker/default.yml` apply step and +`drift-detection.yml` run in (proven by pipelines 165/166). `chatterbox-tts` is +already built by tripit's GHA → ghcr. -### Image Registry Flow +The Woodpecker `build-ci-image.yml` and `build-cli.yml` pipelines were +**REMOVED**. Break-glass for infra-ci is now a manual +`.woodpecker/breakglass-infra-ci.yml` (ghcr pull-and-save to the registry VM). -1. **Containerd hosts.toml** redirects pulls from docker.io and ghcr.io to pull-through cache at `10.0.20.10` -2. **Pull-through cache** serves cached images from LAN, fetches from upstream on cache miss -3. 
**Kyverno ClusterPolicy** auto-syncs `registry-credentials` Secret to all namespaces for private registry access -4. **Private registry** has been Forgejo's built-in OCI registry at `forgejo.viktorbarzin.me/viktor/<image>` since 2026-05-07. Auth via PAT (Vault `secret/ci/global/forgejo_push_token` for push, `secret/viktor/forgejo_pull_token` for pull). The pre-migration `registry:2.8.3`-based private registry on `registry.viktorbarzin.me:5050` was the root cause of three orphan-index incidents in three weeks (2026-04-13, 2026-04-19, 2026-05-04 — see `docs/post-mortems/2026-04-19-registry-orphan-index.md` and the full migration writeup at `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md`). The five pull-through caches on `10.0.20.10` (ports 5000/5010/5020/5030/5040) stay in place for upstream registries. -5. **Integrity probe** (`registry-integrity-probe` CronJob in `monitoring` ns, every 15m) walks `/v2/_catalog` → tags → indexes → child manifests via HEAD and pushes `registry_manifest_integrity_failures` to Pushgateway; alerts `RegistryManifestIntegrityFailure` / `RegistryIntegrityProbeStale` / `RegistryCatalogInaccessible` page on broken state. Authoritative check (HTTP API, not filesystem). +### Forgejo container registry — FROZEN -### Infra Pipelines (Woodpecker-only) +Issue #32 wiped all `viktor/*` container packages (~19G reclaimed, `/data` +58%→20%). The registry is **break-glass-only** now; nothing pushes to it. The +`forgejo-cleanup` CronJob stays in `DRY_RUN` (nothing to clean). Pull-through +caches on the registry VM (`10.0.20.10`) are unchanged. See +`docs/runbooks/forgejo-registry-breakglass.md`. + +### Image registry / pull path + +1. **Containerd `hosts.toml`** redirects pulls from docker.io and ghcr.io to the + pull-through cache at `10.0.20.10` (5000 = docker.io, 5010 = ghcr.io). +2. **Pull-through cache** serves cached images from the LAN, fetches upstream on + a miss. +3. 
**Kyverno ClusterPolicies** sync `ghcr-credentials` (private-ghcr allowlist) + and `registry-credentials` to namespaces. + +## Woodpecker — what it still runs + +Woodpecker is **deploy + cluster-touching steps only**: | Pipeline | File | Purpose | |----------|------|---------| -| default | `.woodpecker/default.yml` | Terragrunt apply on push | -| renew-tls | `.woodpecker/renew-tls.yml` | Certbot renewal cron | -| build-cli | `.woodpecker/build-cli.yml` | Build and push to dual registries | -| build-ci-image | `.woodpecker/build-ci-image.yml` | Build `infra-ci` tooling image (triggered by `ci/Dockerfile` change or manual); post-push HEADs every blob via `verify-integrity` step to catch orphan-index pushes | -| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered build for k8s-portal subdirectory | -| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` to `/opt/registry/` on `10.0.20.10` when any managed file changes; bounces containers + nginx per `docs/runbooks/registry-vm.md` | -| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE host | -| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new `docs/post-mortems/*.md` via headless Claude agent | -| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift detection | -| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues | +| per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) | +| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) | +| certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron | +| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) | | provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user 
from Vault spec | +| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` → `10.0.20.10` on change | +| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE | +| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues | +| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new post-mortems | +| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered deploy for the portal | +| breakglass-infra-ci | `.woodpecker/breakglass-infra-ci.yml` | **Manual** ghcr pull-and-save of infra-ci to the registry VM | + +**No build/test pipeline exists on any repo.** Do not (re)introduce one. + +### Woodpecker API + +Uses **numeric repo IDs** (`/api/repos/<id>/pipelines`), NOT owner/name paths +(those return HTML). The deploy registration for each app is the **GitHub +mirror** repo (registered github-forge). IDs are stable across renames and must +be looked up from the Woodpecker UI/DB. + +### Woodpecker YAML gotchas + +- Commands with `${VAR}:${VAR}` must be **quoted** — an unquoted `:` triggers + YAML map parsing when the vars are empty. +- Use `bitnami/kubectl:latest` (not pinned versions — entrypoint compatibility). +- Global secrets must include `manual` in their events list for API-triggered + pipelines. + +### GitHub repo secrets + +Per repo: `WOODPECKER_TOKEN` (POST the deploy pipeline), `FORGEJO_GIT_TOKEN` +(write:repository PAT for the `svu` tag push). ghcr push uses the workflow's +built-in `GITHUB_TOKEN` (`packages: write`). + +## Infra repo CI topology + +The infra repo runs on Woodpecker via **two** forge registrations: the Forgejo +forge (repo id 82, registered 2026-06-08) and the legacy GitHub forge (repo id +1). Pushes to **Forgejo** `master` fire `.woodpecker/default.yml` +(changed-stacks terragrunt apply, in `infra-ci`) plus the `notify-nonadmin-push` +Slack audit step. 
Operational facts (2026-06-10): + +- **Webhook URL is the IN-CLUSTER service**: + `http://woodpecker-server.woodpecker.svc.cluster.local/api/hook?...` (PATCHed + via the Forgejo API). The Woodpecker default (`https://ci.viktorbarzin.me/...`) + resolves to the non-proxied public A record from pods → NAT hairpin → + intermittent `context deadline exceeded`, silently dropping push events. If + Woodpecker "repairs" the repo it rewrites the hook back to `ci.viktorbarzin.me` + — re-apply the in-cluster URL. +- **Repo-scoped secrets must exist on BOTH repos**: pipelines reference + repo-level secrets (`registry_ssh_key`, `pve_ssh_key`, `CLOUDFLARE_TOKEN`, …). + When registering a new forge repo for infra, clone the secret set too. +- **Empty commits defeat path filters**: a commit with no changed files makes + Woodpecker include ALL workflow files (path conditions can't exclude), so every + repo secret must resolve. Normal commits with real files only compile the + matching workflows. + +The Forgejo trigger is not fully dependable — land infra changes by pushing +Forgejo master (as viktor), use `[ci skip]` for docs/no-op commits, and verify +deploys via `scripts/tg` + live cluster state rather than trusting the CI +checkmark. The two remotes have **diverged** (parallel histories under +different SHAs); expect github pushes to reject non-fast-forward and leave them +— never force-push. ## Configuration -### GitHub Actions - -**File**: `.github/workflows/build-and-deploy.yml` +### GitHub Actions (per-app `.github/workflows/build.yml`) ```yaml -name: Build and Deploy +name: build on: push: - branches: [main, master] + branches: [master] jobs: build: runs-on: ubuntu-latest + permissions: + contents: write # svu tag push + packages: write # ghcr push steps: - - name: Build Docker image - run: docker build --platform linux/amd64 -t viktorbarzin/app:${SHORT_SHA} . 
- - name: Push to DockerHub - run: docker push viktorbarzin/app:${SHORT_SHA} - - name: Trigger Woodpecker Deploy + - uses: actions/checkout@v4 + - name: short SHA for the image tag + id: sha + run: echo "sha8=${GITHUB_SHA:0:8}" >> "$GITHUB_OUTPUT" + - name: lint + test + run: make lint test + - name: svu tag -> Forgejo run: | - curl -X POST https://ci.viktorbarzin.me/api/repos/<REPO_ID>/pipelines \ - -H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}" + VERSION=$(svu next) + # ... push tag to canonical Forgejo with FORGEJO_GIT_TOKEN + - uses: docker/setup-buildx-action@v3 + - uses: docker/build-push-action@v6 + with: + platforms: linux/amd64 + provenance: false + push: true + tags: | + ghcr.io/viktorbarzin/<name>:${{ steps.sha.outputs.sha8 }} + ghcr.io/viktorbarzin/<name>:latest + deploy: + needs: build + runs-on: ubuntu-latest + steps: + - name: Trigger Woodpecker deploy + run: | + curl -X POST https://ci.viktorbarzin.me/api/repos/<DEPLOY_REPO_ID>/pipelines \ + -H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}" \ + -d '{"branch":"master","variables":{"IMAGE_TAG":"...","IMAGE_NAME":"..."}}' ``` -**Required GitHub Secrets**: -- `DOCKERHUB_USERNAME` -- `DOCKERHUB_TOKEN` -- `WOODPECKER_TOKEN` - -### Woodpecker Deploy Pipeline - -**File**: `.woodpecker/deploy.yml` +### Woodpecker deploy pipeline (per-app `.woodpecker/deploy.yml`) ```yaml when: - event: [deployment] + event: manual steps: deploy: - image: bitnami/kubectl:latest + image: bitnami/kubectl:latest # uses the in-cluster woodpecker-agent SA (cluster-admin) commands: - - kubectl set image deployment/app app=viktorbarzin/app:${CI_COMMIT_SHA:0:8} - secrets: [k8s_token] - + - "kubectl set image deployment/app app=${IMAGE_NAME}:${IMAGE_TAG} -n <ns>" + - "kubectl rollout status deployment/app -n <ns> --timeout=300s" notify: image: plugins/slack - settings: - webhook: ${SLACK_WEBHOOK} when: status: [success, failure] ``` -**YAML Gotchas**: -- Commands with `${VAR}:${VAR}` syntax must be quoted to prevent YAML map parsing when vars are empty -- Use `bitnami/kubectl:latest` (not pinned versions) -- Global secrets must 
be manually added to `secrets:` list in pipeline +### CI/CD secrets sync -### Vault Configuration - -**K8s Auth for Woodpecker**: -- Woodpecker pipelines authenticate using ServiceAccount JWT -- Vault K8s auth mount validates JWT and issues token -- Policies grant access to secrets and dynamic credentials - -### CI/CD Secrets Sync - -**CronJob**: Pushes `secret/ci/global` from Vault → Woodpecker API every 6 hours -- Keeps Woodpecker global secrets in sync with Vault -- Runs in `woodpecker` namespace +A CronJob in the `woodpecker` namespace pushes `secret/ci/global` from Vault → +the Woodpecker API every 6h, keeping global secrets in sync. Woodpecker deploy +pipelines authenticate to the cluster via the in-cluster `woodpecker-agent` SA +(cluster-admin); Vault K8s auth backs any secret reads. ## Decisions & Rationale -### Why GitHub Actions + Woodpecker? +### Why all builds off-infra (ADR-0002)? -**Alternatives considered**: -1. **Woodpecker-only**: Simple, but wastes cluster resources on builds -2. **GHA-only**: No cluster access, requires kubectl from outside (security risk) -3. **Hybrid (chosen)**: GHA for compute-heavy builds (free), Woodpecker for privileged deployments (secure cluster access) +- **Breaks the circular dependency** — the images needed to repair the cluster + no longer live inside it (they're on ghcr, an external registry). +- **Removes build IO + registry push load** from the contended homelab spindle. +- GHA is free on public repos and generous on private; buildx provenance:false + sidesteps the orphaned-index-children failure class that plagued the + in-cluster registry. +- **Clean cut** — no in-cluster fallback builds anywhere; one pattern, + fleet-wide. -**Benefits**: -- Free compute for builds on public repos -- Cluster access stays internal (Woodpecker has direct K8s access) -- Separation of concerns: build vs deploy +### Why ghcr (not push back to Forgejo)? -### Why 8-Character SHA Tags (Not :latest)? 
+Forgejo's container registry repeatedly orphaned OCI index children +(2026-04-13/19, 2026-05-04, 2026-06-10) and its retention is not container-aware. +ghcr is external (DR-safe), free for this scale, and has native multi-arch +handling. The Forgejo registry was frozen + emptied (issue #32). -- Pull-through cache serves stale `:latest` tags indefinitely -- SHA tags ensure every deployment pulls the correct image -- 8 characters provide sufficient collision resistance (16^8 = 4.3 billion combinations) +### Why Woodpecker stays for deploy? -### Why Numeric Repo IDs for Woodpecker API? +`kubectl set image` needs in-cluster privileged access; doing it from GHA would +mean exposing kube-apiserver or a long-lived kubeconfig. Woodpecker's +`woodpecker-agent` SA is already cluster-admin in-cluster — the deploy step +needs no credentials. -- Woodpecker API requires numeric IDs (not owner/name slugs) -- IDs are stable across repo renames -- Must be manually looked up from Woodpecker UI or database +### Why `event: manual` on deploy.yml? -### Why linux/amd64 Only? +The Forgejo→GitHub push-mirror sends raw, tag-less pushes to the GitHub mirror. +If `deploy.yml` fired on `push`, every mirror sync would trigger a deploy with no +image tag. `manual` means only the GHA `deploy` job's explicit API POST (with +`IMAGE_TAG`) deploys. -- Cluster runs on x86_64 nodes only -- ARM builds would waste time and storage -- Multi-arch images add complexity without benefit +### Why linux/amd64 only? + +The cluster runs on x86_64 nodes only; ARM builds waste time and storage. ## Troubleshooting -### GHA Build Fails: "denied: requested access to the resource is denied" +### GHA build fails: ghcr push "denied" -**Cause**: DockerHub credentials expired or incorrect +The workflow `GITHUB_TOKEN` needs `packages: write` permission and the package +must allow the repo to push. Check the workflow `permissions:` block and the +package's "Manage Actions access" settings. 
+ +### Image pull fails: "ErrImagePull" / "ImagePullBackOff" -**Fix**: ```bash -# Regenerate DockerHub token -# Update GitHub repo secrets: DOCKERHUB_USERNAME, DOCKERHUB_TOKEN +# Public image — check the pull-through cache is up +curl http://10.0.20.10:5010/v2/_catalog + +# Private image — verify the ghcr-credentials Secret exists in the namespace +kubectl get secret ghcr-credentials -n <namespace> +# It's Kyverno-synced to an allowlist; if missing, the namespace isn't on the +# allowlist in stacks/kyverno/modules/kyverno/ghcr-credentials.tf ``` -### Woodpecker Deploy Fails: "Unauthorized" +If the cause is the internal-DNS hairpin (fresh pulls timing out on the public +Forgejo path), see the CoreDNS `viktorbarzin.me` carve-out in +`docs/architecture/networking.md` and `docs/runbooks/registry-vm.md`. -**Cause**: Vault K8s auth token expired or invalid +### Deploy didn't happen after a push -**Fix**: -```bash -# Restart Woodpecker pipeline (token auto-renewed) -# Check Vault K8s auth role exists: vault read auth/kubernetes/role/woodpecker-deployer -``` +Confirm the push was to **master** (feature branches build/deploy nothing). +Check the GHA run completed the `deploy` job, then check Woodpecker received the +manual pipeline (`ci.viktorbarzin.me`, the GitHub-mirror deploy repo). Verify +live with `kubectl rollout status` — not the CI checkmark. 
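The "verify live" step can be sketched as two small shell helpers. This is a minimal illustration, not part of the repo's tooling: the function names `deployed_image` and `image_has_tag` are hypothetical, and it assumes the app container is the first container in the Deployment's pod template.

```shell
# Sketch only: helper names are hypothetical, not from the infra repo.
# Assumes the app container is container index 0 in the pod template.

# Print the image currently set on the first container of a Deployment.
deployed_image() {
  kubectl get deployment "$1" -n "$2" \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
}

# Succeed only if the image reference ends with ":<expected tag>".
image_has_tag() {
  case "$1" in
    *":$2") return 0 ;;
    *) return 1 ;;
  esac
}
```

A possible use, with placeholder app, namespace, and tag: `image_has_tag "$(deployed_image myapp myns)" a1b2c3d4 && kubectl rollout status deployment/myapp -n myns --timeout=300s`. Comparing against the tag that GHA pushed catches the case where the Woodpecker pipeline never ran and the Deployment is still on the previous SHA.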
-### Image Pull Fails: "ErrImagePull" +### Woodpecker deploy fails: "YAML: did not find expected key" -**Cause**: Pull-through cache or registry credentials issue - -**Fix**: -```bash -# Check pull-through cache is running -curl http://10.0.20.10:5000/v2/_catalog - -# Verify registry-credentials Secret exists in namespace -kubectl get secret registry-credentials -n <namespace> - -# Manually sync credentials if missing -kubectl get secret registry-credentials -n default -o yaml | \ - sed 's/namespace: default/namespace: <namespace>/' | kubectl apply -f - -``` - -### Woodpecker Pipeline: "YAML: did not find expected key" - -**Cause**: Unquoted command with `${VAR}:${VAR}` syntax when VAR is empty - -**Fix**: Quote the command: -```yaml -commands: - - "kubectl set image deployment/app app=viktorbarzin/app:${SHORT_SHA}" -``` - -### travel_blog Build Times Out on GHA - -**Cause**: 5.7GB content directory exceeds GHA disk/time limits - -**Fix**: Keep on Woodpecker (no migration). Build uses cluster storage and resources. - -### CI/CD Secrets Out of Sync - -**Cause**: CronJob failed to sync Vault → Woodpecker - -**Fix**: -```bash -# Check CronJob status -kubectl get cronjob -n woodpecker - -# Manually trigger sync -kubectl create job --from=cronjob/sync-secrets manual-sync -n woodpecker -``` +Unquoted command with `${VAR}:${VAR}` syntax when a VAR is empty. Quote the +command (see the deploy.yml example above). 
## Related -- [Databases Architecture](./databases.md) — Database credentials via Vault -- [Multi-Tenancy](./multi-tenancy.md) — Per-user Woodpecker access -- Runbook: `../runbooks/deploy-new-app.md` — How to set up CI/CD for a new app -- Runbook: `../runbooks/troubleshoot-image-pull.md` — Debug image pull issues -- Vault documentation: K8s auth configuration -- Woodpecker documentation: API reference +- ADR: `../adr/0002-all-image-builds-off-infra-gha-ghcr.md` — the decision +- [Databases Architecture](./databases.md) — database credentials via Vault +- [Multi-Tenancy](./multi-tenancy.md) — per-user Woodpecker access +- Runbook: `../runbooks/forgejo-registry-breakglass.md` — using the frozen registry +- Runbook: `../runbooks/registry-vm.md` — pull-through cache VM + image-pull debugging +- Onboarding tool: `../../scripts/offinfra-onboard` + `../../scripts/offinfra-templates/` diff --git a/docs/architecture/compute.md b/docs/architecture/compute.md index d4ccf6e1..0f6ff092 100644 --- a/docs/architecture/compute.md +++ b/docs/architecture/compute.md @@ -22,9 +22,11 @@ graph TB NODE2["VM 202: k8s-node2<br/>8c / 32GB"] NODE3["VM 203: k8s-node3<br/>8c / 32GB"] NODE4["VM 204: k8s-node4<br/>8c / 32GB"] + NODE5["VM 205: k8s-node5<br/>8c / 32GB"] + NODE6["VM 206: k8s-node6<br/>8c / 32GB"] end - subgraph K8s["Kubernetes Cluster v1.34.2"] + subgraph K8s["Kubernetes Cluster v1.34.8"] direction TB subgraph VPA["VPA (Goldilocks - Initial Mode)"] @@ -62,7 +64,7 @@ graph TB | Model | Dell PowerEdge R730 | | CPU | 1x Intel Xeon E5-2699 v4 (22 cores / 44 threads, CPU2 unpopulated) | | Total Cores/Threads | 22 cores / 44 threads | -| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). VMs use ~176GB total (k8s-node1 48GB + 4 K8s VMs x 32GB) | +| RAM | 272GB DDR4-2400 ECC RDIMM physical (10 DIMMs: 8x32G Samsung + 2x8G Hynix). 
K8s VMs use ~240GB total (k8s-node1 48GB + 6 K8s VMs x 32GB) | | GPU | NVIDIA Tesla T4 (16GB GDDR6, PCIe 0000:06:00.0) | | Storage | 1.1TB SSD + 931GB SSD + 10.7TB HDD | | Hypervisor | Proxmox VE | @@ -76,8 +78,10 @@ graph TB | k8s-node2 | 202 | 8 | 32GB | vmbr1:vlan20 | Worker | None | | k8s-node3 | 203 | 8 | 32GB | vmbr1:vlan20 | Worker | None | | k8s-node4 | 204 | 8 | 32GB | vmbr1:vlan20 | Worker | None | +| k8s-node5 | 205 | 8 | 32GB | vmbr1:vlan20 (10.0.20.105) | Worker (joined 2026-05-26) | None | +| k8s-node6 | 206 | 8 | 32GB | vmbr1:vlan20 (10.0.20.106) | Worker (joined 2026-05-26) | None | -**Total Cluster Resources**: 48 vCPUs, ~176GB RAM (k8s-node1 48GB + 4 nodes x 32GB) +**Total Cluster Resources**: 64 vCPUs, ~240GB RAM (k8s-node1 16c/48GB + master and 5 workers at 8c/32GB each) > **All Linux VMs are hand-managed in Proxmox, NOT in Terraform** > (decided 2026-05-26, commit 44c3770a). The telmate/proxmox v3.0.2 @@ -97,7 +101,12 @@ graph TB > PVE host (sources in `infra/scripts/`, install pattern per > `architecture/backup-dr.md`). Timer fires `OnBootSec=5min` + > `OnCalendar=hourly`, so any drift (config restore, manual `qm -> set`, fresh clone) self-heals within the hour. Current caps: +> set`, fresh clone) self-heals within the hour. The script compares +> *normalized option sets*, so an unchanged config is a true no-op — +> until 2026-06-11 a raw string compare (defeated by `qm config`'s +> canonical key order) re-issued `qm set` hourly against running VMs, +> live-rewriting QEMU throttle state via QMP (implicated in the devvm +> I/O stall; see `post-mortems/2026-06-11-devvm-qemu-io-stall.md`). Current caps: > 102 devvm 60/60, 103 home-assistant 40/40, 200 k8s-master 100/60, > 201 k8s-node1 150/120, 202 k8s-node2 150/120, 203 k8s-node3 150/120, > 204 k8s-node4 150/120, 220 docker-registry 40/40. 
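The difference between a raw string compare and a normalized option-set compare can be sketched as follows. This is an illustration of the idea only, not the actual script from `infra/scripts/`; the function names and the `mbps_rd`/`mbps_wr` sample values are assumptions.

```shell
# Sketch only: not the real PVE script; names and sample values are assumed.

# Normalize a comma-separated option string ("k=v,k=v") by sorting its
# fields, so key order no longer affects the comparison.
normalize_opts() {
  printf '%s\n' "$1" | tr ',' '\n' | LC_ALL=C sort | paste -s -d ',' -
}

# Return 0 when two option strings carry the same set of options.
same_opts() {
  [ "$(normalize_opts "$1")" = "$(normalize_opts "$2")" ]
}
```

With this shape, `same_opts "mbps_wr=60,mbps_rd=60" "mbps_rd=60,mbps_wr=60"` succeeds even though the two strings differ byte for byte, so a caller would skip `qm set` and leave the running VM untouched.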
diff --git a/docs/architecture/dns.md b/docs/architecture/dns.md index e90956d2..6150d226 100644 --- a/docs/architecture/dns.md +++ b/docs/architecture/dns.md @@ -258,19 +258,27 @@ The TP-Link AP (dumb AP on 192.168.1.x) does not support hairpin NAT. LAN client Technitium's **Split Horizon AddressTranslation** app post-processes DNS responses for 192.168.1.0/24 clients, translating the public IP to the internal Traefik LB IP: ``` -176.12.22.76 → 10.0.20.200 +176.12.22.76 → 10.0.20.203 ``` +(Was `10.0.20.200` until Traefik's 2026-05-30 move to its dedicated `.203` LB IP.) + **DNS Rebinding Protection** has `viktorbarzin.me` in `privateDomains` to allow the translated private IP without being stripped as a rebinding attack. ### Scope - **Affected**: Non-proxied domains (ha-sofia, immich, headscale, calibre, vaultwarden, etc.) for 192.168.1.x clients - **Not affected**: Cloudflare-proxied domains (resolve to Cloudflare edge IPs, no translation needed) -- **Not affected**: 10.0.x.x and K8s clients (reach public IP via pfSense outbound NAT normally) +- **10.0.x.x clients (k8s nodes, devvm, other VMs)** — handled at the resolver since 2026-06-10: **pfSense Unbound carries a domain override forwarding the whole `viktorbarzin.me` zone to Technitium** (`10.0.20.201`). Technitium's split-horizon zone answers with the zone apex A record, which auto-tracks the live Traefik LB IP (`technitium-ingress-dns-sync` CNAMEs every ingress host hourly; `viktorbarzin-apex-probe` is the drift canary). Every client of pfSense Unbound — all VLANs, k8s nodes included — therefore gets internal answers with **zero per-host configuration** (no `/etc/hosts` pins, no resolved drop-ins; both earlier same-day approaches were removed, nodes are stock). Names not behind Traefik keep distinct records in the zone (e.g. 
`mail.viktorbarzin.me → 10.0.20.1`, verified working on :993/:25; since 2026-06-10 its :443 also works internally — pfSense carries an SNI-routed HAProxy frontend on 443 that sends hostname traffic to Traefik and bare-IP/no-SNI traffic to the webGUI, which moved to :8443; see `docs/runbooks/mailserver-pfsense-haproxy.md`). See `docs/runbooks/pfsense-unbound.md` for the override config + rollback, and `docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md` for the incident that motivated this (kubelet forgejo pulls riding the broken hairpin; the containerd hosts.toml mirror cannot fix it — Traefik 404s bare-IP requests and the registry auth realm is an absolute public URL). + - **devvm**: also covered by a `~viktorbarzin.me → 10.0.20.201` resolved routing domain (predates the pfSense override, provisioned by `setup-devvm.sh`) — redundant-but-harmless belt-and-suspenders. + - **in-cluster PODS are ordinary internal clients too** (since 2026-06-10 evening): CoreDNS's dedicated `viktorbarzin.me:53` block (in `stacks/technitium`, TF-managed) forwards to the Technitium ClusterIP (`10.96.0.53`, same as the `.lan` block), so pods get the same split-horizon answers as everyone else. This works because on k8s 1.34 **pods CAN reach the ETP=Local Traefik LB IP** — kube-proxy short-circuits in-cluster traffic to LB IPs via the cluster path (verified from pods on three non-Traefik nodes; re-verify after major k8s upgrades — the canary is the uptime-kuma `[External]` fleet going red). forgejo stays pinned to Traefik's **ClusterIP** in the same block so CI pushes survive a Technitium outage. History: the block briefly forwarded to `8.8.8.8/1.1.1.1` (morning of 2026-06-10), which kept pods on public IPs and the broken TP-Link NAT loopback — 27 non-proxied `[External]` uptime-kuma monitors dark (beads code-yh33). 
Note: in-cluster `[External]` monitors now test DNS+Traefik+service via the internal path for ALL names, including Cloudflare-proxied ones — genuine edge-path fidelity is the job of a true external vantage (ha-london), not in-cluster probes. + - **Trade-off**: `viktorbarzin.me` resolution via pfSense now depends on in-cluster Technitium (3 replicas). During a full cluster outage the zone SERVFAILs LAN-wide — acceptable, the services behind it are down anyway; node bootstrap images pull via the IP-addressed `10.0.20.10` mirrors, so cold-start self-unwinds. + - **Residual nondeterminism**: nodes keep `94.140.14.14` as a secondary resolver (netplan/qm `--nameserver`). If systemd-resolved fails over to it during a pfSense DNS blip, `.me` answers are public again until it switches back — a rare, self-healing window, accepted. Config is synced to all 3 Technitium instances by CronJob `technitium-split-horizon-sync` (every 6h). +**Superset rule for the internal `viktorbarzin.me` zone**: it is authoritative for every internal client (pods included since 2026-06-10), so it must carry every record type those clients consume — not just ingress A/CNAMEs. The `technitium-ingress-dns-sync` CronJob therefore also maintains the static **mail-auth records** (apex SPF + brevo-code TXT, MX → mail.viktorbarzin.me, `_dmarc`, `mail._domainkey` DKIM), mirrored from the public Cloudflare zone. Without them, rspamd on the mailserver saw `SPF=none` for inbound `@viktorbarzin.me` mail and quarantined it (broke the Brevo email-roundtrip probe, 2026-06-10). If these records change in Cloudflare, update the sync script too. + ## NodeLocal DNSCache A DaemonSet in `kube-system` (`node-local-dns`, image `registry.k8s.io/dns/k8s-dns-node-cache:1.23.1`) runs on every node including the control plane. 
Each pod uses `hostNetwork: true` + `NET_ADMIN` and installs iptables NOTRACK rules so it transparently serves DNS on both: @@ -456,13 +464,21 @@ The zone-sync CronJob (runs every 30min) pushes the following to the Prometheus ### Hairpin NAT Not Working (LAN → *.viktorbarzin.me Fails) -Since 2026-04-19 (Workstream D), pfSense Unbound answers LAN DNS queries -directly instead of forwarding to Technitium, so the Technitium Split Horizon -post-processing does NOT run for 192.168.1.x clients anymore. Non-proxied -services break hairpin on LAN clients again. Options: +**Since 2026-06-10 this is largely solved at the resolver**: pfSense Unbound +carries a domain override forwarding the entire `viktorbarzin.me` zone to +Technitium, so ANY client that queries pfSense (all VLANs + 192.168.1.x +clients pointed at `192.168.1.2`) gets the internal Traefik answer. If +hairpin still fails for a client, first check which resolver it actually +uses — clients on the TP-Link's own DHCP DNS (router/ISP) bypass pfSense +entirely. Options for those: + +(Historical context: 2026-04-19 Workstream D made Unbound answer LAN +queries directly, which had removed the Technitium Split Horizon +post-processing from the LAN path until the 2026-06-10 domain override +restored internal answers at the zone level.) 1. **Switch service to proxied Cloudflare** (preferred) — set `dns_type = "proxied"` in the `ingress_factory` module call; DNS now resolves to Cloudflare edge, hairpin-independent. -2. **Add a local-data override on pfSense Unbound** — under `Services → DNS Resolver → Host Overrides`, set `<service>.viktorbarzin.me → 10.0.20.200` (Traefik LB IP). This is equivalent to what Split Horizon did, applied at the resolver. +2. **Add a local-data override on pfSense Unbound** — under `Services → DNS Resolver → Host Overrides`, set `<service>.viktorbarzin.me → 10.0.20.203` (Traefik LB IP). This is equivalent to what Split Horizon did, applied at the resolver. 3. 
**Revert to prior NAT rdr + Technitium Split Horizon** — documented in `docs/runbooks/pfsense-unbound.md` rollback section. K8s-side Split Horizon is still configured and applies when `*.viktorbarzin.me` queries DO reach Technitium (e.g., from pods that query via CoreDNS → Technitium forwarding for `.viktorbarzin.me` via pfSense). Verify Technitium split-horizon app: @@ -470,7 +486,7 @@ K8s-side Split Horizon is still configured and applies when `*.viktorbarzin.me` 1. Verify Split Horizon app is installed on all instances 2. Check CronJob status: `kubectl get cronjob -n technitium technitium-split-horizon-sync` 3. Run the job manually: `kubectl create job --from=cronjob/technitium-split-horizon-sync test-sh -n technitium` -4. Test: `dig @10.0.20.201 immich.viktorbarzin.me` — should return 10.0.20.200 for 192.168.1.x source +4. Test: `dig @10.0.20.201 immich.viktorbarzin.me` — should return 10.0.20.203 for 192.168.1.x source ### Zone Not Replicating to Secondary/Tertiary diff --git a/docs/architecture/monitoring.md b/docs/architecture/monitoring.md index 28daac25..3c75a345 100644 --- a/docs/architecture/monitoring.md +++ b/docs/architecture/monitoring.md @@ -119,12 +119,18 @@ no `level` stream label. cluster error/warn line counts (5-min window) → `sensor.cluster_log_errors_5m` / `sensor.cluster_log_warnings_5m`, for a compact trend card on the Барзини status view plus a Grafana-link button. Those sensors reach Loki via the Traefik LB IP -`10.0.20.203` + a `Host: loki.viktorbarzin.lan` header (`verify_ssl: false`) -because `loki.viktorbarzin.lan` has **no Technitium record yet** (the -`technitium-ingress-dns-sync` CronJob only creates `.me` CNAMEs + pins -`ingress.viktorbarzin.lan`). **Follow-up:** register `loki.viktorbarzin.lan` in -Technitium (or fix the `*.viktorbarzin.lan` wildcard) so both this sensor and the -Sofia-Pi promtail can resolve it by name instead of pinning the LB IP. +`10.0.20.203` + a `Host: loki.viktorbarzin.lan` header (`verify_ssl: false`). 
+**Update 2026-06-10:** `loki.viktorbarzin.lan` is now **registered in Technitium** +as a CNAME → `ingress.viktorbarzin.lan` (the anchor whose A record auto-tracks the +live Traefik LB IP), added via the Technitium API and AXFR-replicated to all 3 +instances — so it resolves by name LAN-wide. The **PVE host** promtail (see +"External host: pve" below) uses the name directly, with **no `/etc/hosts` pin**. +This HA sensor and the rpi-sofia promtail still pin the LB IP in their own configs +and can drop to the name on next touch (`verify_ssl: false` / `insecure_skip_verify` +stays — the internal `.lan` cert isn't publicly trusted). Per-host `.lan` CNAMEs +are still added manually via the API; auto-managing them in +`technitium-ingress-dns-sync` (today `.me`-only + the `ingress.viktorbarzin.lan` +anchor) remains a follow-up. ### External host: rpi-sofia (Sofia Raspberry Pi) @@ -140,12 +146,29 @@ Query examples (Grafana → Loki): `{job="rpi-sofia-journal"}`, `{job="rpi-sofia **Dashboard** — `dashboards/rpi-sofia.json` ("RPi Sofia", Hardware folder): status, undervoltage/throttle, SoC temp, load, memory, root-fs free + read-only, network. -**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`rpi_under_voltage_occurred==1`), `RpiSofiaHighTemp`. +**Alerts** (group `RPi Sofia` in `prometheus_chart_values.tpl`): `RpiSofiaDown` (`up==0`), `RpiSofiaFilesystemReadonly` (`node_filesystem_readonly{mountpoint="/"}==1` — the SD-failure signature), `RpiSofiaUndervoltage` (`increase(rpi_under_voltage_occurred[1h])>0` — edge-triggered on the sticky bit; the live `rpi_under_voltage_now` bit is too transient to catch at 1-min sampling, so it fires on a *new* brown-out and auto-resolves ~1h later instead of latching until reboot), `RpiSofiaHighTemp`. 
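
The edge-triggered behaviour of `RpiSofiaUndervoltage` can be illustrated with a simplified model. The sketch below is an approximation, not Prometheus itself: it treats `increase()` over a window as the last sample minus the first sample in that window, ignoring extrapolation and counter-reset handling. The sample data and the `window_increase` helper are invented for illustration.

```python
def window_increase(samples: list[tuple[int, int]], now: int, window: int) -> int:
    """Approximate increase(): last minus first sample value inside (now - window, now]."""
    in_window = [value for ts, value in samples if now - window < ts <= now]
    if len(in_window) < 2:
        return 0
    return in_window[-1] - in_window[0]


# Invented data: one sample per minute for 3 hours; the sticky bit flips to 1 at minute 30.
samples = [(minute, 0 if minute < 30 else 1) for minute in range(180)]
WINDOW = 60  # minutes, mirroring the [1h] range in the alert expression

for now in (20, 45, 85, 120):
    level_fires = samples[now][1] == 1                      # old rule: bit == 1
    edge_fires = window_increase(samples, now, WINDOW) > 0  # new rule: increase > 0
    print(now, level_fires, edge_fires)
```

In this model the level-based rule stays true for as long as the sticky bit is set, while the edge-based rule is true only while the 0 to 1 transition is still inside the one-hour window.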
**Recovery** — a systemd hardware watchdog (`RuntimeWatchdogSec=14s`, bcm2835 max ~15s) auto-reboots the Pi on a hard hang instead of leaving it dead for hours. > The cluster side (scrape job, alerts, Loki ingress, dashboard) is Terraform-managed in `stacks/monitoring/`. The **Pi-side** pieces (node_exporter, the textfile collector + timer, promtail, the watchdog config, and the `server=/viktorbarzin.lan/192.168.1.2` dnsmasq split-horizon forward needed to resolve the Loki ingress) are configured by hand on the Pi — it is not under Terraform — and are backed up off-box at `/home/wizard/rpi-sofia-backup/`. The real reliability fix (reflash/replace the SD card) needs on-site access. +### External host: pve (Proxmox hypervisor, 192.168.1.127) + +`pve` is the Proxmox VE host — the hypervisor running **every** VM (pfSense, the k8s nodes, the devvm, HA, Windows). It is not in the cluster. Since 2026-06-10 its **full systemd journal ships to cluster Loki**, closing a gap (the most critical host previously had no central logging) and giving the Wave-1 **S1** security rule its data source (`docs/architecture/security.md`). + +**Why now:** emo's Claude agent was granted **root SSH** to the host (a dedicated shared-root key `emo-pve-agent@devvm`, fingerprint `SHA256:Wd+m0EABlm4RDDykDh85PIYSqe0Al8Hr9AZ+7Ksy4HQ`, reachable as `ssh pve` from the devvm) so he can manage the host (e.g. the R730 fan daemon) via his agent. To keep an audit trail, **snoopy** (enabled via `/etc/ld.so.preload` → `libsnoopy.so`; config `scripts/pve-snoopy.ini`) logs every `execve()` to journald under identifier `snoopy`, and promtail ships it to Loki. + +**Logs** — `promtail` v3.5.1 (amd64) at `/usr/local/bin/promtail`, config `scripts/pve-promtail.yaml`, unit `scripts/pve-promtail.service`.
Ships `/var/log/journal` to `https://loki.viktorbarzin.lan/loki/api/v1/push` (`insecure_skip_verify` — the internal `.lan` cert isn't publicly trusted; the name resolves via the Technitium CNAME above, no `/etc/hosts` pin). Relabels: `unit`, `level`, `identifier`; sshd lines (`identifier=~"sshd.*"`) are re-jobbed to `sshd-pve` so the S1 rule matches. Streams: +- `{job="pve-journal", host="pve"}` — full host journal (kernel, pvestatd, fan-control, NFS, etc.). +- `{job="pve-journal", identifier="snoopy"}` — **command audit** (every execve: `uid login tty sid cwd cmdline`). +- `{job="sshd-pve"}` — sshd auth; an `Accepted publickey ... SHA256:<fp>` line ties a session to a key (e.g. emo's fp above). Feeds S1. + +**Attribution caveat:** all SSH is shared-root, so snoopy `uid`/`login` are always `root`; attribute a command to a person by correlating its `sid`/timestamp with the matching `{job="sshd-pve"}` Accepted-publickey line (key fingerprint). emo's agent arrives SNAT'd as `192.168.1.2`, which is in the S1 allowlist, so legitimate access does not alert. + +Query examples (Grafana → Loki): `{host="pve"}`, `{job="pve-journal", identifier="snoopy"}` (command audit), `{job="sshd-pve"} |= "Accepted publickey"`. + +> Hand-managed (not Terraform), like the rpi-sofia and fan-control pieces: the promtail binary/config/unit and the snoopy enable (`/etc/ld.so.preload`) live on the host (Loki resolves via the Technitium CNAME — no `/etc/hosts` pin). Source-of-truth files: `scripts/pve-promtail.{yaml,service}` + `scripts/pve-snoopy.ini`; deploy steps are in the `pve-promtail.yaml` header. + ### Dell R730 iDRAC: SNMP-primary + Redfish remnant (migrated 2026-06-05) The R730 iDRAC (`192.168.1.4` / `idrac.viktorbarzin.lan`) is monitored by **two** Prometheus jobs, both relabeled to the `r730_idrac_*` prefix (which historically hid which source served what). Design/plan: `docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md`. 
diff --git a/docs/architecture/multi-tenancy.md b/docs/architecture/multi-tenancy.md index 2e66ae21..c64a146c 100644 --- a/docs/architecture/multi-tenancy.md +++ b/docs/architecture/multi-tenancy.md @@ -541,11 +541,33 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1 **RBAC tiers:** `admin` (Viktor — cluster-admin, unlocked tree, secrets) · `power-user` (cluster-wide read-only, NO Secrets, via a dedicated `oidc-power-user-readonly` ClusterRole) · `namespace-owner` (admin in own namespace only). Each session acts as the user's **own** OIDC identity (kubelogin), never the admin's. -**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. +**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). 
Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour. -**Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo at `~/code` — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. Changes are ungated (push ≠ apply); the real boundary is apply-time (`scripts/tg apply` needs an admin Vault token + cluster RBAC). +**Memory — homelab CLI hooks (rolled out 2026-06-21, deploy-fixed 2026-06-22):** the per-user `claude_memory` MCP was retired in favor of the **homelab-memory hooks** — the reconcile's `install_memory` (re)installs four scripts into `~/.claude/hooks/` each run (`homelab-memory-recall.py` UserPromptSubmit recall, `auto-learn.py` Stop-hook extraction, `pre-compact-backup.sh`/`post-compact-recovery.sh`), wires them into `settings.json` if-absent + additive, and removes the old `claude_memory` MCP. **The provisioner binary itself now self-deploys from the repo** (step 0: `bash -n`-gated `install` + re-exec when `scripts/t3-provision-users.sh` differs from `/usr/local/bin/t3-provision-users`, guarded against re-exec loops / DRY_RUN mutation) — added after this very rollout sat committed-but-undeployed for a day (only the manual `setup-devvm.sh` had ever deployed the binary), so the hourly reconcile kept running the pre-memory version and emo/anca silently lost memory (recall + auto-learn never wired). A latent `set -e` abort in `install_memory` (a bare `[[ -d plugin-dir ]] && …` returning non-zero) was also fixed; it had killed the reconcile after the first user the first time it actually ran.
The hooks need a `MEMORY_API_KEY` (or `CLAUDE_MEMORY_API_KEY`) in the user's `settings.json` env — the `homelab` CLI defaults the API URL, so **the key is the only hard requirement**; `install_memory` reuses an existing key and only WARNs if absent (it does NOT mint one — that's an admin Vault step, see Remaining). wizard + emo carry a key from their original MCP setup; **ancamilea is keyless → her memory no-ops until a key is minted.** (`auto-learn.py`'s passive store calls the API directly, so it additionally needs `*_API_URL` in env to avoid its local-SQLite fallback; recall + manual `homelab memory store` go through the URL-defaulting CLI and need only the key.) -**Status (2026-06-08):** built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, **per-user OIDC kubeconfig + the `oidc-power-user-readonly` ClusterRole + emo's `k8s_users` entry (applied + impersonation-verified), and the Authentik `T3 Users` edge gate (applied + verified)**. **Remaining (held / future):** the emo cutover to his own locked clone (Phase 5), the offboarding apply-side (Phase 7), per-user MCP/auth injection, and roster-reconciled `T3 Users` membership. See `../runbooks/offboard-user.md` for deprovisioning. +**Agent skills — vendored own-copies for an allowlist (2026-06-23):** beyond the config-inheritance base (above, which symlinks the admin's `~/.claude/skills` into every user), the reconcile's `install_skills` gives users on the `SKILL_USERS` allowlist (currently `emo`) their OWN copies of a curated skill set vendored in-repo at `scripts/workstation/claude-skills/` (16: the admin's 15 `mattpocock/skills` + `find-skills` from `vercel-labs/skills`). 
It copies each into `~/.agents/skills/<name>` (owned by the user, parent `~/.agents` chowned too — `install -d` leaves intermediates root-owned) and points `~/.claude/skills/<name>` at it with a **relative** symlink (`../../.agents/skills/<name>` — the layout `skills add -g` produces; Claude Code reads `~/.claude/skills/`). **Vendored, NOT `npx skills add`:** upstream drifted off this exact set (`diagnose`→`diagnosing-bugs`, `write-a-skill`→`writing-great-skills` renamed; `caveman` + `zoom-out` unpublished), so npx can't reproduce it — and a per-reconcile GitHub clone + unpinned-CLI dependency has no place in the hourly root job; refresh by re-snapshotting (`claude-skills/README.md`). **if-absent keys on the user's OWN copy** (a real dir under `~/.agents/skills`), so a steady-state reconcile is a no-op AND a stale or cross-user `~/.claude/skills` symlink is healed to the own copy — emo had `grill-me`/`file-issue` symlinked into the admin's home; `grill-me` is now emo's own (`file-issue` is outside the set, left as-is). A real dir squatting a name is never clobbered. Best-effort tail (`return 0`, like `install_memory`). Extend coverage = edit `SKILL_USERS`. + +**Onboarding state self-heals (2026-06-15):** `~/.claude.json` is a single file that ALL of a user's concurrent `claude` processes (the ttyd terminal + their `t3-serve` instance + agent/SDK sessions) read-modify-write, so a stale writer periodically drops top-level keys — including `hasCompletedOnboarding` — which bounces the next *interactive* session back to the first-run "Choose the text style" wizard even though the user is fully logged in (credentials live in the SEPARATE `~/.claude/.credentials.json`, untouched by the race; first observed for emo 2026-06-15). 
The launcher (`skel/start-claude.sh`) now idempotently re-asserts `hasCompletedOnboarding` (+ `lastOnboardingVersion`) in `~/.claude.json` right before it runs `claude` — merge-only, never clobbers other keys, no-op if jq is missing or the file is empty/corrupt. And since the launcher is a per-user copy that `/etc/skel` only seeds at account creation, the reconcile's new `deploy_user_launcher` step re-copies `skel/start-claude.sh` into every non-admin home (copy-if-changed) so launcher edits now reach EXISTING users within the hour — `.tmux.conf` is deliberately NOT re-copied (terminal-lobby appends its own managed section to it). + +**Claude Code runtime — native, per-user (2026-06-15):** `claude` is the **native** install (`~/.local/bin/claude` → `~/.local/share/claude/versions/<v>`, self-updating; `installMethod: native`) — NOT npm-global or npx. It is the runtime for both the ttyd launcher and each `t3-serve` instance. `setup-devvm.sh` installs node ONLY for the `t3` CLI (not claude); per-user native claude is provisioned by the reconcile's `install_user_claude_native` (covers terminal + t3, idempotent, skip-if-present) and self-bootstrapped by `start-claude.sh` on first launch — both via the official `https://claude.ai/install.sh`. The legacy machine-wide `npm install -g @anthropic-ai/claude-code` bootstrap and the launcher's `npx` fallback were removed; existing users had already auto-migrated to native, and the npm-global dir was empty. **PATH (`~/.local/bin`, where the native binary lives):** ensured three ways — `/etc/profile.d/10-local-bin.sh` for login shells (machine-wide, fresh-user-safe), `start-claude.sh` itself (the launcher runs in tmux's non-login env that skips the user's shell rc), and `t3-serve@.service` (`Environment=PATH=…:/home/%i/.local/bin`). 
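
The merge-only re-assertion of `hasCompletedOnboarding` described in the onboarding paragraph above can be sketched as follows. This is an illustrative Python version, not the launcher's actual logic (which the text says relies on jq). The function name, the `version` parameter, and the temp-file-plus-rename write are assumptions added for the example.

```python
import json
import os
import tempfile


def reassert_onboarding(path: str, version: str) -> bool:
    """Set the onboarding keys without touching other keys.

    Returns True if the file was rewritten, False if nothing was done.
    """
    try:
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
    except (OSError, ValueError):
        return False  # missing, empty, or corrupt file: leave it alone
    if not isinstance(data, dict):
        return False
    wanted = {"hasCompletedOnboarding": True, "lastOnboardingVersion": version}
    if all(data.get(key) == value for key, value in wanted.items()):
        return False  # already asserted: no write at all
    data.update(wanted)  # merge: every other top-level key is preserved
    # Assumption: write a temp file in the same directory, then rename over the original.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=2)
    os.replace(tmp, path)
    return True
```

A sketch like this only repairs the dropped keys before launch. It does not remove the underlying read-modify-write race between concurrent `claude` processes.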
+ +**Claude authentication — per-user, self-renewing, Vault-recoverable (2026-06-20):** every roster user logs in with their OWN Enterprise identity; shared `CLAUDE_CODE_OAUTH_TOKEN` injection was removed because environment auth outranks local login and collapses identity/audit/quota. Claude owns access-token refresh in `~/.claude/.credentials.json`. A system template timer (`claude-auth-sync@<user>.timer`, every 6h) renews a dedicated 32-day periodic Vault token, validates Claude with real non-persistent Haiku inference (`auth status` can lie during a 401), backs up only `claudeAiOauth` to `secret/workstation/claude-users/<os_user>`, and performs one atomic Vault restore/retry on failure while preserving `mcpOAuth`. Vault policy `workstation-claude-<os_user>` isolates every path; the roster generates policies for present and future users. A hard refresh-token revocation still requires the affected person to complete SSO—there is no supported noninteractive bypass. Loki alert `WorkstationClaudeAuthInvalid` surfaces exhausted recovery. Runbook: `../runbooks/claude-auth-renew-workstation.md`. + +**Per-user browser MCP — playwright, reproducible from git (2026-06-16):** every user (incl. the admin) gets their OWN isolated `@playwright/mcp` server so their concurrent Claude sessions don't fight over tabs (`--isolated` → a fresh browser context per MCP connection), wired into Claude in **every directory** via a user-scope `~/.claude.json` entry (`playwright → http://localhost:<PLAYWRIGHT_PORT>/mcp`). Mechanism: **system-level template units** `playwright-mcp@<user>.service` + `playwright-snapshot-refresh@<user>.{service,timer}` (`User=%i`, sourced from `scripts/workstation/playwright/`, installed by `setup-devvm.sh` §9e — system manager, so NO systemd --user / linger). 
`roster_engine.py` allocates a sticky per-user `PLAYWRIGHT_PORT` (`PLAYWRIGHT_BASE_PORT=8931`); the reconcile's `install_playwright()` writes it, seeds the chrome-service snapshot token if-absent (staged from Vault `secret/chrome-service` to `/etc/t3-serve/chrome-service-token` by `setup-devvm.sh` §8c, since the hourly root reconcile has no Vault token), wires `~/.claude.json` by running `claude mcp add --scope user` AS the user (clobber-proof + if-absent, so it fixes existing/new/admin without rewriting a populated config), and `enable --now`s the instances (idempotent — never restarts a running server). The `@playwright/mcp` version is **pinned** in the unit (the `@latest`-silently-rolls-the-fleet footgun — see `T3_PIN`). Replaced the earlier hand-made `~/.config/systemd/user/playwright-*` units (one-time idle-gated migration; pre-migration emo/anca had servers running but never wired into their `.claude.json`). Cookie-warming pipeline + ops: `../runbooks/chrome-service-snapshot.md`. + +**Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. Its location depends on the per-user `code_layout` in `roster.yaml`: `single` (default) puts the clone AT `~/code`; `workspace` makes `~/code` a plain directory of per-project clones — the infra clone at `~/code/infra` plus each roster `repos` entry cloned from Forgejo `viktor/<name>` **as the user** (their PAT authenticates, so private repos work; clone failures WARN and retry next hour). Flipping a user to `workspace` auto-migrates their existing `~/code` clone to `~/code/infra` (local branches/dirty state survive; running processes follow the moved inode). ancamilea = workspace + `tripit` since 2026-06-10. The provisioner clones infra anonymously from the public GitHub mirror; **contribute access is wired per-user on top** (see below). 
The apply boundary still holds (`scripts/tg apply` needs an admin Vault token + cluster RBAC), but **pushing `master` is NOT inert** — the Forgejo→Woodpecker webhook fires `.woodpecker/default.yml` (`event: push, branch: master`, `require_approval: forks` only), which terragrunt-applies changed stacks. `master` is **branch-protected on Forgejo** (force-push disabled for everyone — history is append-only; push + merge whitelists = `viktor` + explicitly granted users, deploy keys allowed). **Allow-then-audit (Viktor, 2026-06-10):** `ebarzin` (emo) is on the whitelist and pushes straight to `master` — no PR gate. The tracking burden moves to: (a) **commit messages that record what + why** (the agent instructions in AGENTS.md and the managed claudeMd require the body to paraphrase the user's request), (b) the **`notify-nonadmin-push` Slack audit step** in `.woodpecker/default.yml` — every master push by a non-admin author is posted to Slack (admin pushes are not), and (c) non-admins **never use `[ci skip]`** so every change fires the pipeline (and thus the audit feed). Users NOT on the whitelist fall back to `<user>/<topic>` branches + PRs. **Clones stay fresh automatically** (2026-06-10): the hourly `t3-provision-users` reconcile runs `refresh_user_clone` over every managed clone — the infra clone and any workspace repos (fetch all remotes + fast-forward `master`, ONLY when on master with a clean tree and an upstream — dirty trees and local commits are left alone with a WARN) — and also `wire_forgejo_remote`, which idempotently adds the documented `forgejo` remote + `forgejo/master` upstream to infra clones that predate that contract. `start-claude.sh` does the same freshen at session launch (10s fetch cap per repo so an offline remote never stalls the session; workspace layouts freshen each repo under `~/code`). + +**Contribute access (per non-admin, manual — the anca/tripit PAT precedent):** +1. 
Add their Forgejo user as a **write** collaborator on `viktor/infra` (`PUT /api/v1/repos/viktor/infra/collaborators/<login>`). +2. Mint a PAT — the admin REST endpoint 404s here, use the in-pod CLI: `kubectl -n forgejo exec deploy/forgejo -- su -s /bin/sh git -c "forgejo admin user generate-access-token --username <login> --token-name devvm-infra-git --scopes 'write:repository'"`. +3. Install it in their `~/.git-credentials` (`https://<login>:<token>@forgejo.viktorbarzin.me`, mode 600) + `git config --global credential.helper store`, set `user.name`/`user.email`. +4. The reconcile wires the clone side automatically (`wire_forgejo_remote`): `forgejo` remote + `master` tracking `forgejo/master` on every non-admin infra clone (origin stays the anonymous GitHub mirror). No manual step since 2026-06-10. +5. (Optional — Viktor's call per user) Grant direct master push: add their login to the `master` branch-protection push + merge whitelists (`PATCH /api/v1/repos/viktor/infra/branch_protections/master`). Done for `ebarzin` 2026-06-10. +6. Verify: branch push succeeds; a `master` push succeeds for whitelisted users and is rejected with `Not allowed to push to protected branch` otherwise. + +**Web-terminal session persistence (2026-06-10):** the tmux-based web terminal's named sessions (each running one Claude conversation) survive devvm reboots — `tmux-persist-save.timer` (5-min) snapshots every terminal user's sessions (name, cwd, conversation uuid from argv or the cwd-slug transcript dir) to `/var/lib/tmux-persist/<user>.tsv`, and `tmux-persist-restore.service` recreates missing sessions at boot with `claude --resume <uuid>` (per-session idempotent; also handles partial loss). 
The web terminal also exposes an **on-demand "Restore sessions" button** (terminal-lobby: `tmux-api` `POST /restore` → the validated root `tmux-restore-user` wrapper → `tmux-persist restore <user>`, a single-user mode of the same script): the boot-only restore service never fires when an **OOM kills a user's tmux server *without* a reboot** (the common case under multi-user memory pressure), so the button covers that gap. This is a **tmux/terminal-surface** feature, deliberately outside the t3 namespace: the t3 chat surface persists its own threads (`~/.t3` state, plus the daily `t3-backup-state` dump), and Claude conversations themselves were always durable (`~/.claude/projects/`) — what this adds is the volatile tmux wiring. + +**Status (2026-06-20):** built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, per-user OIDC kubeconfig + the `oidc-power-user-readonly` ClusterRole + emo's `k8s_users` entry (applied + impersonation-verified), the Authentik `T3 Users` edge gate, **the emo Phase-5 cutover (own clone + launcher repoint + `code-shared` removal, completed 2026-06-10) and emo's contribute access (`ebarzin` write collaborator + PAT + protected `master`)**, **per-user `code_layout` with the ancamilea workspace cutover**, per-user playwright browser MCP, and per-user Claude OAuth renewal/Vault recovery. Per the live `/etc/skel` design, non-admin `~/.claude/{rules,skills}` symlinks into the admin base are **kept**. **Remaining (held / future):** the offboarding apply-side (Phase 7), per-user `ha`/`claude_memory`/beads credential injection, and roster-reconciled `T3 Users` membership. See `../runbooks/offboard-user.md` for deprovisioning. 
## Related diff --git a/docs/architecture/networking.md b/docs/architecture/networking.md index 09437069..4659038a 100644 --- a/docs/architecture/networking.md +++ b/docs/architecture/networking.md @@ -4,7 +4,7 @@ Last updated: 2026-04-19 (WS E — Kea DHCP pushes dual DNS per subnet; Kea DDNS ## Overview -The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a comprehensive middleware chain including CrowdSec bot protection, Authentik forward-auth, and rate limiting. All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs. +The homelab network is built on a dual-VLAN architecture with pfSense providing gateway services, Technitium for internal DNS, and Cloudflare for external DNS. Traefik serves as the Kubernetes ingress controller with a middleware chain of anti-AI bot-blocking, Authentik forward-auth, rate limiting, and retry. CrowdSec IP-reputation enforcement is **out-of-band** (not a Traefik hop): banned IPs are dropped in-kernel via nftables on direct hosts and blocked at the Cloudflare edge on proxied hosts (see `docs/architecture/security.md`). All HTTP traffic flows through Cloudflared tunnels, avoiding the need for port forwarding or exposing public IPs. 
## Architecture Diagram @@ -16,12 +16,14 @@ graph TB Traefik[Traefik Ingress<br/>3 replicas + PDB] subgraph "Middleware Chain" - CS[CrowdSec Bouncer<br/>fail-open] + AntiAI[Anti-AI bot-block<br/>fail-open] Auth[Authentik Forward-Auth<br/>3 replicas + PDB] RL[Rate Limiter<br/>429 response] Retry[Retry<br/>2 attempts, 100ms] end + CSdrop[CrowdSec drop<br/>nftables / CF edge<br/>out-of-band, pre-Traefik] + subgraph "Proxmox Host (eno1)" vmbr0[vmbr0 Bridge<br/>192.168.1.127/24] vmbr1[vmbr1 Internal<br/>VLAN-aware] @@ -53,8 +55,9 @@ graph TB Internet -->|DNS query| CF CF -->|CNAME to tunnel| CFD CFD --> Traefik - Traefik --> CS - CS --> Auth + CSdrop -.->|banned IPs dropped before Traefik| Traefik + Traefik --> AntiAI + AntiAI --> Auth Auth --> RL RL --> Retry Retry --> Service @@ -82,7 +85,7 @@ graph TB | Cloudflare DNS | SaaS | External | ~50 public domains under viktorbarzin.me | | Cloudflared | Container | K8s (3 replicas) | Tunnel ingress, replaces port forwarding | | Traefik | Helm chart | K8s (3 replicas + PDB) | Ingress controller, HTTP/3 enabled | -| CrowdSec | Helm chart | K8s (LAPI: 3 replicas) | Bot protection, fail-open bouncer | +| CrowdSec | Helm chart | K8s (LAPI: 3 replicas) | IP reputation. Out-of-band enforcement: `cs-firewall-bouncer` DaemonSet (in-kernel nftables drop, direct hosts) + Cloudflare edge WAF rule (proxied hosts). Fail-open | | Authentik | Helm chart | K8s (3 replicas + PDB) | SSO, forward-auth middleware | | MetalLB | v0.15.3 Helm chart | K8s | LoadBalancer IPs (10.0.20.200-10.0.20.220), all services on 10.0.20.200 | | Registry Cache | Container | 10.0.20.10 | Pull-through for docker.io:5000, ghcr.io:5010 | @@ -208,24 +211,31 @@ VMs tag traffic on vmbr1 to isolate workloads. pfSense bridges VLAN 20 to the up ### Ingress Flow +CrowdSec is **not** a step in this chain — banned IPs are dropped before the +request ever reaches Traefik (Cloudflare edge WAF rule on proxied hosts; host +nftables on direct hosts). 
The flow below is for a request that survives that +out-of-band gate. + ```mermaid sequenceDiagram participant Client - participant Cloudflare + participant CFedge as Cloudflare (edge WAF: crowdsec_ban block) participant Cloudflared participant Traefik - participant CrowdSec + participant AntiAI participant Authentik participant RateLimit participant Retry participant Service participant Pod - Client->>Cloudflare: HTTPS request to blog.viktorbarzin.me - Cloudflare->>Cloudflared: Forward via tunnel (QUIC) + Client->>CFedge: HTTPS request to blog.viktorbarzin.me + Note over CFedge: banned IP → blocked here (proxied hosts) + CFedge->>Cloudflared: Forward via tunnel (QUIC) Cloudflared->>Traefik: HTTP to LoadBalancer IP - Traefik->>CrowdSec: Apply bouncer middleware - CrowdSec->>Authentik: If allowed, check auth (protected=true) + Note over Traefik: on direct hosts, banned IPs already dropped in-kernel (nftables forward hook) + Traefik->>AntiAI: anti-AI bot-block (fail-open) + AntiAI->>Authentik: If allowed, check auth (protected=true) Authentik->>RateLimit: If authenticated, check rate limit RateLimit->>Retry: If within limit, continue Retry->>Service: Forward to Service @@ -234,24 +244,27 @@ sequenceDiagram Service-->>Retry: Response Retry-->>RateLimit: Response RateLimit-->>Authentik: Response (strip auth headers) - Authentik-->>CrowdSec: Response - CrowdSec-->>Traefik: Response + Authentik-->>AntiAI: Response + AntiAI-->>Traefik: Response Traefik-->>Cloudflared: Response - Cloudflared-->>Cloudflare: Response via tunnel - Cloudflare-->>Client: HTTPS response + Cloudflared-->>CFedge: Response via tunnel + CFedge-->>Client: HTTPS response ``` ### Middleware Chain -Every ingress created by the `ingress_factory` module follows this chain: +CrowdSec IP-reputation enforcement is **not** in this chain — it is out-of-band +(host nftables on direct hosts; the Cloudflare edge WAF `crowdsec_ban` rule on +proxied hosts), so banned IPs never reach the chain and there is no 
per-request +CrowdSec hop. Every ingress created by the `ingress_factory` module follows this +Traefik chain: -1. **CrowdSec Bouncer**: Checks IP against threat database. **Fail-open** mode — if LAPI is unreachable, traffic passes through to prevent outages. +1. **Anti-AI bot-block** (`ai-bot-block` ForwardAuth, on by default via `ingress_factory`): blocks/tarpits known AI crawlers. **Fail-open** (currently a no-op `return 200` — poison-fountain scaled to 0; see `docs/architecture/security.md`). 2. **Authentik Forward-Auth** (if `protected = true`): SSO authentication via OIDC. Non-authenticated users are redirected to login. Auth headers are stripped before forwarding to backend. -3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default limits are generous; services like Immich and Nextcloud have higher custom limits. +3. **Rate Limiting**: Per-IP throttling. Returns **429 Too Many Requests** (not 503) when limit exceeded. Default is `rate-limit` (average 10 req/s, burst 50). Services whose clients legitimately burst harder get a dedicated middleware via `skip_default_rate_limit = true` + `extra_middlewares`: Immich (`immich-rate-limit`, 1000/20000, photo uploads) and ActualBudget (`actualbudget-rate-limit`, 50/300 — the Actual web app boots with ~70 parallel asset/migration revalidations; the default burst 429'd the tail and stalled every page load). 4. **Retry**: 2 attempts with 100ms delay on transient failures (5xx errors, connection errors). Additional middleware: -- **Anti-AI**: On by default via `ingress_factory`. Blocks common AI crawler user-agents. - **HTTP/3 (QUIC)**: Enabled globally on Traefik. 
### Entrypoint Transport Timeouts @@ -348,10 +361,10 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac | pfSense | `stacks/pfsense/` | VM + cloud-init config | | Technitium | `stacks/technitium/` | Deployment, Service, PVC | | Traefik | `stacks/platform/` (sub-module) | Helm release, IngressRoute CRDs | -| CrowdSec | `stacks/platform/` (sub-module) | Helm release, LAPI + bouncer | +| CrowdSec | `stacks/crowdsec/` (+ edge in `stacks/rybbit/`) | Helm release, LAPI + agent; `cs-firewall-bouncer` DaemonSet (nftables, direct hosts) + Cloudflare edge sync (proxied hosts) | | Authentik | `stacks/authentik/` | Helm release, ingress, OIDC configs | | MetalLB | `stacks/platform/` (sub-module) | Helm release, IPAddressPool | -| Cloudflared | `stacks/cloudflared/` | Deployment (3 replicas), tunnel config | +| Cloudflared | `stacks/cloudflared/` | Deployment (3 replicas), tunnel config; runs `--no-autoupdate` (in-place self-updates exited the pods and severed all tunnel WebSockets, 2026-06-09/10) | | ingress_factory | `modules/ingress_factory/` | IngressRoute + middleware chain | ### Key Configuration Files @@ -436,13 +449,30 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac **Decision**: Technitium handles internal `.lan` domains with near-zero latency. Cloudflare handles public domains with global DNS. K8s nodes use Technitium as primary, which forwards non-.lan queries to Cloudflare. -### Why Fail-Open on CrowdSec Bouncer? +### Why CrowdSec Enforcement Is Out-of-Band (and Fails Open) -**Alternatives considered**: -1. **Fail-closed**: Maximum security, but LAPI downtime blocks all traffic. -2. **Redundant LAPI**: Already scaled to 3 replicas, but resource pressure can still cause outages. +CrowdSec used to enforce inline as a Traefik middleware (the +`crowdsec-bouncer-traefik-plugin`). 
On Traefik 3.7.5 the Yaegi plugin handler was +never invoked, so it enforced nothing; the plugin was removed and enforcement +moved off the request path entirely (full history in +`docs/architecture/security.md`). It now runs on two surfaces: -**Decision**: Availability > strict bot blocking. CrowdSec LAPI is scaled to 3 replicas for resilience, but during cluster-wide resource exhaustion (e.g., memory pressure), bouncer falls back to allowing traffic. This prevents a complete service outage due to a security add-on. +- **Direct hosts** → `cs-firewall-bouncer` DaemonSet drops banned IPs in the host + nftables, in **both the `input` and `forward` hooks**. The `forward` hook is + the load-bearing one: with Traefik on a dedicated LB IP at + `externalTrafficPolicy=Local`, client packets are DNAT'd to the Traefik **pod** + and transit the node's `forward` chain (not `input`) — which is exactly why the + ingress must preserve the **real client IP** end-to-end (ETP=Local + PROXY-v2 + for IPv6; see the Traefik LB IP and IPv6 ingress notes above). Without the real + client IP the firewall-bouncer (and the CF edge rule) would have nothing to + match on. +- **Proxied hosts** → a Cloudflare edge WAF rule (`ip.src in $crowdsec_ban`) fed + by the `crowdsec-cf-sync` CronJob. + +Both **fail open**: if LAPI is unreachable, the firewall-bouncer simply stops +receiving new decisions (existing drops persist) and the CF sync skips a run — +neither ever blocks legitimate traffic. Availability > strict bot blocking, and +out-of-band enforcement adds **zero per-request latency** (no Traefik hop). ### Why HTTP/3 (QUIC)? @@ -473,9 +503,10 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac **Symptoms**: All ingress routes return 503, Traefik dashboard shows no backends available. -**Diagnosis**: Middleware chain is blocking traffic. Check: -1. Authentik status: `kubectl get pod -n authentik` -2. 
CrowdSec LAPI status: `kubectl get pod -n crowdsec` +**Diagnosis**: Middleware chain is blocking traffic. (CrowdSec is **not** in the +chain — a CrowdSec/LAPI outage cannot cause 503s; it only stops new bans.) Check: +1. Authentik status: `kubectl get pod -n authentik` (ForwardAuth fails closed if the auth server is unreachable) +2. `bot-block-proxy` status: `kubectl get pod -n traefik -l app=bot-block-proxy` (anti-AI ForwardAuth target — also fails closed if down) 3. Traefik logs: `kubectl logs -n kube-system deploy/traefik` **Fix**: If Authentik is down and ingress uses forward-auth, pods won't pass health checks. Scale Authentik to 3 replicas or temporarily disable forward-auth middleware. @@ -515,11 +546,11 @@ Containerd on all K8s nodes uses `hosts.toml` to redirect pulls to the local cac ### Rate Limiter Blocks Legitimate Traffic -**Symptoms**: Users report 429 errors during normal usage (e.g., Immich uploads). +**Symptoms**: Users report 429 errors during normal usage (e.g., Immich uploads, ActualBudget's "Server returned an error while checking its status" boot screen). **Diagnosis**: Check Traefik middleware config for the affected IngressRoute. -**Fix**: Increase rate limit in `ingress_factory` module. Default is 100 req/min per IP. Immich and Nextcloud use 500 req/min. +**Fix**: Give the service a dedicated higher-limit middleware (don't loosen the shared default): define `<service>-rate-limit` in `stacks/traefik/modules/traefik/middleware.tf`, then set `skip_default_rate_limit = true` + `extra_middlewares = ["traefik-<service>-rate-limit@kubernetescrd"]` on its `ingress_factory` call. Shared default is average 10 req/s / burst 50; Immich uses 1000/20000, ActualBudget 50/300. 
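
The 50-vs-70 arithmetic behind that fix can be checked with a small sketch. It assumes the limiter behaves like a plain token bucket (which is how Traefik's RateLimit middleware is documented) and that the roughly 70 boot-time requests arrive at effectively the same instant from one client IP. `TokenBucket` and `count_rejected` are hypothetical names for illustration, not code from this repo or from Traefik.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Simplified token bucket: refills at `average` tokens/s, capped at `burst`."""

    average: float
    burst: int
    tokens: float = 0.0
    last: float = 0.0

    def __post_init__(self) -> None:
        # Assumption: the bucket starts full, so a fresh client can burst immediately.
        self.tokens = float(self.burst)

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the previous request, capped at `burst`.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.average)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the real middleware would answer 429 here


def count_rejected(average: float, burst: int, requests: int, at: float = 0.0) -> int:
    """Count how many of `requests` simultaneous requests the bucket rejects."""
    bucket = TokenBucket(average=average, burst=burst)
    return sum(not bucket.allow(at) for _ in range(requests))


if __name__ == "__main__":
    # Shared default (average 10, burst 50) against ~70 parallel requests.
    print("default 10/50:", count_rejected(10, 50, 70), "rejected")
    # Dedicated ActualBudget limit (average 50, burst 300).
    print("actualbudget 50/300:", count_rejected(50, 300, 70), "rejected")
```

Under these assumptions the shared default rejects the tail of the burst while the dedicated limit admits all of it, which matches the symptom described above. Real traffic is spread over time and refills the bucket between requests, so treat the numbers as a worst case rather than a measurement.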
### Large Downloads or Uploads Truncate / Fail Partway diff --git a/docs/architecture/security.md b/docs/architecture/security.md index 6b3e794b..7d3043ea 100644 --- a/docs/architecture/security.md +++ b/docs/architecture/security.md @@ -2,40 +2,50 @@ ## Overview -The homelab implements defense-in-depth security at the application layer (L7) using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). All security components operate in graceful degradation mode (fail-open) to prevent cascading failures. Security policies are deployed in audit mode first, then selectively enforced after validation. +The homelab implements defense-in-depth security using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). CrowdSec enforcement is **out-of-band** (not a per-request Traefik hop — see the CrowdSec section): banned IPs are dropped in-kernel via nftables on direct hosts, and blocked at the Cloudflare edge on proxied hosts, so enforcement adds **zero per-request latency**. All security components fail open (a CrowdSec outage stops new bans but never blocks legitimate traffic). Security policies are deployed in audit mode first, then selectively enforced after validation. ## Architecture Diagram +CrowdSec enforcement is out-of-band (NOT an inline Traefik middleware hop). The +Traefik request chain is anti-AI → Authentik ForwardAuth → rate-limit → retry; +CrowdSec drops banned IPs *before* (direct hosts) or *off* (proxied hosts) that +chain entirely. 
+ ```mermaid -graph LR +graph TB Internet[Internet] - CF[Cloudflare WAF] + + subgraph "Proxied hosts (orange-cloud)" + CFedge[Cloudflare edge<br/>WAF rule: ip.src in $crowdsec_ban → block] + end + subgraph "Direct hosts (grey-cloud / internal)" + NFT[Host nftables<br/>table crowdsec/crowdsec6<br/>drop in input + forward] + end + Tunnel[Cloudflared Tunnel] - CrowdSec[CrowdSec Bouncer<br/>Traefik Plugin] - AntiAI[Anti-AI Check<br/>poison-fountain] - ForwardAuth[Authentik ForwardAuth] - RateLimit[Rate Limit Middleware] - Retry[Retry Middleware<br/>2 attempts, 100ms] + Traefik[Traefik<br/>anti-AI → Authentik → rate-limit → retry] Backend[Backend Service] LAPI[CrowdSec LAPI<br/>3 replicas] - Agent[CrowdSec Agent] + Agent[CrowdSec Agent<br/>parses Traefik logs] + FWB[cs-firewall-bouncer<br/>DaemonSet, every node] + CFsync[crowdsec-cf-sync<br/>CronJob, every 2 min] - Internet -->|1| CF - CF -->|2| Tunnel - Tunnel -->|3| CrowdSec - CrowdSec -.->|Query| LAPI - Agent -.->|Report| LAPI - CrowdSec -->|4. Pass/Block| AntiAI - AntiAI -->|5. Human/Bot| ForwardAuth - ForwardAuth -->|6. Authenticated| RateLimit - RateLimit -->|7. Under Limit| Retry - Retry -->|8. Success/Retry| Backend + Internet -->|proxied| CFedge + Internet -->|direct| NFT + CFedge -->|allowed| Tunnel + Tunnel --> Traefik + NFT -->|allowed| Traefik + Traefik --> Backend - style CrowdSec fill:#f9f,stroke:#333 - style AntiAI fill:#ff9,stroke:#333 - style ForwardAuth fill:#9f9,stroke:#333 - style RateLimit fill:#99f,stroke:#333 + Agent -.->|report| LAPI + LAPI -.->|all decisions incl. 
CAPI| FWB + FWB -.->|program drop rules| NFT + LAPI -.->|ban/captcha decisions, CAPI excluded| CFsync + CFsync -.->|push IP list| CFedge + + style CFedge fill:#f9f,stroke:#333 + style NFT fill:#f9f,stroke:#333 ``` ## Components @@ -44,7 +54,8 @@ graph LR |-----------|---------|----------|---------| | CrowdSec LAPI | Pinned | `stacks/crowdsec/` | Local API, threat intelligence aggregation (3 replicas) | | CrowdSec Agent | Pinned | `stacks/crowdsec/` | Log parser, scenario detection | -| CrowdSec Traefik Bouncer | Plugin | Traefik config | Plugin-based IP reputation check | +| cs-firewall-bouncer | v0.0.34 | `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf` | In-kernel nftables drop on every node (DIRECT hosts). Bouncer key `firewall` | +| crowdsec-cf-sync | — | `stacks/rybbit/crowdsec_edge.tf` | LAPI→Cloudflare-IP-List sync CronJob (PROXIED hosts). Bouncer key `kvsync` | | Kyverno | Pinned chart | `stacks/kyverno/` | Policy engine for K8s admission control | | poison-fountain | Latest | `stacks/poison-fountain/` | Anti-AI bot detection and tarpit service | | cert-manager/certbot | - | `stacks/cert-manager/` | TLS certificate management | @@ -54,11 +65,15 @@ graph LR ### Request Security Layers -Every incoming request passes through 6 security layers: +CrowdSec IP-reputation enforcement happens **before** a request reaches the +Traefik chain (banned IPs are dropped in-kernel on direct hosts, or blocked at +the Cloudflare edge on proxied hosts — see CrowdSec Threat Intelligence below). +A request that survives that out-of-band gate then passes through the Traefik +middleware chain: -1. **Cloudflare WAF** - DDoS protection, bot detection, firewall rules (external) -2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP -3. **CrowdSec Bouncer** - IP reputation check against LAPI (fail-open on error) +1. **Cloudflare WAF / edge** - DDoS protection, bot detection, firewall rules incl. the CrowdSec `crowdsec_ban` block rule (proxied hosts only) +2. 
**Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP (proxied hosts) +3. **CrowdSec out-of-band drop** - nftables on direct hosts; *not* a Traefik hop (zero per-request latency) 4. **Anti-AI Scraping** - 3-layer bot defense (optional per service, updated 2026-04-17) 5. **Authentik ForwardAuth** - Authentication check (if `protected = true`) 6. **Rate Limiting** - Per-source IP rate limits (returns 429 on breach) @@ -80,11 +95,71 @@ CrowdSec operates in a hub-and-agent model: - Reports malicious IPs to LAPI - Shares threat intel with CrowdSec community (anonymized) -**Traefik Bouncer Plugin**: -- Integrated as Traefik middleware -- Queries LAPI for IP reputation on each request -- **Fail-open mode**: If LAPI unreachable, allows traffic (graceful degradation) -- Blocks IPs on ban list, allows others +Enforcement is split across **two out-of-band surfaces**, neither of which adds +any per-request latency. (See "Why the Traefik bouncer plugin was removed" below +for the supersession history — there is no longer an inline Traefik bouncer.) + +**Surface 1 — DIRECT (non-Cloudflare-proxied) hosts → in-kernel nftables drop** +(`cs-firewall-bouncer` DaemonSet, `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf`): +- Runs on **every node** (no nodeSelector). Programs the HOST nftables — `table ip + crowdsec` / `table ip6 crowdsec6` — with drop rules in **both the `input` AND + the `forward` hooks**. The `forward` hook is required because Traefik is a + LoadBalancer with `externalTrafficPolicy=Local`: client traffic is DNAT'd to the + Traefik **pod** and transits the node's `forward` hook (not `input`) with the + real client IP preserved. Chains use `policy accept` (only set members drop — + it can never blackhole normal traffic). +- Pulls **all** decisions from LAPI, **including the CAPI community blocklist + (~31k IPs)**. Packets from banned IPs are dropped **in-kernel before reaching + Traefik** → zero per-request hops, no Traefik involvement at all. 
+- **Packaging**: cs-firewall-bouncer publishes no container image, so the + **v0.0.34** static binary is fetched at runtime by an initContainer onto a + `debian:bookworm-slim` runtime container. Needs `hostNetwork` + + `NET_ADMIN`/`NET_RAW` to talk netlink directly. Registered bouncer key: + **`firewall`**. +- **Fail-open**: if LAPI is unreachable it just stops receiving new decisions + (existing drop rules persist); it never blocks legitimate traffic. + +**Surface 2 — PROXIED (Cloudflare orange-cloud) hosts → Cloudflare edge block** +(`stacks/rybbit/crowdsec_edge.tf` + `lapi_kv_sync.py`): +- Proxied hosts terminate at the Cloudflare edge, so a host-level nftables drop + would never see them. Enforcement is instead a single Cloudflare Rules List + **`crowdsec_ban`** + a zone-scoped WAF custom rule `(ip.src in $crowdsec_ban)` + → **block** action, which covers every proxied host in the zone. +- Fed by the **`crowdsec-cf-sync` CronJob** (namespace `rybbit`, every 2 min, + pure-stdlib Python in a ConfigMap). It pulls local **ban/captcha ip-scoped** + decisions and pushes them into the CF list, but **EXCLUDES the ~31k CAPI + community blocklist** — that set is far too large for a CF Rules List (the CF + account hard-limits to **one** list), and CAPI is already covered in-kernel on + direct hosts and by Cloudflare's own managed protections on proxied hosts. + Registered bouncer key: **`kvsync`**. +- **Block-only**: the single-list limit precludes a separate + captcha/managed-challenge list, so both ban and captcha decisions are enforced + as a plain block at the edge. +- **Auth carve-out:** the WAF rule excludes `authentik.viktorbarzin.me` + + `public-auth.viktorbarzin.me` (`… and not (http.host in {…})`). A CrowdSec hit + must never wall a user out of the login / WebAuthn flow they authenticate + through; auth keeps `traefik-rate-limit` for brute-force protection. 
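
The selection rule for the edge list (local ban/captcha decisions, ip-scoped, CAPI excluded) could look roughly like the sketch below. This is not the contents of `lapi_kv_sync.py`, which is not shown here. The field names `origin`, `type`, `scope`, and `value` are assumed to match the decision objects LAPI returns, `edge_ban_ips` is a hypothetical helper, and the sample decisions are made-up documentation addresses.

```python
from typing import Iterable

# Remediation types that the edge rule enforces as a plain block.
EDGE_TYPES = {"ban", "captcha"}


def edge_ban_ips(decisions: Iterable[dict]) -> list[str]:
    """Return the sorted, de-duplicated IPs that belong in the edge block list."""
    ips = set()
    for decision in decisions:
        # Skip the community blocklist: too large for the single CF list.
        if (decision.get("origin") or "").upper() == "CAPI":
            continue
        # Only single-IP decisions; ranges and other scopes are ignored here.
        if (decision.get("scope") or "").lower() != "ip":
            continue
        if (decision.get("type") or "").lower() not in EDGE_TYPES:
            continue
        ips.add(decision["value"])
    return sorted(ips)


if __name__ == "__main__":
    # Made-up sample data using documentation address ranges.
    sample = [
        {"origin": "crowdsec", "type": "ban", "scope": "Ip", "value": "203.0.113.5"},
        {"origin": "CAPI", "type": "ban", "scope": "Ip", "value": "192.0.2.10"},
        {"origin": "cscli", "type": "captcha", "scope": "Ip", "value": "198.51.100.7"},
        {"origin": "crowdsec", "type": "ban", "scope": "Range", "value": "192.0.2.0/24"},
    ]
    print(edge_ban_ips(sample))
```

Keeping the filter a pure function separates the "what goes on the list" decision from the HTTP calls to LAPI and Cloudflare, so the rule can be checked without either service.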
+ +**Whitelist** (`stacks/crowdsec/whitelist.yaml`): a CrowdSec whitelist covers +RFC1918 + the tailnet + internal CIDRs (plus one specific external IP), so +internal users are never enforced. Internal access uses split-horizon DNS +straight to Traefik, and direct internal clients are RFC1918 — both whitelisted. + +#### Why the Traefik bouncer plugin was removed + +Enforcement used to run as an inline Traefik middleware — the +`crowdsec-bouncer-traefik-plugin` (a Go plugin interpreted by Traefik's Yaegi runtime), which queried LAPI on every +request and could serve a Cloudflare Turnstile captcha for soft remediations. +On **Traefik 3.7.5 the Yaegi handler was never invoked**, so the bouncer was +registered but enforced **nothing** despite appearing healthy. Rather than chase +the Yaegi runtime, the whole plugin path was **removed** (2026-06): the plugin +static config + initContainer download, the `crowdsec` Middleware CRD, the +`captcha.html` template + its ConfigMap and volume mount, and the Cloudflare +Turnstile widget (`cloudflare_turnstile_widget.crowdsec_captcha`). It was +replaced by the two out-of-band surfaces above, which add zero per-request +latency and fail open. (The earlier `crowdsec-cf-sync` cursor-pagination / +IP-List-capacity issues are also moot now that CAPI is excluded from the edge +list and dropped in-kernel instead.) **Metabase** (disabled by default): - Dashboard for CrowdSec analytics @@ -189,7 +264,7 @@ Beads epic: `code-8ywc`. **Status: partially live as of 2026-05-18.** | W1.2 Vault `x_forwarded_for_authorized_addrs = 10.10.0.0/16` | **LIVE** — applied via `tg apply -target=helm_release.vault` on 2026-05-18; all 3 vault pods restarted cleanly | | W1.2 Vault audit log shipping to Loki | **LIVE** — `audit-tail` sidecar in vault pods + Alloy DaemonSet ships to Loki with `container="audit-tail"`. Verified via `{namespace="vault",container="audit-tail"}` LogQL query. 
| | W1.1 K8s API audit policy + shipping | **LIVE** — kube-apiserver audit policy was already configured (Metadata level, `/var/log/kubernetes/audit.log`, 7d retention). Alloy DaemonSet now tolerates control-plane taint, scrapes the audit log file, ships to Loki with `job=kubernetes-audit`. K2-K9 alert rules in Loki ruler. | -| W1.3 Source-IP anomaly rules (K9, V7, S1) | **LIVE** (K9, V7); **S1 PENDING** — fires once promtail/Alloy on PVE host ships sshd journal with `job=sshd-pve`. | +| W1.3 Source-IP anomaly rules (K9, V7, S1) | **LIVE** (K9, V7, S1). **S1 activated 2026-06-10** — promtail on the PVE host now ships the journal to Loki (`scripts/pve-promtail.yaml`); sshd auth lands as `job=sshd-pve` (the S1 data source). The same shipper carries snoopy `execve()` command audit as `{job="pve-journal", identifier="snoopy"}` (forensic, not alerting). Deployed because emo's agent was given root SSH to the host (shared key) — see `docs/architecture/monitoring.md` → "External host: pve". | | W1.4 Kyverno security policies → Enforce | **LIVE** — 3 policies in Enforce mode with 35-namespace exclude list. | | W1.5 Kyverno trusted-registries → Enforce | **LIVE** — explicit allowlist (15 registries + 6 DockerHub library bare names + 56 DockerHub user repos). Verified by admission dry-run: `evilcorp.example/malware:v1` BLOCKED, `alpine:3.20` and `docker.io/library/alpine:3.20` ALLOWED. | | W1.6 Calico observe-phase (pilot: recruiter-responder) | **LIVE** (2026-05-19) — GlobalNetworkPolicy `wave1-egress-observe-recruiter-responder` with rules `[action:Log, action:Allow]`. FelixConfiguration.flowLogsFileEnabled approach abandoned (Calico Enterprise-only field, rejected by OSS v3.26). Log action emits iptables LOG with prefix `calico-packet: ` → kernel → journald → Alloy → Loki. Verified: `{job="node-journal"} \|~ "calico-packet"` returns real packet metadata (SRC/DST/PROTO). Expand to more namespaces by adding to `namespaceSelector`. 
| @@ -205,7 +280,7 @@ Response model: **(I) Slack-only, daily skim.** All security alerts land in a ne |---|---|---|---| | K8s API audit log | Custom audit policy on kube-apiserver: drop `get`/`list`/`watch` at `None` for most resources, log writes at `Metadata`, secret reads at `Metadata`, `exec`/`portforward` at `RequestResponse`, exclude kubelet+controller-manager noise. Codified in `stacks/infra` kubeadm config templating. | Alloy DaemonSet tails `/var/log/kubernetes/audit/*.log` | `job=kube-audit` | | Vault audit log | `file` audit device on existing Vault PVC. Vault listener config sets `x_forwarded_for_authorized_addrs` trusting Traefik pod CIDR so `remote_addr` is the real client IP, not Traefik's. | Alloy tails audit log file | `job=vault-audit` | -| PVE sshd auth log | journald `_SYSTEMD_UNIT=ssh.service` | promtail systemd unit on Proxmox host (192.168.1.127) | `job=sshd-pve` | +| PVE sshd auth log | journald (`_SYSTEMD_UNIT=ssh.service`, `SYSLOG_IDENTIFIER=sshd-session`); promtail relabels `identifier=~"sshd.*"` → `job=sshd-pve` | promtail systemd unit on Proxmox host (192.168.1.127), `scripts/pve-promtail.yaml` — **LIVE 2026-06-10** | `job=sshd-pve` | | Calico flow log | `flowLogsFileEnabled: true` in Calico Felix config | Alloy (cluster-wide) | `job=calico-flow` (W1.6 only) | #### Alert rules (16 total) @@ -255,6 +330,10 @@ Routed via **Loki ruler → Alertmanager → `#security` Slack receiver**. Same **Policy: no public-IP access ever.** Vault, kube-apiserver, PVE sshd must transit a trusted LAN or Headscale. Anything else fires an alert. +**Documented exception — break-glass SSH (2026-06-11):** one deliberate carve-out. The Proxmox host's sshd listens on a WAN-exposed `:52222` (edge-router forward), **key-only**, trusting only a dedicated break-glass key (`Match LocalPort` → `authorized_keys.breakglass`), rate-limited (iptables hashlimit) + fail2ban. 
It is intentionally reachable from the public internet so it survives a cluster/tunnel outage with no dependency on the cluster — the one case the "must transit LAN/Headscale" rule cannot serve. Brute-force-proof (no password); the trade is Shodan-visibility. As-built: `docs/runbooks/breakglass-ssh.md`; rationale: `docs/plans/2026-06-11-breakglass-ssh-redesign-design.md`. (Replaced the 2026-05-30 port-knock variant, which was non-scannable but had a circular Vault dependency that caused a lockout.) + +**Two privileged footholds for the warm break-glass UI (2026-06-12):** the in-cluster `claude-breakglass` service (`breakglass.viktorbarzin.me`, warm case = devvm wedged, cluster healthy) holds one ed25519 key (Vault `secret/claude-breakglass/ssh_key`) authorising: (1) a `breakglass` user on the **devvm** with NOPASSWD sudo (`from="10.0.20.0/24"` — the Calico-SNAT node subnet); (2) a **PVE** `authorized_keys` entry pinned to `command="/usr/local/bin/breakglass-pve",restrict,from="192.168.1.2"` (pfSense's inter-VLAN SNAT IP) that only runs the verbs `status|forensics|reset|stop|start|cycle` against VM 102. The key is reachable ONLY by the breakglass pod (own namespace, no Vault role, ESO-synced); the shared `claude-agent` pod's `terraform-state` Vault policy is explicitly DENIED `secret/claude-breakglass/*`. Reset is autonomous (the agent may fire it), forensics-first. Reachable via Authentik or the basic-auth fallback — LAN-routed, not WAN-exposed. Runbook: `docs/runbooks/breakglass-ui.md`; ADR: `claude-agent-service/docs/adr/0001-breakglass-security-architecture.md`. + #### Why no canary tokens Original plan included canary tokens (fake K8s Secret, Vault KV path, PVE file, sinkhole hostname). Rejected because Viktor routinely greps `secret/viktor` (135 keys) and lists `kubectl get secret -A` — any read-trigger canary self-fires. 
Use-based canaries (zero-RBAC SA tokens with audit alerts on use) were also considered but rejected in favor of cleaner source-IP anomaly detection (K9, V7) on REAL tokens — same threat model, no fake-token operational burden. @@ -326,10 +405,12 @@ Beads: `code-8ywc` W1.6 + W1.7. **Status: planned.** | Path | Purpose | |------|---------| -| `stacks/crowdsec/` | CrowdSec LAPI, agent, bouncer config | +| `stacks/crowdsec/` | CrowdSec LAPI, agent config + `whitelist.yaml` | +| `stacks/crowdsec/modules/crowdsec/firewall_bouncer.tf` | cs-firewall-bouncer DaemonSet (in-kernel nftables drop, direct hosts) | +| `stacks/rybbit/crowdsec_edge.tf` + `lapi_kv_sync.py` | Cloudflare IP-List + WAF block rule + LAPI→CF sync CronJob (proxied hosts) | | `stacks/kyverno/` | Kyverno deployment + policies | | `stacks/poison-fountain/` | Anti-AI service + CronJob | -| `stacks/platform/modules/traefik/middleware.tf` | Security middleware definitions | +| `stacks/traefik/modules/traefik/middleware.tf` | Security middleware definitions (no longer includes a CrowdSec bouncer) | | `stacks/platform/modules/ingress_factory/` | Per-service security toggles | ### Vault Paths @@ -439,7 +520,11 @@ spec: **Fix**: 1. Check LAPI decisions: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions list` 2. Remove ban: `kubectl exec -it crowdsec-lapi-0 -- cscli decisions delete --ip <IP>` -3. Whitelist if needed: Add to `stacks/crowdsec/whitelist.yaml` + — the in-kernel drop clears as soon as `cs-firewall-bouncer` reconciles (direct + hosts); for proxied hosts the `crowdsec-cf-sync` CronJob removes it from the + `crowdsec_ban` CF list within ~2 min. +3. Whitelist if needed: Add to `stacks/crowdsec/whitelist.yaml` (RFC1918 + tailnet + + internal CIDRs are already whitelisted, so internal clients are never banned). 
### Kyverno Policy Blocking Deployment diff --git a/docs/architecture/storage.md b/docs/architecture/storage.md index 486246a6..4b501598 100644 --- a/docs/architecture/storage.md +++ b/docs/architecture/storage.md @@ -17,7 +17,7 @@ All services storing sensitive data were migrated to `proxmox-lvm-encrypted` on - **HDD NFS**: `/srv/nfs` on ext4 LV `pve/nfs-data` (4TB) — bulk media and backup targets - **SSD NFS**: `/srv/nfs-ssd` on ext4 LV `ssd/nfs-ssd-data` (100GB) — high-performance data (Immich ML) -Both `StorageClass: nfs-truenas` and `StorageClass: nfs-proxmox` point to the Proxmox host and are functionally identical. The `nfs-truenas` name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster. +`StorageClass: nfs-truenas` is the **only** NFS StorageClass and points to the Proxmox host. The name is historical — it was retained because StorageClass names are immutable on bound PVs (48 PVs reference it) and renaming would force mass PV churn across the cluster. (A short-lived parallel `nfs-proxmox` StorageClass was removed on 2026-04-25, commit 484b4c71, during the vault NFS-hostile migration.) **Backup storage (sda)**: 1.1TB RAID1 SAS disk, VG `backup`, LV `data` (ext4), mounted at `/mnt/backup` on PVE host. Dedicated backup disk for weekly PVC file backups, auto SQLite backups, pfSense backups, and PVE config. NFS data syncs directly to Synology via inotify change tracking (not stored on sda). Independent of live storage (sdc). 
@@ -47,7 +47,7 @@ graph TB end subgraph K8s["Kubernetes Cluster"] - CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-proxmox (+ legacy nfs-truenas)<br/>soft,timeo=30,retrans=3"] + CSI_NFS["nfs-csi driver<br/>StorageClass: nfs-truenas (historical name)<br/>soft,timeo=30,retrans=3"] CSI_PVE["Proxmox CSI plugin<br/>StorageClass: proxmox-lvm<br/>StorageClass: proxmox-lvm-encrypted"] NFS_PV["NFS PersistentVolumes<br/>RWX, ~100 volumes"] @@ -85,8 +85,7 @@ graph TB | Proxmox NFS (HDD) | LV `pve/nfs-data`, 4TB ext4 | 192.168.1.127:/srv/nfs | Bulk NFS data for all services | | Proxmox NFS (SSD) | LV `ssd/nfs-ssd-data`, 100GB ext4 | 192.168.1.127:/srv/nfs-ssd | High-performance data (Immich ML) | | nfs-csi | Helm chart | Namespace: nfs-csi | NFS CSI driver | -| StorageClass `nfs-proxmox` | RWX, soft mount | Cluster-wide | NFS storage, points to Proxmox host | -| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | **Historical name** — functionally identical to `nfs-proxmox`, points to the Proxmox host. Kept because SC names are immutable on 48 bound PVs. | +| StorageClass `nfs-truenas` | RWX, soft mount | Cluster-wide | The only NFS StorageClass — **historical name**, points to the Proxmox host. Kept because SC names are immutable on 48 bound PVs. (Sibling `nfs-proxmox` SC removed 2026-04-25, commit 484b4c71.) | | TF module `nfs_volume` | `modules/kubernetes/nfs_volume/` | Infra repo | Static NFS PV/PVC factory | | ~~TrueNAS VM~~ | **DECOMMISSIONED 2026-04-13** | Was VM 9000 at 10.0.10.15 | Replaced by Proxmox NFS. VM still in stopped state pending deletion. | | ~~democratic-csi-iscsi~~ | **REMOVED** | Was namespace: iscsi-csi | Replaced by Proxmox CSI (2026-04-02) | @@ -113,7 +112,7 @@ graph TB **Note**: Some legacy PVs still reference `/mnt/main/<service>` paths. These work via compatibility symlinks/bind-mounts on the Proxmox host. New PVs should use `/srv/nfs/<service>` or `/srv/nfs-ssd/<service>`. 
-**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-proxmox` StorageClass (or the legacy `nfs-truenas` for existing PVs) via PVCs. +**CRITICAL**: Never use inline `nfs {}` blocks in pod specs — they default to `hard,timeo=600` which causes 10-minute hangs on network issues. Always use the `nfs-truenas` StorageClass (historical name; it points at the Proxmox host) via PVCs. ### Block Storage Flow (Proxmox CSI) — NEW diff --git a/docs/plans/2026-05-30-breakglass-ssh-access-design.md b/docs/plans/2026-05-30-breakglass-ssh-access-design.md new file mode 100644 index 00000000..1b8b2070 --- /dev/null +++ b/docs/plans/2026-05-30-breakglass-ssh-access-design.md @@ -0,0 +1,285 @@ +# Break-Glass SSH Access — Design + +> **⚠️ SUPERSEDED 2026-06-11** by `2026-06-11-breakglass-ssh-redesign-design.md`. +> The port-knock was removed: it added no real security (the SSH key already +> makes the port brute-force-proof) and its knock sequence lived only in +> in-cluster Vault — unreachable in the exact cold/away scenario break-glass +> exists for, which caused a real lockout. Retained for history. As-built: +> `docs/runbooks/breakglass-ssh.md`. + +- **Date**: 2026-05-30 +- **Status**: Draft — pending user review +- **Owner**: Viktor +- **Related**: `docs/architecture/vpn.md`, `docs/architecture/security.md`, `infra/.claude/CLAUDE.md` (Security Posture Wave 1) + +## 1. Goal + +Provide a **cold, brute-force-proof backdoor onto the home LAN from the public +internet** for the case where the Kubernetes cluster and every cluster-hosted +remote-access path are down (cloudflared, Headscale/Tailscale, in-cluster +WireGuard), but the **Proxmox host, pfSense, and the edge router are still up**. + +### Hard requirements (from the user) + +1. **Cold-survivable**: must work when the k8s cluster + all its tunnels are + down. 
The path must touch **nothing in the cluster** (no Authentik, Traefik, + Technitium/AdGuard DNS, cloudflared). +2. **Full LAN access** once connected (SSH to Proxmox host, pfSense, Synology, + k8s API, etc.). +3. **No brute force**: no password-guessable surface. +4. **Client uses only software pre-installed on Linux/macOS** — no WireGuard / + Tailscale / fwknop client install. Stock `ssh` (+ `bash`) only. +5. **Minimal effort**, and ideally **honor the locked Wave 1 policy** + (`no public-IP access — … PVE sshd must transit LAN or Headscale`). + +## 2. Decision + +**Key-only SSH to the Proxmox host, gated behind a UDP port-knock.** + +- The Proxmox host (`192.168.1.127`) is the entry point — it's the recovery box + (`virsh`/`qm` to reboot the pfSense VM, `kubectl`, full hypervisor control) + and it sits directly on the `192.168.1.0/24` segment, so the path **does not + traverse pfSense or the cluster** — it survives a wedged pfSense too, not just + a down cluster. +- SSH is the only externally-usable remote tool **pre-installed on every + Linux/macOS box**, satisfying requirement 4. +- **Key-only auth** (no passwords anywhere) makes password brute force + impossible → requirement 3. +- A **port-knock** keeps the external SSH port **closed/invisible to scanners** + until a knock sequence is sent. This restores the "no standing public service" + property we'd have had with WireGuard and keeps us within the **intent** of the + Wave 1 policy (PVE sshd is not internet-scannable). The knock is sent with a + **bash `/dev/udp` one-liner** — zero install. + +### Alternatives rejected + +| Option | Why rejected | +|---|---| +| WireGuard road-warrior on pfSense | Needs a WireGuard **client app** (fails requirement 4). Was the prior design. | +| Tailscale / Headscale | Client app + control plane is in-cluster (dies cold). 
| +| Browser → web admin UI (Proxmox/pfSense/Synology) | "Pre-installed" (browser) but password-based → brute-forceable, far larger attack surface than a key-only SSH port. | +| Plain **exposed** key-only SSH (no knock) | Brute-force-proof, but a **publicly visible** service (Shodan-catalogued) and a standing violation of the Wave 1 "no public PVE sshd" policy. The knock removes the standing exposure for ~15 min more setup. | +| fwknop / cryptographic SPA | Strongest hiding, but needs a **client install** (fails requirement 4). | + +## 3. Architecture + +``` + Your laptop (anywhere) — stock ssh + bash, nothing installed + │ (1) UDP knock sequence → bash: echo > /dev/udp/<pub>/<port> (instant, no handshake) + │ (2) ssh -p 52222 root@<pub> + ▼ + Edge router 192.168.1.1 (the box the stored password unlocks) + │ forwards: UDP <k1>,<k2>,<k3> + TCP 52222 → 192.168.1.127 + ▼ + Proxmox host 192.168.1.127 ← path bypasses pfSense entirely + ├─ knockd (libpcap) sees the UDP knock → opens TCP 52222 for your source IP (30 s) + ├─ sshd listens on :22 (LAN admin, always) AND :52222 (external, knock-gated), key-only + └─ once in: virsh/qm (reboot pfSense VM), kubectl, ssh -J / ssh -D → full LAN +``` + +**Why it meets "cold + full LAN":** the host is up by definition of the chosen +failure mode; nothing in the path depends on k8s, pfSense, or DNS. From the host +you reach the whole LAN either directly (it's on `192.168.1.0/24` and routes to +the VLANs via pfSense when pfSense is up) or by using SSH's built-in +`-J`/`-D` — both stock, no install. + +## 4. 
Components + +### 4.1 Edge router @ 192.168.1.1 (manual, in the browser) +Add port-forwards (same place the existing `51821` WireGuard forward lives): +- **TCP 52222 → 192.168.1.127:52222** (external SSH; no port rewrite — see §4.3 rationale) +- **UDP `<k1>`, `<k2>`, `<k3>` → 192.168.1.127** (knock ports; actual numbers in Vault) + +If the router supports a **port range** forward, a single range covering the +knock ports + 52222 is tidier than four rules. + +> **Verify (#1 implementation check):** whether `.1` **preserves the source IP** +> on forwarded packets (typical DNAT) or **SNATs** them to `192.168.1.1`. Test by +> knocking + connecting from an external network and checking `/var/log/auth.log` +> + `knockd` syslog for the observed source IP. The design works either way (see +> §4.3), but it determines knock granularity. + +### 4.2 SSH keys & Vault layout +- Mint a **dedicated** break-glass keypair (ed25519), separate from + `secret/viktor/proxmox_ssh_key`, so it's independently revocable and clearly + labelled. +- **Public key** → `/root/.ssh/authorized_keys` on the Proxmox host (no `from=` + restriction — break-glass is from-anywhere; the knock + key are the gate). +- **Private key** → Vault `secret/viktor/breakglass_ssh_privkey` (for + re-provisioning) **and** on your laptop at `~/.ssh/breakglass_ed25519` + (chmod 600). +- **Knock sequence** → Vault `secret/viktor/breakglass_knock_sequence` (kept out + of git — obscurity value only; see §5). + +### 4.3 Proxmox host — sshd hardening +`/etc/ssh/sshd_config.d/10-breakglass.conf`: +``` +Port 22 +Port 52222 +PasswordAuthentication no +KbdInteractiveAuthentication no +PubkeyAuthentication yes +PermitRootLogin prohibit-password # key-only root (PVE recovery norm) +MaxAuthTries 3 +LoginGraceTime 20 +``` +- sshd listens on **:22 (LAN admin, always allowed)** and **:52222 (external, + knock-gated)**. 
Using a dedicated external port (not a DNAT rewrite to 22) + lets the firewall distinguish LAN vs external **regardless of `.1` SNAT + behaviour** (§4.1) — LAN admin on `:22` is never affected by the gate. +- **Default to root key-only** for recovery practicality. *Alternative for + review:* a dedicated `breakglass` sudo user instead of root. + +> **Verify (#2):** key login already works for your normal access **before** +> `PasswordAuthentication no` is committed — no lockout. (Backup rsync jobs +> already use keys, so this is likely already effectively true.) + +### 4.4 Host firewall (knock gate) +Default-drop the external SSH port; knockd punches a per-source hole. LAN admin +(`:22`) and established sessions are untouched: +``` +# allow established / related +iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT +# LAN admin + backups: SSH on :22 always allowed +iptables -A INPUT -p tcp --dport 22 -j ACCEPT +# external SSH on :52222 closed by default — knockd opens it per-source +iptables -A INPUT -p tcp --dport 52222 -j DROP +``` +- **knockd uses libpcap**, so it sees the UDP knock packets even though iptables + drops them — the knock ports stay **silent/closed** to scanners. +- **pve-firewall coexistence (verify #3):** confirm whether the PVE firewall is + enabled. If it is, express these rules through it (or a dedicated chain) so a + pve-firewall reload doesn't wipe the knockd-managed rule. Default PVE installs + often have it off at datacenter level. + +### 4.5 knockd +`apt install knockd` (Debian/PVE). 
`/etc/knockd.conf`: +``` +[options] + UseSyslog + Interface = vmbr0 # the 192.168.1.127 interface + +[breakglass] + sequence = <k1>:udp,<k2>:udp,<k3>:udp # real ports from Vault + seq_timeout = 10 + start_command = /usr/sbin/iptables -I INPUT 1 -s %IP% -p tcp --dport 52222 -j ACCEPT + cmd_timeout = 30 + stop_command = /usr/sbin/iptables -D INPUT -s %IP% -p tcp --dport 52222 -j ACCEPT +``` +- **UDP knock** → the client knock is fire-and-forget (`/dev/udp`), no TCP-hang + on the client (a TCP knock to a dropped port would block until timeout). +- Opens `:52222` for the knocker's source IP for **30 s**; an SSH session + established within that window **persists** via conntrack ESTABLISHED after the + rule is removed. Enable + start the `knockd` service. + +### 4.6 fail2ban (defense-in-depth) +`apt install fail2ban`, sshd jail (watches `auth.log`, bans repeat failures). +Local to the host, **no cluster dependency**. Catches anything that gets past the +knock to the sshd listener. + +### 4.7 Client side (laptop — stock tools only) +`~/.ssh/config`: +``` +Host breakglass + HostName <public-ip-or-dyndns> + Port 52222 + User root + IdentityFile ~/.ssh/breakglass_ed25519 +``` +Knock + connect — a shell function using **bash builtins only** (works on +macOS `/bin/bash` + Linux; UDP send is instant): +```sh +bg() { + local host=<public-ip-or-dyndns> + for p in <k1> <k2> <k3>; do echo -n x > "/dev/udp/$host/$p"; sleep 0.4; done + sleep 0.5 + ssh breakglass "$@" +} +``` +- **Full LAN, no install:** `ssh -J breakglass <internal-host>` (jump), or + `ssh -D 1080 breakglass` then point a browser/`curl` at SOCKS5 `127.0.0.1:1080` + to reach any internal IP. From the host shell you already have everything. +- *Optional fully-transparent variant:* fold the knock into a `ProxyCommand` in + the `Host breakglass` block so plain `ssh breakglass` knocks automatically. 
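
A minimal sketch of the optional `ProxyCommand` variant mentioned above, assuming `nc` is present on the client (it ships with macOS and most Linux distributions, but this design does not verify that). `render_breakglass_block` is a hypothetical helper, not part of the design; it only prints an ssh_config block so the knock ports can be passed in as arguments rather than hard-coded.

```shell
# Hypothetical helper (illustration only): prints a "Host breakglass" block whose
# ProxyCommand sends the UDP knock, waits briefly, then hands the TCP stream to nc.
# Usage: render_breakglass_block HOST PORT1 PORT2 PORT3
render_breakglass_block() {
  local host=$1
  shift
  local ports="$*"
  # Unquoted heredoc: ${host} and ${ports} expand now; \$p stays literal so the
  # inner bash expands it at connect time. %h and %p are ssh_config tokens.
  cat <<EOF
Host breakglass
    HostName ${host}
    Port 52222
    User root
    IdentityFile ~/.ssh/breakglass_ed25519
    ProxyCommand bash -c 'for p in ${ports}; do echo -n x > "/dev/udp/%h/\$p"; sleep 0.4; done; sleep 0.5; exec nc %h %p'
EOF
}
```

Appending the output to `~/.ssh/config` (with the real host and ports as arguments) would let plain `ssh breakglass` knock first. The trade-off is that the knock ports then live in a local file, which stays out of git but is no longer Vault-only.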
+ +### 4.8 Cold-scenario IP cheat sheet (DNS is down when the cluster is down) +Technitium + AdGuard are in-cluster, so `.lan` resolution is gone in a cold +event. Use IPs: + +| Host | IP | +|---|---| +| Proxmox host | `192.168.1.127` (also `10.0.10.1` VLAN10) | +| pfSense | `10.0.20.1` (WAN `192.168.1.2`) | +| k8s API server | `10.0.20.100` | +| Synology NAS | `192.168.1.13` | +| Edge router | `192.168.1.1` | +| Traefik LB / MetalLB | `10.0.20.200` / `10.0.20.203` | + +## 5. Security analysis + +- **Brute force: solved.** No password auth anywhere → password guessing is + impossible; key brute force is cryptographically infeasible. +- **Invisibility / Wave 1 intent: satisfied.** The external SSH port is + default-dropped and the knock ports are pcap-sniffed (never answered), so a + scanner sees a closed/silent host — PVE sshd is **not internet-scannable**, + honouring the spirit of "no public-IP access to PVE sshd". +- **The knock is obscurity, not cryptography.** A port-knock sequence is + plaintext and replayable by a passive on-path observer. **The SSH key is the + real access control** — the knock only removes the standing/scannable surface. + (Cryptographic SPA = fwknop, rejected for needing a client install.) Treat the + knock sequence as a secret-ish convenience, not a second cryptographic factor. +- **Residual risks** (none are brute force): + 1. An sshd **0-day** exploitable during the 30 s open window → mitigation: keep + PVE patched; short `cmd_timeout`; fail2ban. + 2. **Private key theft** → mitigation: key has a passphrase; revoke by removing + the line from `authorized_keys`. + 3. If `.1` **SNATs** (§4.1), the 30 s window opens `:52222` for the shared + `192.168.1.1` source — anyone else arriving via `.1` in that window could + reach the sshd banner, but still needs your key. Mitigated by the short + window + key-only + fail2ban. 
+- **Deliberate, documented exception** to the Wave 1 "no public-IP access" + policy, scoped to this single knock-gated port. To be recorded in + `security.md` + the Wave 1 note in `infra/.claude/CLAUDE.md` on implementation. + +## 6. What's automated vs manual + +- **I do**: generate the keypair + knock sequence, store them in Vault, produce + the exact `sshd_config.d` snippet, `knockd.conf`, iptables rules, the client + `~/.ssh/config` + `bg()` function, and write the runbook + doc updates. +- **Manual / careful (live devices)**: the `.1` edge-router forwards are done by + you in the browser (out-of-Terraform, live device). The Proxmox host changes + (sshd, knockd, iptables, fail2ban) are applied over SSH **with key-login + verified first** to avoid lockout; pfSense is **not** touched. None of this is + a `tg apply` — pfSense and the edge router are not Terraform-managed. + +## 7. Testing & verification +1. From an **external** network (phone hotspot): run `bg`; confirm knockd syslog + shows the sequence + opens `:52222`; SSH succeeds. +2. **Without** knocking: `ssh -p 52222` from external → connection refused/timed + out (port closed). A plain port scan of `52222` + the knock ports → silent. +3. LAN admin on `:22` still works (no regression); backup rsync jobs unaffected. +4. Full-LAN: `ssh -J breakglass 10.0.20.1` (pfSense) and `ssh -D 1080` SOCKS to + an internal IP. +5. Determine `.1` source-IP behaviour (verify #1) and adjust knock granularity + note accordingly. + +## 8. Failure modes & rotation +- **Proxmox host down** (not just cluster): this path is gone — that's the + out-of-band tier (serial/IPMI/separate device), explicitly **out of scope**. +- **`.1` router config reset**: forwards lost → re-add from this doc; consider + exporting the `.1` config for backup. +- **Public IP change**: use a hostname endpoint (Cloudflare-resolved) so it + auto-follows; keep the raw IP as fallback. 
+- **Key/knock compromise**: remove the `authorized_keys` line (kills access + instantly); rotate the knock sequence in `knockd.conf` + Vault. + +## 9. Out of scope +- Host-down / site-down out-of-band access (IPMI, LTE) — a future tier. +- Phone access (would need an SSH **app**, e.g. Termius — outside the + "pre-installed Linux/macOS" constraint; laptop is the target). + +## 10. Docs to update on implementation +- `docs/architecture/vpn.md` — add a "Break-glass SSH" section. +- `docs/architecture/security.md` + Wave 1 note in `infra/.claude/CLAUDE.md` — + record the deliberate knock-gated exception to "no public PVE sshd". +- New runbook `docs/runbooks/breakglass-ssh.md` — connect + rotate procedure. diff --git a/docs/plans/2026-05-30-breakglass-ssh-access-plan.md b/docs/plans/2026-05-30-breakglass-ssh-access-plan.md new file mode 100644 index 00000000..c4db48e2 --- /dev/null +++ b/docs/plans/2026-05-30-breakglass-ssh-access-plan.md @@ -0,0 +1,395 @@ +# Break-Glass SSH Access — Implementation Plan + +> **⚠️ SUPERSEDED 2026-06-11** by the redesign in +> `2026-06-11-breakglass-ssh-redesign-design.md` (port-knock removed). Retained +> for history. As-built: `docs/runbooks/breakglass-ssh.md`. + +> **Execution model:** This plan mutates **live devices** (the Proxmox host's sshd, and the TP-Link edge router). It is **human-gated**, NOT for autonomous subagents. Each live step is applied with anti-lockout verification, and every edge-router change is made by Viktor (or by the browse tool with explicit per-change approval). Steps use `- [ ]` checkboxes. + +**Goal:** Stand up a cold, brute-force-proof SSH backdoor onto the LAN — key-only SSH to the Proxmox host (`192.168.1.127`) gated behind a UDP port-knock — then decommission the legacy Synology SSH exposure and tighten UPnP. + +**Architecture:** Edge router `.1` forwards a UDP knock sequence + TCP `52222` to the Proxmox host. 
The host runs `knockd` (libpcap) which opens `52222` for the knocker's IP for 30 s; `sshd` listens on `:22` (LAN, always) and `:52222` (external, knock-gated), key-only. Path bypasses pfSense + the k8s cluster. Client uses only stock `ssh` + `bash`. + +**Tech stack:** OpenSSH, knockd, iptables, fail2ban (Debian/PVE host); TP-Link Archer AX6000 UI (edge router); HashiCorp Vault (secrets); Docker (`/home/wizard/tools/insecure-browse` for any router automation). + +**Reference:** design doc `2026-05-30-breakglass-ssh-access-design.md`. Router audit (current `.1` forwards) recorded in task notes + `/home/wizard/tools/insecure-browse/out/`. + +--- + +## Pre-flight (read before starting) + +- **Anti-lockout rule:** never disable password auth or reload sshd without an *already-open* root session held + a *new* session verified. Applies to every host step. +- **Live-router rule:** all `.1` changes are made by Viktor in the UI (or browse-tool with explicit approval). No blind automation of router writes. +- **Ordering rule:** the legacy Synology SSH forward (Rule 6) is **not** closed until break-glass is verified working from an external network (Phase 5 gates on Phase 4-pre verification). +- **Host access:** PVE host reached as `ssh root@192.168.1.127` from the LAN. +- **Commit gate:** the infra repo currently has unmerged conflicts + an in-progress provider/backend migration. Do NOT commit (Phase 6) until Viktor confirms the repo is clean. + +--- + +## Phase 0 — Generate secrets (no live changes) + +### Task 0.1: Break-glass SSH keypair + +**Files:** none in repo (secrets → Vault).
+ +- [ ] **Step 1: Generate a dedicated ed25519 keypair (with passphrase)** + +```bash +mkdir -p ~/.ssh +ssh-keygen -t ed25519 -a 100 -C "breakglass-$(date +%Y%m%d)" -f ~/.ssh/breakglass_ed25519 +# set a passphrase when prompted (so a stolen laptop key isn't instantly usable) +``` + +- [ ] **Step 2: Store the private key + public key in Vault** + +```bash +vault kv patch secret/viktor \ + breakglass_ssh_privkey=@$HOME/.ssh/breakglass_ed25519 \ + breakglass_ssh_pubkey="$(cat ~/.ssh/breakglass_ed25519.pub)" +``` + +- [ ] **Step 3: Verify the keys are retrievable** + +```bash +vault kv get -field=breakglass_ssh_pubkey secret/viktor +``` +Expected: prints the `ssh-ed25519 AAAA... breakglass-YYYYMMDD` line. + +### Task 0.2: Knock sequence + +- [ ] **Step 1: Generate 3 random UDP knock ports** + +```bash +KNOCK="$(shuf -i 20000-60000 -n 3 | paste -sd, -)"; echo "$KNOCK" +``` + +- [ ] **Step 2: Store the sequence in Vault (keep it out of git)** + +```bash +vault kv patch secret/viktor breakglass_knock_sequence="$KNOCK" +vault kv get -field=breakglass_knock_sequence secret/viktor +``` +Expected: prints three comma-separated ports, e.g. `28411,49027,33180`. + +--- + +## Phase 1 — Proxmox host: key-only SSH + knock gate (LIVE host change) + +> Run everything in this phase **on the PVE host**. Keep your current `ssh root@192.168.1.127` session open the entire phase. + +### Task 1.1: Pre-checks (no changes yet) + +- [ ] **Step 1: Confirm key login already works (anti-lockout baseline)** + +From your laptop, with the break-glass key authorized later — for now confirm your *existing* admin key works: +```bash +ssh -o PasswordAuthentication=no root@192.168.1.127 'echo KEY_LOGIN_OK' +``` +Expected: `KEY_LOGIN_OK` (key auth works → safe to disable passwords later). If it prompts for a password, STOP and fix key auth first. 
+ +- [ ] **Step 2: Check whether the PVE firewall is active (coexistence)** + +```bash +ssh root@192.168.1.127 'pve-firewall status 2>/dev/null; iptables -S | head' +``` +Expected: note whether `Status: enabled/running`. If **enabled**, add the Phase-1.4 rules via PVE's firewall (Datacenter→Firewall) instead of raw iptables, OR disable it if unused. If **disabled** (common), proceed with the raw-iptables approach below. + +### Task 1.2: Authorize the break-glass key + +- [ ] **Step 1: Append the break-glass public key to root's authorized_keys** + +```bash +PUB="$(vault kv get -field=breakglass_ssh_pubkey secret/viktor)" +ssh root@192.168.1.127 "grep -qF '$PUB' /root/.ssh/authorized_keys || echo '$PUB' >> /root/.ssh/authorized_keys" +``` + +- [ ] **Step 2: Verify break-glass key logs in (on :22, still default)** + +```bash +ssh -i ~/.ssh/breakglass_ed25519 -o PasswordAuthentication=no root@192.168.1.127 'echo BREAKGLASS_KEY_OK' +``` +Expected: `BREAKGLASS_KEY_OK`. + +### Task 1.3: sshd dual-port + key-only + +**Files:** Create on host: `/etc/ssh/sshd_config.d/10-breakglass.conf` + +- [ ] **Step 1: Write the sshd drop-in** + +```bash +ssh root@192.168.1.127 'cat > /etc/ssh/sshd_config.d/10-breakglass.conf' <<'EOF' +Port 22 +Port 52222 +PasswordAuthentication no +KbdInteractiveAuthentication no +PubkeyAuthentication yes +PermitRootLogin prohibit-password +MaxAuthTries 3 +LoginGraceTime 20 +EOF +``` + +- [ ] **Step 2: Validate config syntax (do NOT reload yet)** + +```bash +ssh root@192.168.1.127 'sshd -t && echo SSHD_CONFIG_OK' +``` +Expected: `SSHD_CONFIG_OK`. If error, fix the drop-in before reloading. + +- [ ] **Step 3: Reload sshd (current session stays alive)** + +```bash +ssh root@192.168.1.127 'systemctl reload ssh && echo RELOADED' +``` +Expected: `RELOADED`. 
+ +- [ ] **Step 4: Verify a NEW key session works on :22 AND :52222 before trusting it** + +```bash +ssh -i ~/.ssh/breakglass_ed25519 -p 22 root@192.168.1.127 'echo OK22' +ssh -i ~/.ssh/breakglass_ed25519 -p 52222 root@192.168.1.127 'echo OK52222' +``` +Expected: `OK22` and `OK52222`. (If `:52222` refuses, sshd may not have bound the second port — check `ss -tlnp | grep ssh` on the host.) Only after both succeed, the old session is safe to drop. + +### Task 1.4: Base firewall (default-drop :52222, allow :22 + established) + +**Files:** Create on host: `/usr/local/sbin/breakglass-firewall.sh`, `/etc/systemd/system/breakglass-firewall.service` + +- [ ] **Step 1: Write the idempotent base-firewall script (dedicated chain)** + +```bash +ssh root@192.168.1.127 'cat > /usr/local/sbin/breakglass-firewall.sh' <<'EOF' +#!/usr/bin/env bash +set -euo pipefail +# Idempotent: (re)build a dedicated BREAKGLASS chain hooked into INPUT. +iptables -N BREAKGLASS 2>/dev/null || iptables -F BREAKGLASS +iptables -C INPUT -j BREAKGLASS 2>/dev/null || iptables -I INPUT 1 -j BREAKGLASS +# established/related always allowed +iptables -A BREAKGLASS -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT +# LAN admin on :22 always allowed (.1 does NOT forward :22 to this host, so :22 is LAN-only) +iptables -A BREAKGLASS -p tcp --dport 22 -j ACCEPT +# external SSH on :52222 closed by default; knockd punches a per-source ACCEPT into INPUT pos 1 +iptables -A BREAKGLASS -p tcp --dport 52222 -j DROP +EOF +ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh' +``` + +- [ ] **Step 2: Write a boot-time systemd unit (persists across reboot, before knockd)** + +```bash +ssh root@192.168.1.127 'cat > /etc/systemd/system/breakglass-firewall.service' <<'EOF' +[Unit] +Description=Break-glass base firewall (SSH knock gate) +After=network-pre.target +Before=knockd.service +Wants=network-pre.target + +[Service] +Type=oneshot +ExecStart=/usr/local/sbin/breakglass-firewall.sh 
+RemainAfterExit=yes + +[Install] +WantedBy=multi-user.target +EOF +ssh root@192.168.1.127 'systemctl daemon-reload && systemctl enable --now breakglass-firewall.service && echo FW_APPLIED' +``` +Expected: `FW_APPLIED`. + +- [ ] **Step 3: Verify LAN :22 still works and :52222 is now dropped from LAN** + +```bash +ssh -i ~/.ssh/breakglass_ed25519 -p 22 root@192.168.1.127 'echo STILL_OK22' # works +nc -z -w3 192.168.1.127 52222 && echo "OPEN(bad)" || echo "CLOSED_AS_EXPECTED" # closed pre-knock +``` +Expected: `STILL_OK22` and `CLOSED_AS_EXPECTED`. + +### Task 1.5: knockd + +**Files:** Create/modify on host: `/etc/knockd.conf`, `/etc/default/knockd` + +- [ ] **Step 1: Install knockd (host daemon — must be native, not Docker, to manage host iptables)** + +```bash +ssh root@192.168.1.127 'apt-get update -qq && apt-get install -y knockd && echo KNOCKD_INSTALLED' +``` +Expected: `KNOCKD_INSTALLED`. + +- [ ] **Step 2: Write knockd.conf with the Vault knock sequence (UDP)** + +```bash +KNOCK="$(vault kv get -field=breakglass_knock_sequence secret/viktor)" # e.g. 28411,49027,33180 +read K1 K2 K3 <<<"$(echo "$KNOCK" | tr ',' ' ')" +ssh root@192.168.1.127 "cat > /etc/knockd.conf" <<EOF +[options] + UseSyslog + Interface = vmbr0 + +[breakglass] + sequence = ${K1}:udp,${K2}:udp,${K3}:udp + seq_timeout = 10 + start_command = /usr/sbin/iptables -I INPUT 1 -s %IP% -p tcp --dport 52222 -j ACCEPT + cmd_timeout = 30 + stop_command = /usr/sbin/iptables -D INPUT -s %IP% -p tcp --dport 52222 -j ACCEPT +EOF +``` + +- [ ] **Step 3: Enable + start knockd** + +```bash +ssh root@192.168.1.127 "sed -i 's/^START_KNOCKD=.*/START_KNOCKD=1/' /etc/default/knockd 2>/dev/null || echo 'START_KNOCKD=1' >> /etc/default/knockd" +ssh root@192.168.1.127 'systemctl enable --now knockd && systemctl is-active knockd' +``` +Expected: `active`. 
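
A small sanity check that could run before the knock sequence is written into `knockd.conf`, sketched under assumptions: `valid_knock_seq` is a hypothetical helper (not part of this plan), and the "three distinct ports" rule simply mirrors how Task 0.2 generates the sequence with `shuf`, not a knockd requirement.

```shell
# Hypothetical helper (illustration only): succeeds when the argument is three
# distinct, comma-separated decimal ports in the range 1-65535.
valid_knock_seq() {
  local seq=$1 p count=0 seen=" "
  local IFS=,                      # split the argument on commas only
  for p in $seq; do
    case $p in
      ''|*[!0-9]*) return 1 ;;     # reject empty or non-numeric fields
    esac
    [ "${#p}" -le 5 ] || return 1  # reject absurdly long digit strings
    [ "$p" -ge 1 ] && [ "$p" -le 65535 ] || return 1
    case "$seen" in
      *" $p "*) return 1 ;;        # reject a repeated port
    esac
    seen="$seen$p "
    count=$((count + 1))
  done
  [ "$count" -eq 3 ]
}
```

Something like `valid_knock_seq "$KNOCK" || { echo "bad knock sequence"; exit 1; }` ahead of Step 2 would stop a malformed Vault value from producing a broken `knockd.conf`.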
+ +### Task 1.6: fail2ban (defense-in-depth) + +- [ ] **Step 1: Install + enable fail2ban with the default sshd jail** + +```bash +ssh root@192.168.1.127 'apt-get install -y fail2ban && systemctl enable --now fail2ban && fail2ban-client status sshd >/dev/null && echo F2B_OK' +``` +Expected: `F2B_OK` (sshd jail active). + +--- + +## Phase 2 — Edge router `.1` forwards (LIVE router change — Viktor executes) + +> In the AX6000 UI: **Advanced → NAT Forwarding → Port Forwarding → Add**. Do NOT remove anything yet. + +- [ ] **Step 1: Add the SSH break-glass forward** + - Name `breakglass-ssh`, External Port `52222`, Internal IP `192.168.1.127`, Internal Port `52222`, Protocol `TCP`, Enable. + +- [ ] **Step 2: Add the three UDP knock forwards** (values from `vault kv get -field=breakglass_knock_sequence secret/viktor`) + - For each of the 3 ports: Name `bg-knock-N`, External Port `<port>`, Internal IP `192.168.1.127`, Internal Port `<same port>`, Protocol `UDP`, Enable. + +- [ ] **Step 3: (verify #1) Determine whether `.1` preserves source IP or SNATs** + +After Phase 3 connects once, on the host check the observed source: +```bash +ssh root@192.168.1.127 'journalctl -u knockd -n 20 --no-pager | grep -i "stage\|open"' +``` +If `%IP%` is a public IP → source preserved (per-IP granularity). If it's `192.168.1.1` → `.1` SNATs (knock opens `:52222` for the shared `.1` source during the 30 s window). Both are acceptable with the dual-port + key-only model; just note it in the runbook. + +--- + +## Phase 3 — Client config (laptop, no live infra change) + +**Files:** Modify `~/.ssh/config`; add a shell function to `~/.zshrc`/`~/.bashrc`. + +- [ ] **Step 1: Add the SSH host block** + +```bash +cat >> ~/.ssh/config <<'EOF' + +Host breakglass + HostName viktorbarzin.ddns.net + Port 52222 + User root + IdentityFile ~/.ssh/breakglass_ed25519 +EOF +``` +(`viktorbarzin.ddns.net` is the router's NO-IP DDNS name — follows the dynamic WAN IP. Raw IP `176.12.22.76` is the fallback.) 
+ +- [ ] **Step 2: Add the knock+connect function** + +```bash +cat >> ~/.zshrc <<'EOF' + +bg() { + local host="viktorbarzin.ddns.net" + local seq; seq="$(vault kv get -field=breakglass_knock_sequence secret/viktor 2>/dev/null || echo "")" + [ -z "$seq" ] && { echo "no knock sequence (vault?)"; return 1; } + for p in $(echo "$seq" | tr ',' ' '); do bash -c 'echo -n x > "/dev/udp/$1/$2"' _ "$host" "$p" 2>/dev/null; sleep 0.4; done + sleep 0.5 + ssh breakglass "$@" +} +EOF +``` +> Note: `/dev/udp` redirection is a bash feature (`/bin/bash` on macOS + Linux) that zsh does not implement, so the function sends each knock through `bash -c`; splitting the sequence with `tr` inside a command substitution also works the same under both shells. If your bash was built without network redirections, use `nc -u -w1 $host $p </dev/null` instead. + +--- + +## Phase 4-pre — Verify break-glass END-TO-END (gates Phase 5) + +> Do this from an **external** network (phone hotspot / tethered), NOT the home LAN. + +- [ ] **Step 1: Without knocking, the port is silent** + +```bash +nc -z -w3 viktorbarzin.ddns.net 52222 && echo "OPEN(bad)" || echo "SILENT_OK" +``` +Expected: `SILENT_OK`. + +- [ ] **Step 2: Knock + connect succeeds** + +```bash +bg 'hostname; echo BREAKGLASS_E2E_OK' +``` +Expected: the PVE hostname + `BREAKGLASS_E2E_OK`. + +- [ ] **Step 3: Full-LAN reach via the jump (no extra install)** + +```bash +ssh -J breakglass root@10.0.20.1 'echo PFSENSE_REACHED' 2>/dev/null || echo "check pfSense ssh" +ssh -J breakglass admin@192.168.1.13 'echo SYNOLOGY_REACHED' 2>/dev/null || echo "check synology ssh" +``` +Expected: confirms you can reach pfSense + Synology *through* break-glass (so closing Rule 6 loses nothing). + +- [ ] **Step 4: LAN admin unaffected** + +From the home LAN: `ssh -p 22 root@192.168.1.127 'echo LAN22_OK'` → `LAN22_OK`. + +**GATE:** Only proceed to Phase 5 once Steps 1–4 pass. If any fail, fix before removing the legacy forward. + +--- + +## Phase 5 — Router cleanup (LIVE router change — Viktor executes, AFTER Phase 4-pre passes) + +> AX6000 UI. One pass, all three changes.
+ +- [ ] **Step 1: Remove the Synology SSH exposure (Rule 6)** + - Advanced → NAT Forwarding → Port Forwarding → delete (or disable) rule **`HTTP` / 3333 → 192.168.1.13:22**. + +- [ ] **Step 2: Delete the stale Proxmox rule (Rule 3)** + - Delete the disabled rule **`proxmox` / 8006 → 192.168.1.127**. + +- [ ] **Step 3: Disable UPnP** + - Advanced → NAT Forwarding → UPnP → toggle **OFF**. (Tailscale on `.101` falls back to DERP relay; the `41643→pfSense` mapping drops.) + +- [ ] **Step 4: Verify the Synology SSH is gone from the WAN, break-glass still works** + +From an external network: +```bash +nc -z -w3 viktorbarzin.ddns.net 3333 && echo "STILL_OPEN(bad)" || echo "SYNOLOGY_SSH_CLOSED_OK" +bg 'echo BREAKGLASS_STILL_OK' +``` +Expected: `SYNOLOGY_SSH_CLOSED_OK` and `BREAKGLASS_STILL_OK`. + +--- + +## Phase 6 — Docs + commit (AFTER infra repo is clean) + +- [ ] **Step 1: Update `docs/architecture/vpn.md`** — add a "Break-glass SSH" section (knock-gated SSH to PVE host, client `bg()`, cheat-sheet IPs). +- [ ] **Step 2: Update `docs/architecture/security.md` + the Wave-1 note in `infra/.claude/CLAUDE.md`** — record the deliberate knock-gated exception; **correct the WAN-exposure inventory** (actual `.1` forwards are qbittorrent/stun/turn→pfSense + the new break-glass; Synology SSH removed; UPnP disabled; Remote Management off). +- [ ] **Step 3: New runbook `docs/runbooks/breakglass-ssh.md`** — connect procedure, knock/key rotation, re-adding `.1` forwards after a router reset. 
+- [ ] **Step 4: Commit the design + plan + doc updates** (only once Viktor confirms the repo is committable): + +```bash +git -C /home/wizard/code/infra add \ + docs/plans/2026-05-30-breakglass-ssh-access-design.md \ + docs/plans/2026-05-30-breakglass-ssh-access-plan.md \ + docs/architecture/vpn.md docs/architecture/security.md \ + docs/runbooks/breakglass-ssh.md .claude/CLAUDE.md +git -C /home/wizard/code/infra commit -m "docs+feat: break-glass knock-gated SSH; retire Synology SSH forward; disable UPnP [ci skip]" +git -C /home/wizard/code/infra push origin master +``` + +--- + +## Self-review + +- **Spec coverage:** key-only SSH ✅ (1.3), knock gate ✅ (1.4/1.5), invisibility ✅ (4-pre.1), full-LAN via jump ✅ (4-pre.3), no-lockout ✅ (1.1/1.3.4), Wave-1 exception doc ✅ (6.2), close legacy SSH ✅ (5.1), UPnP ✅ (5.3). All design §sections map to a task. +- **Placeholder scan:** no TBDs; secret values are generated + Vault-stored, referenced via `vault kv get` (concrete, not placeholders). +- **Consistency:** port `52222`, knock from `secret/viktor/breakglass_knock_sequence`, key `~/.ssh/breakglass_ed25519`, host `192.168.1.127` used consistently throughout. +- **Open verify items** (flagged inline, non-blocking): #1 `.1` SNAT behaviour (2.3), pve-firewall coexistence (1.1.2). diff --git a/docs/plans/2026-06-01-t3-auto-provision-design.md b/docs/plans/2026-06-01-t3-auto-provision-design.md index 1c34e8a8..ca9a7e1f 100644 --- a/docs/plans/2026-06-01-t3-auto-provision-design.md +++ b/docs/plans/2026-06-01-t3-auto-provision-design.md @@ -85,12 +85,14 @@ Replaces the in-cluster nginx `t3-dispatch` (the session-mint needs `sudo` + loc Per request (Authentik forward-auth has injected a trustworthy `X-authentik-username`): 1. Resolve `X-authentik-username` → OS user via `/etc/ttyd-user-map`. No mapping → **403**. -2. **Has a valid t3 session cookie?** → reverse-proxy (incl. WebSocket upgrade) to `127.0.0.1:<T3_PORT>`. (Steady state — the common path.) -3. 
**No cookie** (first visit / expired) → auto-pair: +2. **Has a valid t3 session cookie?** → reverse-proxy (incl. WebSocket upgrade) to `127.0.0.1:<T3_PORT>`. (Steady state — the common path.) Sub-requests (XHR/asset/WebSocket) take the cookie at face value; on a **top-level document navigation** the cookie is verified against the instance's `GET /api/auth/session` so a present-but-dead cookie doesn't slip through. +3. **No cookie, or an invalid cookie on a document navigation** (first visit / expired / server-side session wiped) → auto-pair: - `sudo -u <os_user> t3 auth pairing create --base-dir /home/<os_user>/.t3 --ttl 5m --json` → one-time token. - exchange it at the instance's `POST /api/auth/bootstrap` → capture the returned `Set-Cookie`. - relay that `Set-Cookie` to the browser + `302 /`. Browser now holds the t3 session cookie → next request is the steady-state path. **Login → straight in.** +> **As-built note (2026-06-09):** the first implementation re-paired only on an *absent* cookie. After an auth-schema rollback wiped every server-side session, browsers still held live-looking-but-dead 30-day `t3_session` cookies, which the dispatcher proxied straight through → t3 rendered its pair page (the "all users must pair again" incident). Fixed by validating a present cookie via `/api/auth/session` and re-pairing on `authenticated:false` — **gated to document navigations** (`isDocumentNav`: trust `Sec-Fetch-Dest: document`, else fall back to `Accept: text/html`) so XHR/asset/WebSocket sub-requests are never answered with a `302`, and **fail-open** (proxy through) on any validation error so no new failure mode is introduced. See `scripts/t3-dispatch/main.go` (`sessionValid`, `isDocumentNav`) + `main_test.go`. + Implementation: a small reverse proxy that supports WebSocket upgrade (Go `httputil.ReverseProxy`, or Python aiohttp) — chosen at plan time. ### 4. 
Terraform — `stacks/t3code` shrinks diff --git a/docs/plans/2026-06-04-pve-fan-control-design.md b/docs/plans/2026-06-04-pve-fan-control-design.md index a2d0216f..ccc972b3 100644 --- a/docs/plans/2026-06-04-pve-fan-control-design.md +++ b/docs/plans/2026-06-04-pve-fan-control-design.md @@ -1,10 +1,32 @@ # PVE R730 presence-aware fan control — design **Date:** 2026-06-04 -**Status:** implemented +**Status:** implemented; **redesigned 2026-06-08, anti-flap 2026-06-15** (see update below) **Scripts:** `infra/scripts/fan-control.{sh,service,env.example}`, `test-fan-control.sh` **Runbook:** `infra/docs/runbooks/fan-control.md` +> ## Update — control moved to HA; host is a thin actuator +> +> - **2026-06-07:** presence/two-curve scheme replaced by a single linear curve; +> all garage-presence logic removed. +> - **2026-06-08:** **all control moved into Home Assistant.** HA owns the curve +> thresholds, duty %, an additive **bias** (replaces the ease-down hysteresis), +> plus manual/lock, and publishes `sensor.r730_fan_command_pct = +> clamp(curve(temp)+bias, 0..100)` with an asymmetric output deadband. The host +> `fan-control.sh` is now a **thin actuator**: read that one number, validate, +> apply over IPMI — no local math. Independent host safety (CPU≥83 °C, IPMI +> fail, HA loss) hands the fans to Dell auto. It's a P controller, so the curve +> slope/offset set the steady-state equilibrium temperature (not a setpoint). +> - **2026-06-15:** daemon **anti-flap** — on a transient HA miss it HOLDS the +> last applied % for `HA_GRACE_SECS` (300 s) instead of dumping to Dell auto, +> and `STALE_SECS` loosened 120→1800 (staleness only happens at flat temp, +> where the held value is still valid). Killed a ~14%-of-the-time flap to the +> Dell floor; verified fallback 14%→0%, command std 16→3 over 8 h. +> +> The HA objects (sliders, command template, display/equilibrium sensors, +> Lock/Override, dashboard cards, REST sensors) live on ha-sofia, not this repo. 
+> Sections below are retained as historical context. + ## Problem The Dell R730 PVE host (192.168.1.127) runs its CPU at ~72–77°C under normal diff --git a/docs/plans/2026-06-07-multi-user-workstation-design.md b/docs/plans/2026-06-07-multi-user-workstation-design.md index 2148ae27..4d80eae4 100644 --- a/docs/plans/2026-06-07-multi-user-workstation-design.md +++ b/docs/plans/2026-06-07-multi-user-workstation-design.md @@ -110,7 +110,7 @@ The Config base / machine-wide managed layer is **secret-free**. Everything carr | Auth / token | Lives in (per-user, `0600`) | New-user provisioning (from Vault) | |---|---|---| -| **Claude OAuth** | `~/.claude/.credentials.json` (or `CLAUDE_CODE_OAUTH_TOKEN`) | the shared Enterprise token (earlier decision) **or** own interactive login; emo keeps his own | +| **Claude OAuth** | `~/.claude/.credentials.json` + isolated Vault backup | own Enterprise SSO login; Claude refreshes locally and `claude-auth-sync@<user>.timer` validates/backs up/recovers `claudeAiOauth` at `secret/workstation/claude-users/<os_user>`; shared token injection is forbidden | | **`claude_memory` MCP** | `~/.claude.json` mcpServers + `MEMORY_API_KEY` in `settings.json` env | **DEFERRED — not a risk now (Viktor, 2026-06-08).** Per-user memory isolation needs a service-side `_key_to_user` map edit + redeploy (claude-memory-mcp, GHA repo 78), not just a Vault write — NOT built now. For now a new user gets a simple key or omits memory; revisit if isolation becomes a concern. 
| | **`ha` MCP** (token-in-URL) | `~/.claude.json` | shared `ha_sofia_mcp_url` from Vault `secret/openclaw` (one HA instance; shared secret, per-user file) — only if HA-eligible | | **`playwright` MCP** | per-user systemd unit (own port) + localhost entry | existing per-user playwright pattern (id=4015); non-secret | @@ -166,6 +166,8 @@ Design principle: **every bit of devvm setup is an idempotent git script** — n - **ADR-0002 — devvm Linux users, not K8s ephemeral pods.** Re-platforming is overkill at this scale; config-push is easier on one host. - **ADR-0003 — Config inheritance via native machine-wide layers + per-user override.** Rejected: periodic sync, OverlayFS (no live lowerdir edits), Nix (rebuild not live). - **ADR-0004 — Infra access via per-user writable git-crypt-locked clones (changes ungated).** Each non-admin gets their own writable, keyless (locked) clone — read + edit + push freely, no PR gate. Safe because infra apply is manual + admin-only (push ≠ apply, id=4355) and the clone can't decrypt secrets. Rejected: the shared read-only mirror (gated changes) and the shared unlocked tree (secret leak + commit entanglement). Trade: repo-local CLAUDE.md updates via pull, not live (global config inheritance stays live via §4). + - **AMENDED 2026-06-10 — the "push ≠ apply" premise was WRONG.** The Forgejo→Woodpecker webhook on `viktor/infra` fires `.woodpecker/default.yml` on `push` to `master` (`require_approval: forks` only), which terragrunt-applies changed stacks — so an ungated master push IS a deploy. Enforcement added instead of dropping the ADR: Forgejo **branch protection on `master`** (push + merge whitelists = `viktor`, deploy keys allowed). Non-admins keep free branch pushes + PRs; only admin merges land on master. "No PR gate" is thereby reversed for non-admins; the rest of the ADR (per-user locked clones) stands. As-built: `../architecture/multi-tenancy.md` → "Contribute access". 
+ - **AMENDED AGAIN 2026-06-10 (later) — allow-then-audit.** Viktor granted emo (`ebarzin`) direct master push ("he's allowed to make any change; what matters is tracking what changed and why"). The PR gate is dropped FOR WHITELISTED USERS; tracking is enforced instead: agent-written commit messages must carry the user's plain-language intent (the WHY), a `notify-nonadmin-push` Slack step in `.woodpecker/default.yml` surfaces every non-admin master push, `[ci skip]` is forbidden for non-admins, and force-push stays disabled (append-only history). Accepted consequence: emo's pushes auto-apply changed stacks via CI. Branch protection + the PR fallback remain for non-whitelisted users. - **ADR-0005 — Power-user = cluster-wide read-only (no Secrets), via a NEW dedicated ClusterRole.** Re-widens cross-tenant READ for the trusted power-user tier only — but via a NEW `oidc-power-user-readonly` ClusterRole (get/list/watch, NO `secrets`), NOT the existing `oidc-power-user` (which grants read+write+Secrets and is unbound). Bound to the user's OIDC identity (kubelogin) — the apiserver accepts Authentik OIDC for the `kubernetes` audience; the dashboard's SA-token pattern is for the dashboard UI only. - **ADR-0006 — The roster is the single source of truth for the FULL lifecycle.** `roster.yaml` drives onboard *and* offboard; `/etc/ttyd-user-map`, `dispatch.json`, and Authentik `T3 Users` membership are *derived* from it, and tier is *validated* against `k8s_users` (fail-loud on mismatch). Rejected: hand-maintaining the four membership lists in parallel (guaranteed drift). Offboarding is first-class + staged (reversible cut → cluster revoke → gated `userdel`), not an afterthought. - **ADR-0007 — Add swap + a capacity budget to the devvm before onboarding active users.** A shared 24 GB / **0-swap** host OOM-kills live sessions under multi-user load (wizard alone runs ~20). Swap + a max-concurrent ceiling are prerequisites, not follow-ups. 
diff --git a/docs/plans/2026-06-07-multi-user-workstation-plan.md b/docs/plans/2026-06-07-multi-user-workstation-plan.md index 1bd3275c..50f788ed 100644 --- a/docs/plans/2026-06-07-multi-user-workstation-plan.md +++ b/docs/plans/2026-06-07-multi-user-workstation-plan.md @@ -129,6 +129,21 @@ users: ### Task 2.3: Inject per-user MCP + auth secrets (new users only; never clobber) +> **PARTIAL — per-user playwright browser MCP DONE (2026-06-16), reproducible from git.** +> Implemented NOT via the "write a fresh `~/.claude.json`" step below (that skips +> EXISTING users who have a `.claude.json` lacking the entry — emo + anca were +> exactly this: server running, never wired). Instead: `roster_engine.py` allocates +> a sticky per-user `PLAYWRIGHT_PORT` (`PLAYWRIGHT_BASE_PORT=8931`); `setup-devvm.sh` +> (§8c/§9e) stages the chrome-service token + installs **system-level template units** +> (`scripts/workstation/playwright/playwright-mcp@.service` + `…-snapshot-refresh@.{service,timer}`, +> no systemd --user / linger); `t3-provision-users.sh` `install_playwright()` (ALL +> tiers incl. admin) seeds the token if-absent, runs `claude mcp add --scope user +> playwright` AS the user (clobber-proof → fixes existing + new + admin), and +> `enable --now`s the instances. Replaced the hand-made `~/.config/systemd/user/playwright-*` +> units (one-time idle-gated migration). Runbook: `../runbooks/chrome-service-snapshot.md` +> → "Provisioning". **Still TODO in this task:** `ha`, `claude_memory`, +> `.credentials.json`, and the beads Dolt credential. 
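The sticky per-user port allocation mentioned in the note above can be pictured with a short sketch. It assumes that "sticky" means an existing assignment never moves and that a new user takes the lowest unused port at or above `PLAYWRIGHT_BASE_PORT`; the `allocate_ports` helper and its arguments are hypothetical and are not taken from `roster_engine.py`.

```python
PLAYWRIGHT_BASE_PORT = 8931  # base port quoted in the note above


def allocate_ports(users, existing, base=PLAYWRIGHT_BASE_PORT):
    """Hypothetical sketch: keep every known assignment and hand each new
    user the lowest unused port at or above `base`."""
    assigned = {u: existing[u] for u in users if u in existing}
    taken = set(existing.values())  # assumption: a port is never reused
    port = base
    for user in sorted(u for u in users if u not in assigned):
        while port in taken:
            port += 1
        assigned[user] = port
        taken.add(port)
    return assigned


first = allocate_ports(["wizard", "emo"], {})
second = allocate_ports(["wizard", "emo", "anca"], first)
print(first)
print(second)
```

Feeding the previous result back in as `existing` is what keeps the mapping stable across reconcile runs: only the newcomer receives a new port.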
+ **Files:** Modify `infra/scripts/t3-provision-users.sh` (add `install_user_secrets`) - [ ] **Step 1:** For each non-admin **without** an existing `~/.claude.json` (NEW users only — NEVER touch an existing one): write `~/.claude.json` with `playwright-shared` (localhost), `ha` (shared `ha_sofia_mcp_url` from Vault `secret/openclaw`) if HA-eligible, and `claude_memory` using a **shared/simple key (per-user memory isolation is DEFERRED — not a risk now)**. Seed `~/.claude/.credentials.json` with the shared Claude token (Vault) **or** leave absent for interactive login. **Drop the beads Dolt credential** into `~/code/.beads/` (`.beads-credential-key`, from Vault, or set `DOLT_REMOTE_PASSWORD`) so `bd` authenticates — it's git-ignored, so a fresh clone lacks it. All `0600`, owned by the user. Per-user `playwright-mcp` systemd unit on its own port (existing pattern, id=4015). @@ -171,6 +186,8 @@ users: ### Task 5.1: Cut emo over to his own writable locked clone (opt-in, reversible) +> **DONE 2026-06-10** (staged across 06-08 → 06-10), with two deviations: (1) step 4(c) **skipped deliberately** — the live `/etc/skel` shared base delivers `~/.claude/{rules,skills}` AS symlinks into the admin base, so emo's existing symlinks match the as-built design and were kept; (2) push access was **added** (not in this plan): `ebarzin` = write collaborator on Forgejo `viktor/infra` + PAT in `~/.git-credentials` + `forgejo` remote, with `master` branch-protected (see ADR-0004 amendment — push to master auto-applies via Woodpecker, so it is whitelist-gated to `viktor`). Verified: branch push OK, master push rejected, `code-shared` removed, admin tree unreadable as emo. + **Files:** none (host state; an explicit one-time action — NOT the routine reconcile) - [ ] **Step 1: Prereqs.** Confirm emo inherits config (Phase 1) + has his scoped kubeconfig (Phase 2). (Phase 3 deliberately SKIPPED emo — his clone is created *here*.) 
diff --git a/docs/plans/2026-06-09-workstation-authentik-membership-plan.md b/docs/plans/2026-06-09-workstation-authentik-membership-plan.md new file mode 100644 index 00000000..1803e04d --- /dev/null +++ b/docs/plans/2026-06-09-workstation-authentik-membership-plan.md @@ -0,0 +1,469 @@ +# Workstation Membership v2 — Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax. This is **infra** work: the engine tasks are real pytest TDD; the host/Authentik tasks "verify" via an idempotent re-run + a smoke check with expected output. Honor the Terraform-only rule for cluster/Authentik changes (`scripts/tg apply`); devvm host scripts are the accepted exception. Claim `host:devvm` before host mutations and `stack:authentik` before applying Authentik. + +**Goal:** Make the Authentik `T3 Users` group membership the single source of truth for who gets a devvm workstation account, identified by email; retire `roster.yaml`. + +**Architecture:** The provisioner reads `T3 Users` members from the Authentik API (read-only token) instead of `roster.yaml`. A pure engine derives the Linux `os_user` from each member's email (or an `os_user` Authentik attribute override) and produces the same desired-state shape v1 already applies. Workstation access stays fully decoupled from cluster RBAC (`k8s_users` untouched). wizard is special-cased as the admin/owner. + +**Tech Stack:** Python (pure engine, pytest) + Bash (provisioner) + `jq`/`curl` (Authentik API) + Terraform (`stacks/authentik`: read-only token, drop HCL members). + +**Design:** `infra/docs/plans/2026-06-09-workstation-authentik-membership-design.md`. + +--- + +## File structure + +- Modify: `infra/scripts/workstation/roster_engine.py` — add `derive_os_user()` + `roster_from_members()` (pure). 
+- Modify: `infra/scripts/workstation/test_roster_engine.py` — tests for the two new functions. +- Modify: `infra/scripts/t3-provision-users.sh` — source members from the Authentik API instead of `roster.yaml`. +- Modify: `infra/scripts/workstation/setup-devvm.sh` — drop the read-only Authentik token to `/etc/t3-serve/authentik-token`. +- Create: `infra/stacks/authentik/t3-provision-token.tf` — read-only service account + API token. +- Modify: `infra/stacks/authentik/t3-users.tf` — drop the HCL `users` list (membership becomes Authentik-managed). +- Delete: `infra/scripts/workstation/roster.yaml` (Task 7). +- Modify: `infra/.claude/reference/service-catalog.md`, `infra/docs/architecture/multi-tenancy.md` (Task 7). + +--- + +## Task 1: Engine — `derive_os_user()` + +**Files:** Modify `infra/scripts/workstation/roster_engine.py`; Test `infra/scripts/workstation/test_roster_engine.py` + +- [ ] **Step 1: Write the failing tests** (append to `test_roster_engine.py`) + +```python +# --- derive_os_user: email/attribute -> Linux username (v2) --- + +def test_derive_os_user_sanitizes_email_local_part(): + assert eng.derive_os_user("emil.barzin@gmail.com", None) == "emil_barzin" + + +def test_derive_os_user_attribute_overrides(): + assert eng.derive_os_user("emil.barzin@gmail.com", "emo") == "emo" + + +def test_derive_os_user_lowercases_and_replaces_unsafe_runs(): + assert eng.derive_os_user("Weird.Name+tag@x.com", None) == "weird_name_tag" + + +def test_derive_os_user_truncates_to_32(): + long = ("a" * 40) + "@x.com" + assert eng.derive_os_user(long, None) == "a" * 32 + + +def test_derive_os_user_blank_attribute_is_ignored(): + assert eng.derive_os_user("emil.barzin@gmail.com", "") == "emil_barzin" +``` + +- [ ] **Step 2: Run to verify they fail** + +Run: `cd infra/scripts/workstation && python3 -m pytest test_roster_engine.py -k derive_os_user -q` +Expected: FAIL — `AttributeError: module 'roster_engine' has no attribute 'derive_os_user'` + +- [ ] **Step 3: Implement** 
(add to `roster_engine.py`, after `RosterError`)
+
+```python
+import re  # place with the module's top-level imports (avoids ruff E402)
+
+_MAX_USERNAME = 32
+
+
+def derive_os_user(email: str, os_user_attr: str | None) -> str:
+    """Linux username for a workstation member: the explicit `os_user` Authentik
+    attribute if set, else the email local-part sanitized to a valid username
+    (lowercase; runs of non [a-z0-9_-] -> '_'; stripped; <=32 chars)."""
+    if os_user_attr:
+        return os_user_attr
+    local = email.split("@", 1)[0].lower()
+    cleaned = re.sub(r"[^a-z0-9_-]+", "_", local).strip("_")
+    return cleaned[:_MAX_USERNAME]
+```
+
+- [ ] **Step 4: Run to verify they pass**
+
+Run: `python3 -m pytest test_roster_engine.py -k derive_os_user -q`
+Expected: PASS (5 passed)
+
+- [ ] **Step 5: Commit**
+
+```bash
+cd /home/wizard/code/infra
+git add scripts/workstation/roster_engine.py scripts/workstation/test_roster_engine.py
+git commit -m "workstation: engine derive_os_user (email/attribute -> Linux username)"
+```
+
+---
+
+## Task 2: Engine — `roster_from_members()`
+
+Builds a `Roster` (the v1 type `derive_desired_state` already consumes) from the Authentik member list, so the existing tested derivation is reused unchanged.
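Why the member list needs a collision check can be seen in a small standalone sketch. It copies the sanitization rule from the Task 1 step above; `find_collisions` and the sample addresses are hypothetical and exist only for illustration.

```python
import re
from collections import defaultdict
from typing import Optional

_MAX_USERNAME = 32


def derive_os_user(email: str, os_user_attr: Optional[str]) -> str:
    # Same rule as the Task 1 snippet: the attribute wins, else sanitize the local part.
    if os_user_attr:
        return os_user_attr
    local = email.split("@", 1)[0].lower()
    return re.sub(r"[^a-z0-9_-]+", "_", local).strip("_")[:_MAX_USERNAME]


def find_collisions(emails):
    """Hypothetical helper: group emails by derived username and keep only
    the usernames claimed by more than one email."""
    by_user = defaultdict(list)
    for email in emails:
        by_user[derive_os_user(email, None)].append(email)
    return {user: owners for user, owners in by_user.items() if len(owners) > 1}


collisions = find_collisions(["jane.doe@x.com", "jane_doe@y.com", "bob@x.com"])
print(collisions)
```

Two distinct addresses can sanitize to the same Linux username, which is the case an explicit `os_user` attribute on one of the accounts would resolve.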
+ +**Files:** Modify `roster_engine.py`; Test `test_roster_engine.py` + +- [ ] **Step 1: Write the failing tests** + +```python +# --- roster_from_members: Authentik members -> Roster (v2) --- + +MEMBERS = [ + {"email": "vbarzin@gmail.com", "os_user": "wizard"}, + {"email": "emil.barzin@gmail.com", "os_user": "emo"}, + {"email": "ancaelena98@gmail.com", "os_user": "ancamilea"}, +] +ADMINS = {"vbarzin@gmail.com"} + + +def test_roster_from_members_maps_identity_fields(): + r = eng.roster_from_members(MEMBERS, ADMINS) + u = r.users["emo"] + assert u.os_user == "emo" + assert u.authentik_user == "emil.barzin" # email local-part = t3-dispatch key + assert u.k8s_user == "emil.barzin@gmail.com" # email = identity + assert u.tier == "power-user" # non-admin + + +def test_roster_from_members_admin_by_email(): + r = eng.roster_from_members(MEMBERS, ADMINS) + assert r.users["wizard"].tier == "admin" + + +def test_roster_from_members_derives_os_user_when_no_override(): + r = eng.roster_from_members([{"email": "jane.doe@x.com", "os_user": None}], set()) + assert "jane_doe" in r.users + assert r.users["jane_doe"].tier == "power-user" + + +def test_roster_from_members_raises_on_os_user_collision(): + members = [{"email": "a@x.com", "os_user": "dup"}, {"email": "b@y.com", "os_user": "dup"}] + with pytest.raises(eng.RosterError, match="collision"): + eng.roster_from_members(members, set()) + + +def test_roster_from_members_reuses_derive_desired_state(): + r = eng.roster_from_members(MEMBERS, ADMINS) + ds = eng.derive_desired_state(r, {"wizard": 3773, "emo": 3774, "ancamilea": 3775}) + assert ds.dispatch["emil.barzin"] == {"os_user": "emo", "port": 3774} + assert ds.accounts["wizard"].groups == ("code-shared", "docker", "sudo") + assert ds.accounts["emo"].groups == () +``` + +- [ ] **Step 2: Run to verify they fail** + +Run: `python3 -m pytest test_roster_engine.py -k roster_from_members -q` +Expected: FAIL — `AttributeError: ... 
'roster_from_members'` + +- [ ] **Step 3: Implement** (add to `roster_engine.py`) + +```python +def roster_from_members(members: list[dict], admin_emails: set[str]) -> Roster: + """Build a Roster from Authentik `T3 Users` members. Each member dict has + `email` and optional `os_user`. tier = admin iff the email is in admin_emails, + else power-user (a non-admin workstation: no groups, locked clone). Raises on + an os_user collision (two emails resolving to the same Linux username).""" + users: dict[str, User] = {} + for m in members: + email = m["email"] + os_user = derive_os_user(email, m.get("os_user")) + if os_user in users: + raise RosterError( + f"os_user collision: {email!r} and {users[os_user].k8s_user!r} " + f"both resolve to {os_user!r} (set an os_user attribute to disambiguate)" + ) + tier = "admin" if email in admin_emails else "power-user" + users[os_user] = User( + os_user=os_user, + authentik_user=email.split("@", 1)[0], + k8s_user=email, + tier=tier, + namespaces=(), + ) + return Roster(users) +``` + +- [ ] **Step 4: Run the whole suite** + +Run: `python3 -m pytest test_roster_engine.py -q && ruff check roster_engine.py test_roster_engine.py` +Expected: PASS (all, incl. the v1 tests) + ruff clean + +- [ ] **Step 5: Commit** + +```bash +git add scripts/workstation/roster_engine.py scripts/workstation/test_roster_engine.py +git commit -m "workstation: engine roster_from_members (Authentik members -> Roster, reuses derive)" +``` + +--- + +## Task 3: Read-only Authentik token (Terraform) + +**Files:** Create `infra/stacks/authentik/t3-provision-token.tf` + +- [ ] **Step 1: Write the resources** (service account + API token + view permissions) + +```hcl +# Read-only service account whose token the devvm provisioner uses to list +# "T3 Users" members. View-only: it can read users + groups, nothing else. 
+resource "authentik_user" "t3_provision" { + username = "t3-provision-bot" + name = "T3 Provision (read-only)" + type = "service_account" + path = "service-accounts" +} + +resource "authentik_token" "t3_provision" { + identifier = "t3-provision-readonly" + user = authentik_user.t3_provision.id + intent = "api" + description = "devvm t3-provision-users: read T3 Users membership" + retrieve_key = true +} + +# Global view permissions for the service account (users + groups read only). +resource "authentik_rbac_permission_user" "t3_provision_view_user" { + user = authentik_user.t3_provision.id + permission = "authentik_core.view_user" +} + +resource "authentik_rbac_permission_user" "t3_provision_view_group" { + user = authentik_user.t3_provision.id + permission = "authentik_core.view_group" +} + +output "t3_provision_token" { + value = authentik_token.t3_provision.key + sensitive = true +} +``` + +- [ ] **Step 2: Apply** (claim first) + +```bash +~/code/scripts/presence claim stack:authentik --purpose "v2: read-only t3-provision token" +export VAULT_ADDR=https://vault.viktorbarzin.me && vault login -method=oidc +cd /home/wizard/code/infra/stacks/authentik && ../../scripts/tg apply -target=authentik_user.t3_provision -target=authentik_token.t3_provision -target=authentik_rbac_permission_user.t3_provision_view_user -target=authentik_rbac_permission_user.t3_provision_view_group --non-interactive +``` +Expected: 4 added. (If the `authentik_rbac_permission_user` resource/permission codename differs in the installed provider, run `../../scripts/tg console` / check the provider docs and adjust the codename; verify in Step 3.) + +- [ ] **Step 3: Store the token in Vault + verify it is read-only** + +```bash +TOK=$(../../scripts/tg output -raw t3_provision_token) +vault kv patch secret/authentik t3_provision_token="$TOK" +# verify: can LIST T3 Users members... 
+curl -sk -H "Authorization: Bearer $TOK" "https://authentik.viktorbarzin.me/api/v3/core/users/?groups_by_name=T3%20Users" | jq -r '.results[].email' +# ...but CANNOT write (expect 403): +curl -sk -o /dev/null -w '%{http_code}\n' -X PATCH -H "Authorization: Bearer $TOK" -H 'Content-Type: application/json' -d '{"name":"x"}' "https://authentik.viktorbarzin.me/api/v3/core/users/14/" +``` +Expected: the three emails listed; the PATCH returns `403`. + +- [ ] **Step 4: Commit** + +```bash +git add stacks/authentik/t3-provision-token.tf +git commit -m "workstation: read-only Authentik token for the t3-provision membership query" +``` + +--- + +## Task 4: setup-devvm.sh — stage the token for the root provisioner + +**Files:** Modify `infra/scripts/workstation/setup-devvm.sh` + +- [ ] **Step 1: Add a token-staging step** (after step 6, before the final `log "OK"`). The hourly provisioner runs as root with no Vault token, so `setup-devvm.sh` (run by wizard, who can read Vault) drops it to a root-only file. + +```bash +# 8) stage the read-only Authentik token for the root provisioner's membership query. +if command -v vault >/dev/null; then + export VAULT_ADDR="${VAULT_ADDR:-https://vault.viktorbarzin.me}" + if tok="$(vault kv get -field=t3_provision_token secret/authentik 2>/dev/null)"; then + install -m 0600 /dev/stdin /etc/t3-serve/authentik-token <<<"$tok" + log "staged /etc/t3-serve/authentik-token (read-only Authentik API)" + else + log "WARN: t3_provision_token not in Vault -> Authentik membership query will be skipped" + fi +fi +``` + +- [ ] **Step 2: Run + verify** + +Run: `sudo bash /home/wizard/code/infra/scripts/workstation/setup-devvm.sh 2>&1 | grep -E 'authentik-token|OK'` then `sudo stat -c '%a %U' /etc/t3-serve/authentik-token` +Expected: "staged ... authentik-token" + `OK`; perms `600 root`. 
+ +- [ ] **Step 3: Commit** + +```bash +git add scripts/workstation/setup-devvm.sh +git commit -m "workstation: setup-devvm.sh stages the read-only Authentik token (root-only)" +``` + +--- + +## Task 5: Provisioner — source members from Authentik (replace roster.yaml) + +**Files:** Modify `infra/scripts/t3-provision-users.sh` + +- [ ] **Step 1: Add a members-fetch + swap the engine call.** Replace the roster-read/derive block. Fetch members from Authentik (best-effort); build the members JSON `[{email, os_user}]`; pass to the engine via a new `--members-json` mode on `derive`. + +First extend the engine CLI (`roster_engine.py` `_main`): add `derive-members` that reads a members JSON + ports JSON + admin emails and emits the same desired-state JSON. + +```python +# in _main(), add a subparser: + pm = sub.add_parser("derive-members", help="desired state from an Authentik member list") + pm.add_argument("--members-json", required=True) + pm.add_argument("--ports-json", required=True) + pm.add_argument("--admin-emails", default="", help="comma-separated admin emails") + # ...in the dispatch: + if args.cmd == "derive-members": + with open(args.members_json, encoding="utf-8") as fh: + members = json.load(fh) + with open(args.ports_json, encoding="utf-8") as fh: + ports = json.load(fh) + admins = {e for e in args.admin_emails.split(",") if e} + ds = derive_desired_state(roster_from_members(members, admins), ports) + json.dump(_desired_state_to_dict(ds), sys.stdout, indent=2, sort_keys=True) + sys.stdout.write("\n") + return 0 +``` + +In `t3-provision-users.sh`, replace the `ROSTER`/validate/derive section with: + +```bash +AUTHENTIK_URL="${AUTHENTIK_URL:-https://authentik.viktorbarzin.me}" +TOKEN_FILE="${TOKEN_FILE:-/etc/t3-serve/authentik-token}" +T3_GROUP="${T3_GROUP:-T3 Users}" +ADMIN_EMAILS="${WORKSTATION_ADMIN_EMAILS:-vbarzin@gmail.com}" + +members_file="$(mktemp)"; trap 'rm -f "$ports_file" "$members_file" "${desired_file:-}"' EXIT +if [[ -r "$TOKEN_FILE" ]]; then + 
tok="$(cat "$TOKEN_FILE")" + if curl -sf -H "Authorization: Bearer $tok" --get \ + --data-urlencode "groups_by_name=$T3_GROUP" \ + "$AUTHENTIK_URL/api/v3/core/users/" \ + | jq -c '[.results[] | select(.is_active) | {email: .email, os_user: (.attributes.os_user // null)}]' \ + > "$members_file" && [[ -s "$members_file" ]]; then + : + else + log "WARN: Authentik membership query failed -> no membership change this run"; echo '[]' > "$members_file" + SKIP_RECONCILE=1 + fi +else + log "WARN: $TOKEN_FILE absent -> no membership change this run"; echo '[]' > "$members_file"; SKIP_RECONCILE=1 +fi + +if [[ "${SKIP_RECONCILE:-0}" == 1 ]]; then log "reconcile skipped (no Authentik membership)"; exit 0; fi + +desired_file="$(mktemp)" +python3 "$ENGINE" derive-members --members-json "$members_file" --ports-json "$ports_file" --admin-emails "$ADMIN_EMAILS" > "$desired_file" +jq -e . "$desired_file" >/dev/null || { echo "[t3-provision] derive-members produced invalid JSON" >&2; exit 1; } +``` + +(Keep steps 4-6 of the existing script — accounts/groups/clone/kubeconfig, .env/enable, regen map/dispatch — unchanged; they consume `$desired_file`.) + +- [ ] **Step 2: shellcheck + DRY_RUN** (with the staged token present) + +Run: `cd /home/wizard/code/infra/scripts && shellcheck -S warning t3-provision-users.sh && sudo DRY_RUN=1 bash t3-provision-users.sh 2>&1 | grep -iE 'clone|kubeconfig|reconcile|WARN'` +Expected: shellcheck clean; dry-run lists the current members, no account creations (all exist), "reconcile complete (DRY-RUN)". + +- [ ] **Step 3: Real run + verify it reproduces current state** + +Run: `sudo jq -S . /etc/t3-serve/dispatch.json > /tmp/d1; sudo DRY_RUN=0 bash t3-provision-users.sh >/dev/null 2>&1; sudo jq -S . /etc/t3-serve/dispatch.json > /tmp/d2; diff /tmp/d1 /tmp/d2 && echo SAME; id -nG emo` +Expected: `SAME` (dispatch content unchanged); emo groups unchanged. Redeploy: `sudo install -m0755 t3-provision-users.sh /usr/local/bin/t3-provision-users`. 
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add scripts/t3-provision-users.sh scripts/workstation/roster_engine.py scripts/workstation/test_roster_engine.py
+git commit -m "workstation: provisioner sources members from Authentik T3 Users (replaces roster.yaml)"
+```
+
+---
+
+## Task 6: Authentik — Authentik-managed membership + legacy os_user attributes
+
+**Files:** Modify `infra/stacks/authentik/t3-users.tf`; set user attributes via API.
+
+- [ ] **Step 1: Set the legacy os_user attributes** (the 3 existing accounts don't derive from their emails). Read-merge-write so existing attributes are preserved (Authentik PATCH replaces the `attributes` dict).
+
+```bash
+export VAULT_ADDR=https://vault.viktorbarzin.me
+TOK=$(vault kv get -field=tf_api_token secret/authentik)
+A=https://authentik.viktorbarzin.me/api/v3
+set_os_user() { # $1=username $2=os_user
+  local pk attrs
+  pk=$(curl -sk -H "Authorization: Bearer $TOK" "$A/core/users/?username=$1" | jq '.results[0].pk')
+  attrs=$(curl -sk -H "Authorization: Bearer $TOK" "$A/core/users/$pk/" | jq -c --arg o "$2" '.attributes + {os_user:$o}')
+  curl -sk -X PATCH -H "Authorization: Bearer $TOK" -H 'Content-Type: application/json' \
+    -d "{\"attributes\":$attrs}" "$A/core/users/$pk/" | jq -r '.username + " os_user=" + .attributes.os_user'
+}
+set_os_user "vbarzin@gmail.com" wizard
+set_os_user "emil.barzin@gmail.com" emo
+set_os_user "ancaelena98@gmail.com" ancamilea
+```
+Expected: three lines, each confirming the `os_user=` value that was set.
+
+- [ ] **Step 2: Drop the HCL `users` list** so membership is Authentik-managed. Edit `t3-users.tf`: remove the `users = [...]` argument from `resource "authentik_group" "t3_users"` (and remove the `data "authentik_user"` lookups too if they are now unused). Leave the group resource (name only).
+ +```hcl +resource "authentik_group" "t3_users" { + name = "T3 Users" + # Membership is managed in Authentik (UI/API), not Terraform — the devvm + # provisioner reconciles workstation accounts from this group's members. +} +``` + +- [ ] **Step 3: Apply + verify members unchanged** + +```bash +cd /home/wizard/code/infra/stacks/authentik && ../../scripts/tg apply -target=authentik_group.t3_users --non-interactive +curl -sk -H "Authorization: Bearer $TOK" "$A/core/groups/?search=T3%20Users" | jq -r '.results[0].users_obj[].username' +``` +Expected: apply shows the group updated (no member change / the `users` field no longer managed); the 3 members still listed. + +- [ ] **Step 4: Commit** + +```bash +git add stacks/authentik/t3-users.tf +git commit -m "workstation: T3 Users membership is Authentik-managed (drop HCL member list)" +``` + +--- + +## Task 7: Retire roster.yaml + update docs + +**Files:** Delete `infra/scripts/workstation/roster.yaml`; modify `service-catalog.md`, `multi-tenancy.md`. + +- [ ] **Step 1: Confirm nothing reads roster.yaml anymore** + +Run: `grep -rn 'roster.yaml\|roster_engine.*roster\b' /home/wizard/code/infra/scripts /home/wizard/code/infra/docs | grep -v 'load_roster\|test_\|design.md\|-plan.md'` +Expected: no live references in the provisioner (the engine keeps `load_roster` for tests, that's fine). + +- [ ] **Step 2: Delete it + update the service-catalog t3code row** — change "Source of truth = roster.yaml" to "Source of truth = the Authentik `T3 Users` group (members → accounts via the read-only API token); `os_user` from the email or a per-user `os_user` attribute". Update the multi-tenancy Workstation section's "single source of truth" line likewise. 
+ +```bash +git rm scripts/workstation/roster.yaml +# (edit service-catalog.md + multi-tenancy.md per above) +``` + +- [ ] **Step 3: Commit** + +```bash +# roster.yaml's deletion is already staged by `git rm` in Step 2; re-adding the removed path would fail with a pathspec error +git add .claude/reference/service-catalog.md docs/architecture/multi-tenancy.md +git commit -m "workstation: retire roster.yaml — Authentik T3 Users group is the membership SSoT" +``` + +--- + +## Task 8: End-to-end smoke (add + remove a throwaway member) + +- [ ] **Step 1: Add a throwaway test member** to `T3 Users` in Authentik (a test user, or temporarily add an existing one), set no `os_user` attribute. Run `sudo /usr/local/bin/t3-provision-users` and confirm an account `<derived>` is created (`id <derived>`), with a locked `~/code` (secret file shows `GITCRYPT`) and `~/.kube/config`. +- [ ] **Step 2: Remove the test member** from the group; run the reconcile; confirm they drop out of `/etc/ttyd-user-map` + `dispatch.json` (the reversible cut). Leave `userdel` to the gated offboarding runbook. +- [ ] **Step 3: Verify the 3 real users are intact** — `id emo` (groups unchanged), emo/ancamilea/wizard still in `dispatch.json`, their `t3-serve@` active, emo's locked clone + ancamilea's intact. + +--- + +## Self-review + +- **Spec coverage:** Authentik-as-SSoT (Tasks 5,6) · email identity + os_user derive/override (Tasks 1,6) · provisioner reads the API (Task 5) · read-only token for the root timer (Tasks 3,4) · roster.yaml retires (Task 7) · k8s_users/cluster untouched (no task touches it) · wizard special-cased (admin_emails, Task 2). All covered. +- **Type consistency:** `derive_os_user(email, os_user_attr)` and `roster_from_members(members, admin_emails)` used consistently; `members` dicts are `{email, os_user}`; reuses the existing `User`/`Roster`/`derive_desired_state`/`DesiredState`. +- **apiserver-OIDC:** out of scope here (kubectl auth method only) — flagged in the design; the generic kubeconfig task is unchanged from v1. 
+- **Open risk:** the `authentik_rbac_permission_user` resource name / permission codenames may differ in the installed provider version (Task 3) — Step 3 verifies read-works/write-403 and says to adjust if needed. diff --git a/docs/plans/2026-06-11-breakglass-ssh-redesign-design.md b/docs/plans/2026-06-11-breakglass-ssh-redesign-design.md new file mode 100644 index 00000000..d555d971 --- /dev/null +++ b/docs/plans/2026-06-11-breakglass-ssh-redesign-design.md @@ -0,0 +1,73 @@ +# Break-glass SSH — Redesign + +- **Date**: 2026-06-11 +- **Status**: Implemented +- **Owner**: Viktor +- **Supersedes**: `2026-05-30-breakglass-ssh-access-{design,plan}.md` (port-knock design) +- **As-built runbook**: `docs/runbooks/breakglass-ssh.md` + +## Why redesign + +The 2026-05-30 design gated a key-only SSH port on the Proxmox host behind a UDP +**port-knock** (knockd). It caused a real lockout, for a structural reason: + +- The knock sequence was 3 random ports stored **only** in Vault, and the client + helper fetched it from Vault at connect time. +- **Vault is in-cluster** and not publicly reachable (Wave-1 policy). In the + exact scenario break-glass exists for — away from home, cluster/tunnels down — + the knock sequence is unreachable and unmemorable. Circular dependency. + +The knock's only benefit was hiding an already brute-force-proof port; its cost +was that fragility. For a *recovery* path, robustness beats stealth. + +## Decision + +**Plain key-only SSH to the Proxmox host on `:52222`, openly reachable, no knock.** +Hardened with: the exposed port trusts only a dedicated break-glass key +(`Match LocalPort`), per-source connection rate-limiting (iptables hashlimit), +and fail2ban. Scenario covered: *cluster + tunnels down, host + pfSense + router +up* (the common "I'm away and need in" case — confirmed with Viktor; deeper +"pfSense wedged" / "host down" tiers are explicitly out of scope). 
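The two hardening pieces named in the decision can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of `scripts/sshd-10-breakglass.conf` or `scripts/breakglass-firewall.sh`: the key-file path, the `6/minute` rate, the burst value, and the LAN CIDR are all placeholders. The functions only print the config and the rules so they can be reviewed before anything is installed.

```bash
#!/usr/bin/env bash
# Sketch only: prints an sshd drop-in and iptables rules for review.
# Every concrete value below is an illustrative assumption.

BREAKGLASS_PORT=52222
BREAKGLASS_KEYS=/etc/ssh/authorized_keys.breakglass   # hypothetical path
LAN_CIDR=192.168.1.0/24                               # assumed LAN range

# sshd: listen on both ports, key-only. On the exposed port, trust only the
# dedicated break-glass key file, then end the Match block with "Match all".
print_sshd_dropin() {
  cat <<EOF
Port 22
Port ${BREAKGLASS_PORT}
PasswordAuthentication no
KbdInteractiveAuthentication no
Match LocalPort ${BREAKGLASS_PORT}
    AuthorizedKeysFile ${BREAKGLASS_KEYS}
Match all
EOF
}

# iptables: new connections to the exposed port go through a dedicated chain
# that lets the LAN bypass and drops sources exceeding the per-IP rate.
print_firewall_rules() {
  cat <<EOF
iptables -N BREAKGLASS
iptables -A INPUT -p tcp --dport ${BREAKGLASS_PORT} -m conntrack --ctstate NEW -j BREAKGLASS
iptables -A BREAKGLASS -s ${LAN_CIDR} -j ACCEPT
iptables -A BREAKGLASS -m hashlimit --hashlimit-name breakglass --hashlimit-mode srcip --hashlimit-above 6/minute --hashlimit-burst 3 -j DROP
iptables -A BREAKGLASS -j ACCEPT
EOF
}

print_sshd_dropin
print_firewall_rules
```

Running `sshd -t` after installing a drop-in like this checks the full sshd configuration for syntax errors before a reload.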
+ +Alternatives considered and rejected: keeping the knock (fragile, circular); +Tailscale-on-pfSense (briefly chosen, then dropped — reintroduces the upstream +dependency Headscale is self-hosted to avoid, and the user preferred a +self-contained stock-ssh path); WireGuard road-warrior (needs a client, and the +self-contained SSH path was preferred). + +## Components + +| Layer | Change | Source of truth | +|---|---|---| +| sshd | dual-port `:22` (LAN, all keys) + `:52222` (WAN, break-glass key only via `Match LocalPort`, terminated by `Match all`); key-only everywhere | `scripts/sshd-10-breakglass.conf` | +| host firewall | `BREAKGLASS` chain: `:52222` rate-limited per source, LAN bypass; replaced the knock-gated default-DROP | `scripts/breakglass-firewall.sh` (+ `breakglass-firewall.service`) | +| fail2ban | jail fixed for Debian 13 (`journalmatch` by unit, not `_COMM=sshd`, else it never bans), bans on `:22`+`:52222` | `scripts/fail2ban-breakglass-sshd.local` | +| knockd | **removed** (package purged, config deleted) | — | +| edge router | `breakglass-ssh` WAN tcp/52222 → 192.168.1.127:52222; **removed** legacy Synology SSH forward (ext 3333 → .13:22) | manual (live device) | +| Vault | `breakglass_ssh_{pub,priv}key` retained; `breakglass_knock_sequence` now dead | `secret/viktor` | + +## Edge-router constraints discovered (TP-Link AX6000) + +- **No port remapping** — external port must equal internal port (rejects e.g. + `22 → 52222` as a "conflict"). All forwards are ext==int; hence `:52222` both + sides. +- **Port 22 is reserved** — `22 → 22` is also refused. Break-glass cannot use 22 + (Viktor's initial preference); `:52222` is the landed port. +- **Row delete is immediate** (no confirm dialog). + +## Security posture + +- **Brute force: impossible** (key-only, no password). +- **Scannable: yes** — deliberate, documented Wave-1 exception (`security.md`). 
+- **Residual risks:** sshd 0-day during exposure (mitigate: patch, rate-limit, + fail2ban, low MaxAuthTries); break-glass key theft (revoke by removing the + `authorized_keys.breakglass` line). Logins are audited (PVE ships sshd auth + + snoopy execve to Loki). + +## Verification (2026-06-11) + +- `:52222` reachable; break-glass key authenticates (`root@pve`). +- Non-break-glass keys **rejected** on `:52222` (Match isolation works). +- `:22` LAN admin unaffected (Match all reset confirmed — global root login intact). +- Full WAN path: `ssh -p 52222 <WAN-IP>` with the break-glass key → `root@pve`. +- knockd gone; fail2ban jail matches Debian 13 `sshd-session` lines. diff --git a/docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md b/docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md new file mode 100644 index 00000000..64a28d1c --- /dev/null +++ b/docs/plans/2026-06-21-eso-0.12-to-2.x-migration-design.md @@ -0,0 +1,243 @@ +# External Secrets Operator: 0.12.1 → 2.6.0 Migration (v1beta1 → v1) — Design Doc + +> **Status:** ✅ **COMPLETE (2026-06-22).** ESO at chart/app **2.6.0**; all 104 ExternalSecrets + 2 ClusterSecretStores on `external-secrets.io/v1`; 109 ESs SecretSynced (2 pre-existing dead); compat-gate now returns `OK: cluster is safe to upgrade to 1.35.6` (EXIT 0) — the last k8s-1.35 blocker is cleared. Executed Phase 1 (climb to 0.16.2) → Phase 2 (v1 rewrite, validated GC-survival on tandoor) → Phase 3 (climb 0.16.2→2.6.0 across the 0.17 cutoff, ES sync held at 109 every hop). Side-finding fixed: repo-wide stale `.terraform.lock.hcl` files (missing gavinbunney/kubectl + telmate/proxmox from the generated providers.tf) had broken `terragrunt apply` for ~28 stacks (this is what failed CI pipeline 332) — reconciled via `init -upgrade` + committed. +> **Scope:** Upgrade the ESO Helm chart `0.12.1` (app `v0.12.1`) to `2.6.0` (app `v2.6.0`) and migrate every `external-secrets.io/v1beta1` custom resource to `external-secrets.io/v1`. 
+> **Owner:** Viktor Barzin. **Author:** Claude (research + design only — no changes applied). +> +> **EXECUTION CORRECTION + STATUS (2026-06-21 — "let's do the ESO migration"):** The cluster is already on **k8s 1.34.9** (all 7 nodes), NOT ≤1.31 as §4.3 assumed. ESO 0.12 runs fine on 1.34 (the support-matrix bands are conservative *tested* ranges, not hard limits). **The entire ESO climb 0.12→2.6 therefore happens on k8s 1.34 — there is NO k8s interleave; IGNORE the "advance k8s to 1.32/1.33" steps in §4.3 / Phase 1 / Phase 3.** Only AFTER ESO reaches 2.x does the nightly version-check chain take k8s 1.34→1.35 (gate clears). Exact hop sequence (latest patch per minor): **0.13.0 → 0.14.4 → 0.15.1 → 0.16.2** [rewrite all 104 CRs to `v1` here] → **0.17.0 → 0.18.2 → 0.19.2 → 0.20.4 → 1.0.0 → 1.1.1 → 1.2.1 → 1.3.2 → 2.0.1 → 2.1.0 → 2.2.0 → 2.3.0 → 2.4.1 → 2.5.0 → 2.6.0**. Pre-flight done: CRD `storedVersions` are `["v1beta1"]` only (no v1alpha1 patch needed). +> +> **EXECUTION LOG:** +> - **✅ Phase 1 DONE (2026-06-21):** ESO climbed 0.12.1 → 0.13.0 → 0.14.4 → 0.15.1 → **0.16.2**, one hop at a time, each applied + verified (controller healthy; 108 live ExternalSecrets stayed SecretSynced; 2 pre-existing dead — `instagram-poster/instagram-poster-secrets` False since 2026-05-10, `payslip-ingest/payslip-ingest-secrets` False since 2026-04-25, both missing Vault data, untouched). Added `atomic=true` + `timeout=600` to the helm_release. At 0.16.2 **both `v1beta1` and `v1` are served** (110 each) and `storedVersions = ["v1beta1","v1"]`. Committed (`eso: Phase 1 …`); state auto-committed per hop by `scripts/tg`. +> - **⏳ Phase 2 PENDING — findings confirmed (decisive for execution):** (a) bumping a `kubernetes_manifest` ExternalSecret's apiVersion v1beta1→v1 **forces a REPLACE** (verified live on instagram-poster: `-/+ must be replaced`), NOT in-place. 
(b) Our ExternalSecrets use **`creationPolicy=Owner`** (default; confirmed on nextcloud) → target Secrets carry an ownerReference, so the replace's delete step can **cascade-GC the Secret** before ESO recreates it. → **Phase 2 must be done carefully, NOT a blind bulk apply:** (1) snapshot ALL target Secrets first (backstop); (2) **empirically validate on the FIRST live stack** — migrate one ES while watching its target Secret; ESO re-syncs the identical spec fast and should re-adopt before GC, but confirm before proceeding; (3) then the per-stack two-phase `-target`-then-full apply (the 15 plan-time-coupled stacks need `-target` first). If validation shows GC wins, pivot to `state rm` + `import {}` (adopts the already-v1-served object with zero delete → zero GC). Repo is clean at v1beta1 (the lone test edit was reverted, never applied). +> - **Phase 3 PENDING:** hops 0.17.0 → 0.18.2 → 0.19.2 → 0.20.4 → 1.0.0 → 1.1.1 → 1.2.1 → 1.3.2 → 2.0.1 → 2.1.0 → 2.2.0 → 2.3.0 → 2.4.1 → 2.5.0 → 2.6.0 (all on k8s 1.34, CRs already v1). Crossing **0.17 is the point of no return**. + +--- + +## 1. Goal & why + +ESO is the **last remaining compatibility gate blocking the autonomous k8s 1.35 upgrade** (Kyverno was cleared to 1.18.1 earlier today). The installed ESO `0.12.x` supports only Kubernetes **1.19 → 1.31** ([support matrix](https://external-secrets.io/latest/introduction/stability-support/)); the k8s-version-check chain will refuse to advance the cluster past 1.31 while ESO sits at 0.12. The `2.x` series supports **k8s 1.34–1.35**, which clears the gate. + +The hard part is not the chart bump itself — it is that **ESO removed the `external-secrets.io/v1beta1` API**, and every one of our ExternalSecret / ClusterSecretStore resources is currently declared `v1beta1`. If we upgrade past the removal version without first rewriting the manifests to `v1`, ESO stops reconciling and synced Secrets go stale (apps keep their last-good Secret, but rotations and new secrets break). 
+ +**Downtime tolerance:** brief, recoverable downtime of the ESO *controller* is acceptable. What must NOT happen is loss/corruption of the downstream Kubernetes `Secret` objects that apps mount (DB creds, API keys). Those must survive continuously. + +--- + +## 2. Current state + +### 2.1 Versions +| Component | Current | Target | +|---|---|---| +| Helm chart `external-secrets` | **0.12.1** | **2.6.0** | +| App / controller image | **v0.12.1** | **v2.6.0** | +| API version of all CRs | **`external-secrets.io/v1beta1`** | **`external-secrets.io/v1`** | +| Repo: `https://charts.external-secrets.io` | (unchanged) | (unchanged) | + +ESO stack: `stacks/external-secrets/main.tf`. `helm_release.external_secrets` pins `version = "0.12.1"`, namespace `external-secrets` (separate `kubernetes_namespace` resource, not `create_namespace`), and the **only** chart value set is `installCRDs = true` (via `yamlencode({ installCRDs = true })`). No webhook/replica/resource overrides. + +### 2.2 Inventory (live, from `stacks/`) +| Kind | Count | apiVersion | Where | +|---|---|---|---| +| **ExternalSecret** (`kubernetes_manifest`) | **104** | all `v1beta1` (0 mismatches) | 73 `.tf` files | +| **ClusterSecretStore** (definitions) | **2** | both `v1beta1` | `stacks/external-secrets/main.tf` | +| SecretStore | 0 | — | — | +| PushSecret | 0 | — | — | +| ClusterExternalSecret | 0 | — | — | + +- **Only ONE apiVersion string exists in the whole tree:** `external-secrets.io/v1beta1` (106 occurrences = 104 ExternalSecret + 2 ClusterSecretStore). Zero `v1`, zero `v1alpha1`. → a clean single-target rewrite. +- **`secretStoreRef` split:** 78 ExternalSecrets → `vault-kv`, 26 → `vault-database` (78 + 26 = 104). The `kind = "ClusterSecretStore"` string also appears inside every `secretStoreRef`, so a naive `grep 'kind = "ClusterSecretStore"'` returns 106 — only **2** are real store definitions. 
+- **22 files carry >1 ExternalSecret** (max: `stacks/fire-planner/main.tf` = 5; then wealthfolio / real-estate-crawler / phpipam / payslip-ingest / n8n / job-hunter / ebooks = 3 each; 13 files = 2). The 104-vs-73 gap is these multi-secret files. +- **Nested-module ExternalSecrets** (easy to miss when scripting the bump): `stacks/instagram-poster/modules/instagram-poster/main.tf`, `stacks/postiz/modules/postiz/main.tf`, `stacks/technitium/modules/technitium/main.tf`, `stacks/mailserver/modules/mailserver/main.tf`, `stacks/monitoring/modules/monitoring/grafana.tf`, `stacks/proxmox-csi/modules/proxmox-csi/main.tf`. +- **Docs are STALE:** `.claude/CLAUDE.md` says "43 ExternalSecrets + 9 DB-creds". Live count is **104 ExternalSecrets / 73 files / 26 db-refs**. Fix in the migration PR. + +### 2.3 The two ClusterSecretStores (`stacks/external-secrets/main.tf`) +Both `kubernetes_manifest`, both `external-secrets.io/v1beta1`, both `depends_on = [helm_release.external_secrets]`: +- **`vault-kv`** → Vault KV **v2** at `path = "secret"`, server `http://vault-active.vault.svc.cluster.local:8200`, auth `kubernetes` mount `kubernetes`, role `eso`, SA `external-secrets/external-secrets`. +- **`vault-database`** → identical except `path = "database"`, **`version = "v1"`** (Vault DB engine, KV-v1-style). + +ESO's Vault auth role `eso` (`stacks/vault/main.tf:486-511`): policy `eso-reader` (`secret/data/*` read+list, deny `secret/data/vault`, `database/static-creds/*` read), `token_ttl = token_period = 864000` (10d, periodic/auto-renew). + +### 2.4 Tier-0 / state +ESO is **Tier-0 (bootstrap)** (`.claude/CLAUDE.md` "Terraform State — Two-Tier Backend"; root `terragrunt.hcl` `tier0_stacks = ["infra","platform","cnpg","vault","dbaas","external-secrets"]`). Tier-0 ⇒ **local SOPS-encrypted state in git** (`state/stacks/external-secrets/terraform.tfstate`), NOT the PG backend. 
Workflow: `git pull` → `scripts/tg plan` → `scripts/tg apply` → `git push`; SOPS decrypt via Vault Transit (primary) → age fallback. **Tier-0 must apply before PG is reachable**, so the ESO upgrade cannot depend on PG. + +### 2.5 Provider versions (`stacks/external-secrets/providers.tf`) +- `required_providers` declares **only** `vault = hashicorp/vault, ~> 4.0`. +- `provider "kubernetes"` and `provider "helm"` are declared **without version constraints** (resolve from root / `.terraform.lock.hcl`). The `helm` block already uses the **v3-style nested `kubernetes = {…}` argument** (not the legacy `kubernetes {}` block) ⇒ helm provider is **v3.x or v4.x** in the lockfile. **No `kubectl` provider** in this stack. No `required_version` pinned here. +- ⚠️ **Verify the resolved helm provider version** in `.terraform.lock.hcl` before starting — the prompt referenced `~> 4.0` for helm; the *stack* only pins that for `vault`. Either way the v3-syntax helm block + an SDK-v3 provider is compatible with the chart (see §4.5). + +### 2.6 Plan-time coupling (the cross-cutting risk) +**15 stacks read ESO-created Secrets at plan time** via `data "kubernetes_secret"` (avoids a Vault dependency at plan): `actualbudget, affine, changedetection, coturn, ebooks, fire-planner, freedify, freshrss, grampsweb, k8s-dashboard (dashboard_injector.tf), navidrome, owntracks, real-estate-crawler, servarr, technitium (modules/technitium)`. + +The documented **first-apply gotcha** (`.claude/CLAUDE.md`, `docs/architecture/secrets.md:360`, `stacks/fire-planner/main.tf:574`): the Secret must exist before the `data "kubernetes_secret"` plans, so on first creation you must `terragrunt apply -target=kubernetes_manifest.<external_secret>` first, then full apply. **Why this matters for the migration:** the `kubernetes_manifest` provider treats `apiVersion` as part of resource identity, so bumping `v1beta1`→`v1` **forces a replace** of all 104 ExternalSecrets. 
During replace there is a window where the new CR (and thus the synced Secret) may not yet be materialized when the same stack's `data "kubernetes_secret"` plans → the two-phase `-target` apply is needed **fleet-wide for the v1 rewrite step, not just fire-planner.** + +### 2.7 Vault DB rotation (rotation interplay) +`stacks/vault/main.tf`: **25 `vault_database_secret_backend_static_role`, every one `rotation_period = 604800` (7 days)** (8 MySQL + 17 PostgreSQL static roles). ESO syncs these via `vault-database` → `remoteRef.key = "static-creds/<role>"`. Apps reading a rotated secret only at startup carry a Stakater Reloader annotation. **Implication:** any ESO controller downtime longer than the gap to the next rotation could leave a Secret stale across a rotation; keep controller downtime short and re-sync promptly. + +### 2.8 git-crypt landmine (adjacent, not in ESO stack) +`.claude/CLAUDE.md:146` + `docs/architecture/ci-cd.md:108` + `stacks/kyverno/modules/kyverno/tls-secret-sync.tf`: on a **git-crypt-locked clone**, `kubernetes_secret.tls_secret` reads `secrets/fullchain.pem`/`privkey.pem` via `file()` which returns **ciphertext**, corrupting the wildcard TLS secret Kyverno clones cluster-wide. **The ESO stack itself has NO `file()` reads of git-crypt secrets** — so this landmine does not bite the ESO upgrade directly. It is listed here only as a guardrail: do not piggyback unrelated kyverno applies during this work, and run all applies from an **unlocked** checkout. + +--- + +## 3. Target + +- Helm chart **`external-secrets` 2.6.0** (app **v2.6.0**), repo `https://charts.external-secrets.io`. +- All ExternalSecret + ClusterSecretStore CRs on **`external-secrets.io/v1`**. +- Cluster ESO compatible with **k8s 1.34–1.35** ⇒ unblocks the autonomous 1.35 upgrade. + +--- + +## 4. Key findings (the decisive facts) + +> Sourced from ESO official docs + GitHub release notes; verbatim quotes below. 
+ +### 4.1 Chart version == app version (premise check) +The chart version and app version are released **in lockstep and are the same number**. `Chart.yaml`: `version: 0.12.1 / appVersion: v0.12.1`; `version: 2.6.0 / appVersion: v2.6.0`. The app series ran `…0.20.4 → 1.0.0 → … → 2.0.0 → … → 2.6.0`. **Crucially, the `v1.0.0` and `v2.0.0` APP releases are NOT the `external-secrets.io/v1` API** — `v1.0.0` is just "continuation after 0.20.4" (release diff `v0.20.4...v1.0.0`, no API change), and `v2.0.0`'s only breaking change is removing the unmaintained **Alibaba + Device42** providers (we use neither — only Vault). The API migration happened back at **0.16/0.17**. Source: [v1.0.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v1.0.0) · [v2.0.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v2.0.0). + +### 4.2 Version path: **NO skipping minors — step one minor at a time** +Official policy, verbatim ([stability-support](https://external-secrets.io/latest/introduction/stability-support/)): +> "**Upgrade version by version** — We strongly recommend upgrading one minor version at a time (e.g., 0.18.x → 0.19.x → 0.20.x) rather than skipping versions." + +Maintainer (issue [#4785](https://github.com/external-secrets/external-secrets/issues/4785), @gusfcarvalho): *"We are pre release… Every minor bump should be treated as a major bump until we go 1.0."* ⇒ **You CANNOT helm-upgrade 0.12.1 → 2.6.0 directly.** You must step each minor: `0.12 → 0.13 → 0.14 → 0.15 → 0.16 → 0.17 → 0.18 → 0.19 → 0.20 → 1.x → 2.x`. 
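A small sketch of how the one-minor-at-a-time hop list could be derived from a list of published chart versions. The function names are hypothetical, and the version list in the demo is illustrative; in practice it might come from something like `helm search repo external-secrets/external-secrets --versions`, assuming the repo alias is `external-secrets`. It relies on GNU `sort -V` for version ordering.

```bash
#!/usr/bin/env bash
# Sketch: plan ESO chart hops, one per minor, using the latest patch of each.

# True when version $1 is strictly greater than version $2.
version_gt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n1)" = "$1" ]
}

# Reads versions on stdin, prints the highest patch of each major.minor,
# in ascending order.
latest_patch_per_minor() {
  sort -V | awk -F. '
    {
      key = $1 "." $2
      if (!(key in latest)) order[++n] = key
      latest[key] = $0          # input is sorted, so the last one wins
    }
    END { for (i = 1; i <= n; i++) print latest[order[i]] }'
}

# Usage: plan_hops <current> <target> < versions.txt
# Keeps versions in the range (current, target], then picks one per minor.
plan_hops() {
  local current=$1 target=$2 v
  while IFS= read -r v; do
    if version_gt "$v" "$current" && ! version_gt "$v" "$target"; then
      printf '%s\n' "$v"
    fi
  done | latest_patch_per_minor
}

# Demo with an illustrative (not authoritative) version list.
printf '%s\n' 0.12.1 0.13.0 0.14.0 0.14.4 0.15.0 0.15.1 0.16.0 0.16.2 0.17.0 |
  plan_hops 0.12.1 0.16.2
```

Filtering to the target before picking the latest patch per minor keeps the final hop pinned to the requested target even if a newer patch of that minor exists.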
+ +### 4.3 k8s ↔ ESO must advance roughly in lockstep +Each ESO release targets a **narrow** k8s band ([support matrix](https://external-secrets.io/latest/introduction/stability-support/)): + +| ESO | k8s band | +|---|---| +| 0.12.x | 1.19 → 1.31 | +| 0.16.x | 1.32 | +| 0.17.x | 1.33 | +| 2.0 – 2.5 | 1.34 – 1.35 | +| 2.6 (latest) | (matrix row not yet appended; 2.x band is consistently 1.34–1.35 — see Open Questions) | + +**This is the single most important sequencing constraint.** ESO doesn't "support only ≤ its max k8s" in a wide range — older ESO may not run cleanly on a *much newer* k8s either. The bands imply the ESO upgrade and the k8s upgrade need to be **interleaved**, not "finish ESO, then bump k8s in one jump." Practical reading: the cluster is currently on k8s ≤1.31 (ESO 0.12 blocks past it). The 0.16/0.17 steps want k8s 1.32/1.33; the 2.x steps want 1.34/1.35. So this is a **coordinated ESO+k8s climb**, e.g. ESO→0.16 alongside k8s→1.32, ESO→0.17 alongside k8s→1.33, then ESO→2.x alongside k8s→1.34→1.35. (The k8s climb is itself sequential via the version-check chain; this doc focuses on the ESO half but flags the coupling — see Open Questions for who drives the interleave.) + +### 4.4 API migration: **must rewrite manifests to `v1` FIRST — there is NO v1beta1→v1 conversion webhook** +- **`external-secrets.io/v1` promoted to STORAGE version: v0.16.0.** v0.16.0 release notes "BREAKING CHANGES": *"Promotion of ExternalSecret/v1 and SecretStore/v1 and their cluster counterparts"* and *"Removal of Conversion Webhooks and …/v1alpha1…"*. From 0.16, **etcd stores `v1`**. Source: [v0.16.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v0.16.0). +- **`external-secrets.io/v1beta1` STOPS BEING SERVED (hard cutoff): v0.17.0.** Verbatim ([v0.17.0 notes](https://github.com/external-secrets/external-secrets/releases/tag/v0.17.0)): + > "v0.17.0 Stops serving `v1beta1` apis. 
You need to update your manifests from `v1beta1` to `v1` prior to updating from `v0.16` to `v0.17`. The only change needed is upgrading your manifests to `v1` (i.e. removing the `beta1` from `v1beta1`). … Be sure to do that to all your manifests prior to bumping to `v0.17.0`! `v0.16.2` already supports `v1` so this process should be smooth." +- **No v1beta1→v1 conversion webhook.** The only conversion webhook that ever existed was v1alpha1→v1beta1, **removed in 0.16**. Maintainer (issue [#5478](https://github.com/external-secrets/external-secrets/issues/5478), @gusfcarvalho): the post-0.16 "drift" is simply that etcd now stores v1 — *"This isn't really a conversion issue."* ⇒ **old v1beta1 manifests do NOT keep working past 0.17 via any auto-conversion.** + - **Verdict: MUST-REWRITE-FIRST.** Rewrite all CRs to `v1` while on **0.16.x** (which serves *both* v1beta1 and v1), then upgrade to 0.17. Real-world confirmation (issue [#4785](https://github.com/external-secrets/external-secrets/issues/4785), @Dutchy-): *"I was able to change v1beta1 to v1 on 0.16 without issues. After that I was able to upgrade to 0.17."* + - There is a deprecated escape hatch in chart 2.6.0 — `unsafeServeV1Beta1: true` re-enables v1beta1 serving for stragglers — but its own values comment says *"This flag will be removed on 2026.05.01"* (i.e. **already past**, do not rely on it). +- **Schema change is a PURE apiVersion string bump — ZERO field changes.** CRD `openAPIV3Schema` diff (v0.16.2 bundle, which serves both): ExternalSecret / SecretStore / ClusterSecretStore / ClusterExternalSecret have **byte-identical** spec field sets between v1beta1 and v1 (`{data, dataFrom, refreshInterval, refreshPolicy, secretStoreRef, target}` for ExternalSecret). Maintainer (issue #4785, @Skarlso): *"Just change your manifests to be v1 and upgrade… We don't have anything fancy that you need to do."* PushSecret only ever had `v1alpha1` (no v1beta1) — **unaffected** (we have 0 anyway). 
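Because the change is a pure string bump, the rewrite can be scripted. The following is a minimal sketch, assuming the manifests live in `*.tf` files under a directory passed as an argument; the function names are hypothetical. It uses `sed -i.bak` followed by removing the backup so it behaves the same on GNU and BSD sed.

```bash
#!/usr/bin/env bash
# Sketch: bump the ESO apiVersion string in Terraform files under a directory.

OLD_API='external-secrets.io/v1beta1'
NEW_API='external-secrets.io/v1'

# Count remaining occurrences of the old apiVersion across *.tf files.
count_old_api() {
  grep -r --include='*.tf' -o -F "$OLD_API" "$1" | wc -l | tr -d ' '
}

# Rewrite every *.tf file that still contains the old apiVersion.
# Only the apiVersion string changes; kind references are left alone.
rewrite_api_version() {
  local dir=$1 f
  grep -r -l --include='*.tf' -F "$OLD_API" "$dir" | while IFS= read -r f; do
    sed -i.bak "s#external-secrets\.io/v1beta1#${NEW_API}#g" "$f" && rm -f "$f.bak"
  done
}

# Example (run from the repo root):
#   rewrite_api_version stacks && count_old_api stacks
```

The rewrite is idempotent: the new string does not contain the old one, so a second run changes nothing.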
+ +### 4.5 Helm chart values + CRD handling (0.12 → 2.6) +- **No top-level values removed or renamed.** `values.yaml` diff 0.12.1↔2.6.0 is **additive only** (new keys: `enableHTTP2, extraInitContainers, genericTargets, grafanaDashboard, hostAliases, hostUsers, leaderElectionID, livenessProbe, openshiftFinalizers, processClusterGenerator, processClusterPushSecret, processSecretStore, readinessProbe, strategy, systemAuthDelegator, vault`). Our single value `installCRDs = true` survives. +- **`installCRDs` still works** in 2.6.0 (defaults `true`, "install and upgrade CRDs through helm chart"). CRDs are **templated into the single `external-secrets` chart** and **upgraded by `helm upgrade`** automatically — there is **no separate CRDs subchart**, and no manual `kubectl apply` of CRDs is required by default. (Out-of-band bundle, if ever needed, lives at `deploy/crds/bundle.yaml` per release tag.) The only CRD-value change: `crds.conversion.enabled` defaults `true` in 0.12.1 (for the old v1alpha1 webhook) → `false` in 2.6.0 ("we stopped supporting v1alpha1"). We don't set it, so the new default is fine. +- **CRD storedVersions bookkeeping (the one real pre-flight check):** v0.16.0 notes warn to ensure no CRD still lists `v1alpha1` in `.status.storedVersions` before/at 0.16, with a `kubectl patch` to set it to `["v1","v1beta1"]` if needed. This is CRD metadata hygiene, NOT secret deletion. +- **Helm provider:** `Chart.yaml apiVersion: v2` (Helm 3 chart) in both 0.12.1 and 2.6.0; **no minimum Helm version declared** (only `kubeVersion: ">= 1.19.0-0"`). The Terraform helm provider on Helm SDK v3 (v3.x/v4.x) is compatible. **The 2.x chart does NOT require a newer helm provider than 0.12 did** — the v3-style helm block in `providers.tf` already satisfies it. (Still: pin/verify the resolved version in the lockfile; see Open Questions.) 
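The `storedVersions` pre-flight check described above can be sketched like this. It assumes `kubectl` access to the cluster for the reporting function, which is defined but not invoked here; the function names are hypothetical. The pure helper only inspects the string that `kubectl` prints for `.status.storedVersions`.

```bash
#!/usr/bin/env bash
# Sketch: flag ESO CRDs whose status.storedVersions still lists v1alpha1.

# Pure check on the printed storedVersions value, e.g. '["v1beta1"]'.
# Returns 0 (true) when v1alpha1 is present.
has_v1alpha1() {
  case "$1" in
    *v1alpha1*) return 0 ;;
    *) return 1 ;;
  esac
}

# Walk every external-secrets.io CRD and report which ones need attention.
# Requires kubectl and cluster access.
report_stale_crds() {
  local crd stored
  kubectl get crd -o name | grep 'external-secrets\.io$' | while IFS= read -r crd; do
    stored=$(kubectl get "$crd" -o jsonpath='{.status.storedVersions}')
    if has_v1alpha1 "$stored"; then
      echo "PATCH NEEDED: $crd $stored"
    else
      echo "ok: $crd $stored"
    fi
  done
}
```

Calling `report_stale_crds` before the 0.16 hop gives a one-line verdict per CRD; any `PATCH NEEDED` line means the `storedVersions` patch from the v0.16.0 notes applies.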
+ +### 4.6 Data migration: **downstream Secrets survive** +The synced Kubernetes `Secret` objects are **not deleted or force-resynced** by these upgrades. The change is an apiVersion bump on the *custom resources*, whose `spec` is schema-identical, so the controller keeps reconciling the same target Secrets. A controller restart triggers a normal **reconcile (re-assert, not delete)**. Caveat: no release note says verbatim "synced Secrets are preserved"; the conclusion is from (a) schema identity, (b) maintainers calling it "100% compatible" (issue #5478), (c) absence of any "secrets recreated/deleted" note. **Standard caution: snapshot/back up all ESO-created Secrets before the 0.16→0.17 step** (see §8 verification). Unrelated watch-item: v0.14.0 flagged a stateful-**generators** change — we use no generators, so N/A. + +--- + +## 5. Migration strategy (ordered, do-this-then-that) + +> **Pre-reqs every step:** run from an **unlocked** infra checkout (git-crypt unlocked); `vault login -method=oidc`; ESO is **Tier-0** so use `scripts/tg plan` / `scripts/tg apply` against `stacks/external-secrets` and **`git push`** after each apply (SOPS state). Claim presence before each apply: `~/code/scripts/presence claim stack:external-secrets --purpose "ESO 0.12→2.x migration step N"`. Wait for the controller `Deployment` to roll out healthy before the next hop. + +### Phase 0 — Pre-flight (no changes) +1. Confirm cluster k8s version and the version-check chain's current target; **coordinate with the k8s climb** (see §4.3 / Open Questions). Decide who drives the interleave. +2. `kubectl get crd | grep external-secrets.io` and for each: `kubectl get crd <name> -o jsonpath='{.status.storedVersions}'` — confirm none still list `v1alpha1`. If any do, plan the `kubectl patch …/status storedVersions=["v1beta1"]` per the v0.16.0 note (do this *before* reaching 0.16). +3. 
**Snapshot all ESO-managed Secrets** (rollback safety net): + `kubectl get externalsecrets -A` (record the 104) and `for ns/secret in <targets>: kubectl get secret -n <ns> <name> -o yaml > backup/<ns>-<name>.yaml`. Keep outside git-crypt or encrypt. +4. Inspect `.terraform.lock.hcl` in `stacks/external-secrets` — record resolved `helm` + `kubernetes` provider versions. If helm provider < what 2.6.0 needs (it doesn't appear to need anything beyond SDK v3), bump the constraint as its own committed change first. +5. Read `docs/architecture/secrets.md` + the fire-planner first-apply comment to re-confirm the `-target` pattern for the v1 rewrite step. + +### Phase 1 — Climb to 0.16.x (chart bump only, NO manifest change yet) +ESO `0.16.x` is the **transition version** that serves *both* v1beta1 and v1. Climb to it one minor at a time, leaving all CRs as `v1beta1`: +6. For `v` in `0.13.0, 0.14.0, 0.15.x, 0.16.2` (use latest patch of each minor): set `helm_release.external_secrets.version = "<v>"`, `scripts/tg plan` (expect: chart upgrade + CRD upgrade in place; **no `kubernetes_manifest` replacements** — apiVersion unchanged), `scripts/tg apply`, `git push`, wait for rollout, verify `kubectl get externalsecrets -A` all `SecretSynced=True`. + - **Interleave k8s as required:** before/at 0.16 the cluster should be on **k8s 1.32** (0.16 band). Advance k8s via the normal version-check chain to 1.32 around this point. + - Watch the **0.14.0** notes (generators) — N/A for us, but eyeball the plan diff anyway. +7. **Land on 0.16.2 and STOP.** Verify both APIs are served: `kubectl get externalsecrets.v1.external-secrets.io -A` and `kubectl get externalsecrets.v1beta1.external-secrets.io -A` both work. + +### Phase 2 — Rewrite all 104 CRs + 2 stores to `v1` (while on 0.16.2) +This is the MUST-DO-FIRST API migration, done in the safe window where both versions are served. +8. 
**Mechanical rewrite** across `stacks/`: replace the apiVersion string `external-secrets.io/v1beta1` → `external-secrets.io/v1` in every ExternalSecret and ClusterSecretStore `kubernetes_manifest` (104 + 2 = 106 occurrences across 73 files, **including the 6 nested-module files** in §2.2). **No other field changes** (schema identical). Do this in a worktree, committed file-by-file. + - Leave `secretStoreRef.kind = "ClusterSecretStore"` (that's a kind reference, not an apiVersion — unaffected). +9. **Two-phase apply because `kubernetes_manifest` replace + plan-time `data "kubernetes_secret"`:** + a. **Stores first:** `scripts/tg apply -target='kubernetes_manifest.css_vault_kv' -target='kubernetes_manifest.css_vault_db'` in `stacks/external-secrets` (they get replaced to v1; ESO still serves v1beta1 too, so in-flight ExternalSecrets keep syncing). `git push`. + b. **ExternalSecrets, per stack:** for each stack containing ExternalSecrets (73 `.tf` files in total), `scripts/tg apply -target=kubernetes_manifest.<external_secret_name>` FIRST (materializes the replaced v1 CR + its Secret), THEN a full `scripts/tg apply` for that stack (lets the 15 plan-time `data "kubernetes_secret"` reads resolve against the now-existing Secret). The **15 plan-time-coupled stacks** (§2.6) absolutely need the `-target` first; the rest are lower-risk but follow the same pattern for safety. `git push` per stack (Tier-1 stacks use PG state; ESO stack is Tier-0). + - Because the spec is identical, the *replace* re-creates an identical CR; ESO reconciles and re-asserts the same target Secret (no value change) → apps keep their Secret throughout. +10. **Verify the rewrite fully landed:** `grep -rl 'external-secrets.io/v1beta1' stacks/` lists **no files** (`grep -rc` prints a per-file count, not a single total); a `kubectl get externalsecrets -A -o jsonpath` query confirms all are served as `v1`; all `SecretSynced=True`; spot-check a rotated DB cred (e.g. `nextcloud-db-creds`) still valid. 
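The per-stack ordering in step 9b can be sketched as a dry-run helper that prints the two commands in the order they should run. This is an illustration only: the function name and the resource names in the example are hypothetical, and the `../../scripts/tg` path assumes the helper is pointed at a directory two levels below the repo root, as in `stacks/<name>`.

```bash
#!/usr/bin/env bash
# Sketch: print the two-phase apply order for one stack (dry run, no applies).
# Arguments: stack directory, then one or more kubernetes_manifest resource
# names (the ExternalSecrets declared in that stack).
two_phase_plan() {
  local stack=$1
  shift
  local targets="" name
  for name in "$@"; do
    targets="$targets -target=kubernetes_manifest.$name"
  done
  # Phase A: replace only the ExternalSecrets so their Secrets exist again.
  echo "(cd $stack && ../../scripts/tg apply$targets)"
  # Phase B: full apply; plan-time data "kubernetes_secret" reads now resolve.
  echo "(cd $stack && ../../scripts/tg apply)"
}

# Example with hypothetical resource names:
two_phase_plan stacks/fire-planner db_creds api_keys
```

Printing rather than executing keeps the sketch safe to run anywhere; the output can be reviewed and pasted once the target names for a stack are confirmed.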
+ +### Phase 3 — Cross the 0.17 cutoff, then climb to 2.6.0 +Only after Phase 2 is 100% applied (zero v1beta1 in repo AND in etcd): +11. Bump chart `0.16.2 → 0.17.x`. `scripts/tg plan` (expect chart/CRD upgrade; **no manifest replacements** — already v1), apply, push, rollout, verify all synced. **k8s should be 1.33** (0.17 band) around here. +12. Continue one minor at a time: `0.18.x → 0.19.x → 0.20.x → 1.0.0 → 1.x (latest) → 2.0.0 → … → 2.6.0`. At each: bump `version`, plan, apply, push, rollout, verify synced. **k8s reaches 1.34 then 1.35** across the 2.x steps. + - **At 2.0.0:** confirm the plan shows nothing odd from the Alibaba/Device42 provider removal (we use only Vault — should be a no-op). +13. **Land on 2.6.0.** Verify: controller image `v2.6.0`, all 104 ExternalSecrets `SecretSynced=True`, both ClusterSecretStores `Valid=True`. + +### Phase 4 — Close the gate + docs +14. Advance k8s to **1.35** via the version-check chain if not already; confirm the **compat-gate now lists ESO as compatible** and 1.35 is unblocked. +15. Update `.claude/CLAUDE.md` Secrets Management section: correct counts (**104 ExternalSecrets / 73 files / 26 db-refs**), apiVersion now `v1`. Update `docs/architecture/secrets.md`. Commit as part of the work (audit trail). + +--- + +## 6. Risks & mitigations + +| Risk | Likelihood | Mitigation | +|---|---|---| +| **Secret-sync outage → app DB/API auth failures** during controller restarts or the replace window | Med | Spec is identical so re-sync re-asserts the same value; keep each controller restart short; do Phase-2 replaces **per stack** (small blast radius); the 15 plan-time stacks use `-target` first so the Secret exists before dependents plan. Pre-step Secret snapshot (Phase 0.3) for instant restore. 
| +| **Crossing 0.17 with any CR still v1beta1** → ESO stops reconciling those, secrets go stale | High if rushed | Phase 2 gate: `grep -rc v1beta1 stacks/` **must be 0** AND `kubectl get …v1beta1…` returns nothing live before Phase 3. Do not skip 0.16. | +| **CRD removal/replace by helm dropping data** | Low | Chart manages CRDs in-place via `installCRDs=true` (upgrade, not delete-recreate); CRs are the data and they're untouched by a CRD *upgrade*. Snapshot anyway. Never `helm uninstall` (that can GC CRDs). | +| **No conversion webhook safety net** (must-rewrite-first) | Certain (by design) | Whole strategy is built on rewriting at 0.16. The deprecated `unsafeServeV1Beta1` is already past its 2026-05-01 removal — do NOT rely on it. | +| **`kubernetes_manifest` forces replace on apiVersion bump** → transient gap + plan-time read failures | High | Two-phase `-target` apply fleet-wide (Phase 2.9); identical spec ⇒ replacement CR is equivalent. | +| **Vault 7-day DB rotation lands mid-migration** → a Secret stale across rotation if controller down | Med | Keep controller downtime < rotation gap; re-sync immediately after each hop; Reloader annotations already re-roll pods on Secret change; if a rotation is imminent, sequence the affected db stacks last and verify those creds explicitly. | +| **git-crypt tls-secret-sync landmine** | Low (not in ESO stack) | ESO stack has no `file()` git-crypt reads; run from an **unlocked** checkout; do **not** piggyback kyverno applies during this work. | +| **helm/k8s provider in lockfile too old for 2.x chart** | Low | Phase 0.4 verify; bump constraint as a separate committed change if needed (chart needs only Helm SDK v3, already satisfied). | +| **k8s/ESO band mismatch** (e.g. ESO 0.12 on k8s 1.33) | Med | Interleave the climbs per §4.3; don't jump k8s far ahead of ESO or vice-versa. 
| +| **Many small applies = long, error-prone session** | Med | Script the per-stack `-target`-then-full loop; checkpoint with `kubectl get externalsecrets -A` after each; the rewrite itself is a single `sed`-class change so low semantic risk. | + +--- + +## 7. Rollback plan (per hop) + +- **During Phase 1 (chart climb, still v1beta1):** revert `version` to the previous minor in `stacks/external-secrets/main.tf`, `scripts/tg apply`, `git push`. Helm rolls the controller back; CRs unchanged. Clean. +- **During Phase 2 (v1 rewrite, on 0.16.2):** 0.16.2 serves both APIs, so you can `git revert` the apiVersion-bump commits and re-apply — the CRs flip back to v1beta1 cleanly (both served). Secrets unaffected (identical spec). This is the **last point of easy rollback**. +- **After Phase 3 (≥0.17, v1beta1 no longer served):** **rollback is HARD** — once etcd stores v1-only and the controller is ≥0.17, downgrading cannot re-serve v1beta1 and v1 objects can't be auto-converted back ([general guidance + maintainer position](https://github.com/external-secrets/external-secrets/issues/5478)). Treat **crossing 0.17 as the point of no return.** If you must recover: re-install 0.16.2 (serves both), restore CRs from the Phase-0 manifest snapshot, and restore Secrets from the Secret snapshot. This is a disaster-recovery path, not a routine rollback — hence the Phase-2 gate must be airtight. +- **Always available:** the Phase-0.3 Secret backups let you `kubectl apply` the last-good Secret to keep an app authenticating while you fix ESO. + +--- + +## 8. Verification + +**Per hop:** +- `kubectl -n external-secrets get deploy,po` healthy; controller image tag == target. +- `kubectl get externalsecrets -A` → all 104 `STATUS=SecretSynced` / `READY=True`. +- `kubectl get clustersecretstores` → `vault-kv` + `vault-database` `Valid=True`. + +**After Phase 2 (v1 rewrite):** +- `grep -rl 'external-secrets.io/v1beta1' stacks/ | wc -l` → **0** (`grep -rc` would print a `path:count` line per file, not one total). 
+- `kubectl get externalsecrets.v1beta1.external-secrets.io -A` → still served on 0.16 (sanity), but `kubectl get externalsecrets.v1.external-secrets.io -A` is the real check. +- Spot-check a rotated DB cred end-to-end: e.g. `nextcloud-db-creds` value matches `vault read database/static-creds/mysql-nextcloud` and the app authenticates. + +**Final (2.6.0):** +- Controller image `v2.6.0`; all ExternalSecrets synced; both stores valid. +- Diff a sample of the 104 target Secrets against the Phase-0 backups → values unchanged (continuity proof). +- App health: spot-check 3–4 high-value consumers (nextcloud, immich, grafana, a `vault-database` consumer) — pods running, no auth errors in logs. +- **Compat-gate:** run the upgrade-state / k8s-version-check audit — ESO no longer flagged as a 1.35 blocker; k8s 1.35 upgrade proceeds. + +--- + +## 9. Open questions + +1. **k8s/ESO interleave ownership.** §4.3 shows narrow per-version k8s bands (0.16→1.32, 0.17→1.33, 2.x→1.34-1.35). The cluster is currently ≤1.31. **Who drives the interleave** — does this migration also advance k8s step-by-step, or does the autonomous version-check chain advance k8s and we time ESO hops to it? Need the exact current k8s version and the chain's behavior when ESO is the only gate. (Decisive for sequencing Phases 1/3.) +2. **2.6.0 ↔ k8s 1.35 explicit support.** The support matrix table currently ends at **2.5** (k8s 1.34-1.35). 2.6.0 exists on GitHub but the matrix row isn't appended yet; the whole 2.x band is consistently 1.34-1.35, so 2.6 on 1.35 is a *strong inference* not a quoted row. Confirm via `Chart.yaml` `kubeVersion` of 2.6.0 or a 2.6 release note before relying on it. ([matrix](https://external-secrets.io/latest/introduction/stability-support/)) +3. **Resolved helm provider version.** The stack only pins `vault ~> 4.0`; helm/k8s are unpinned (lockfile-resolved). Confirm the lockfile version and whether to pin it explicitly as part of this work. 
(Chart needs only Helm SDK v3 — likely a no-op, but verify.) +4. **Intermediate-minor patch selection.** Use latest patch of each minor (0.13.x, 0.14.x, 0.15.x). Confirm 0.16.**2** specifically (the note says 0.16.2 already supports v1) vs a later 0.16 patch. +5. **Per-stack apply automation.** 73 stacks × (target + full) apply is large. Acceptable to script a loop, or prefer manual per-stack with checkpoints? Some stacks have other in-flight drift that a full apply would also push — needs a clean-plan check per stack first. +6. **Stateful generators / advanced features.** Confirmed we use none (0 SecretStore/PushSecret/ClusterExternalSecret/generators), so the v0.14 generator and v2.0 provider-removal breaking changes are N/A — but re-confirm no generator usage crept in before Phase 3. + +--- + +## 10. Sources (decisive facts) + +- Skip-version policy + k8s support matrix: <https://external-secrets.io/latest/introduction/stability-support/> +- `v1` promoted to storage version (0.16.0): <https://github.com/external-secrets/external-secrets/releases/tag/v0.16.0> +- `v1beta1` removed / "rewrite manifests to v1 first" (0.17.0): <https://github.com/external-secrets/external-secrets/releases/tag/v0.17.0> +- No conversion webhook / "not a conversion issue" (#5478): <https://github.com/external-secrets/external-secrets/issues/5478> +- v1beta1↔v1 schema identical / "nothing fancy" (#4785): <https://github.com/external-secrets/external-secrets/issues/4785> +- App v1.0.0 ≠ API v1: <https://github.com/external-secrets/external-secrets/releases/tag/v1.0.0> +- v2.0.0 only removes Alibaba/Device42: <https://github.com/external-secrets/external-secrets/releases/tag/v2.0.0> +- Chart 2.6.0 on ArtifactHub: <https://artifacthub.io/packages/helm/external-secrets-operator/external-secrets> diff --git a/docs/plans/2026-06-21-t3-idle-migrate-design.md b/docs/plans/2026-06-21-t3-idle-migrate-design.md new file mode 100644 index 00000000..46c43bfa --- /dev/null +++ 
b/docs/plans/2026-06-21-t3-idle-migrate-design.md @@ -0,0 +1,140 @@ +# t3 idle-migrate — graceful overnight restart of deferred t3-serve instances — design + +- **Date:** 2026-06-21 +- **Status:** implemented 2026-06-21 (branch `wizard/t3-idle-migrate`; deployed + timer enabled on devvm, first overnight drain pending) +- **Owner:** Viktor (wizard) +- **Builds on:** the gated nightly tracker `t3-autoupdate` (re-enabled 2026-06-16, `scripts/t3-autoupdate.{sh,service,timer}`; design history in `docs/runbooks/t3-version-bump.md` + post-mortem `2026-06-09-t3-nightly-autoupdate-auth-outage.md`) and the per-user `t3-serve@<user>` systemd instances (`scripts/t3-serve@.service`). + +## Goal + +When `t3-autoupdate` **defers** a user's `t3-serve` restart because that user has an active agent at the daily 04:00–05:00 window, the user's running server keeps executing its start-time t3 version indefinitely — their client (which tracks the freshly-installed global binary) then shows *"Client and server versions differ."* For a user who is busy at every daily window (wizard: long-lived/AFK sessions overnight), the deferral never resolves and the skew persists for days. + +Add a **small, idle-gated overnight job that drains those deferrals**: restart a deferred `t3-serve@<user>` onto the current binary **only when nothing is actively working** in that instance, so the migration happens during a genuine quiet gap rather than killing in-flight agent turns. + +## Background — why the skew persists (root cause, verified 2026-06-21) + +- All `t3-serve@<user>` instances share ONE global `/usr/bin/t3` (→ `/usr/lib/node_modules/t3`). `t3-autoupdate` installs a new nightly to that single binary, health-gates it against a **copy** of wizard's populated `state.sqlite`, then **canary-restarts idle instances one at a time**, verifying pairing after each (`scripts/t3-autoupdate.sh` step 6). 
+- Its idle check is coarse — `unit_busy()`: + ```sh + pid=$(systemctl show -p MainPID --value "$unit") + pgrep -aP "$pid" | grep -qiE 'claude|codex|opencode' + ``` + i.e. "does the server have any `claude`/`codex`/`opencode` **child**?" But `t3 serve` keeps one such child alive per **open** session, even one idle awaiting input. Live snapshot 2026-06-21: wizard had **5 `running` provider sessions** (= 5 `claude` children) but only **3 mid-turn**, plus **89 `ready` (open-idle)** threads. So `unit_busy` is true whenever any tab is open → wizard is deferred at every window. +- The job runs **once daily** (`OnCalendar=*-*-* 04:00:00`, `RandomizedDelaySec=1h`, `Persistent` deliberately omitted) and **only acts on a version bump** (exits early if `installed == target`). So once the binary is already current, nothing re-triggers a restart of a still-stale running server until the *next* new nightly — and only if the user happens to be idle then. +- Confirmed in the logs: `t3-autoupdate: deferring t3-serve@wizard.service (active agent) — migrates on its next idle restart` on **both** Jun 20 and Jun 21 windows; wizard's server has been up since Jun 20 06:17 on `…20260620.605` while the binary + client are on `…20260621.613`. + +## Decisions (from brainstorm 2026-06-21) + +1. **"Safe to restart" = no turn in flight AND a quiet buffer.** Not "zero open sessions" (that would essentially never fire for wizard). Open-but-idle tabs are acceptable to drop — t3 persists thread history in `state.sqlite` and the client reconnects/resumes (the daily job already restarts idle instances routinely; restart→resume is the exercised path). To verify during implementation: the user-facing resume after a server restart. +2. **Cadence: overnight window only.** Frequent checks within a fixed overnight window; never disconnects tabs during the working day. Migrates within ~1 night of a build landing. +3. 
**Scope: all `t3-serve@<user>`, self-limiting.** The job restarts only an instance that actually *owes* a migration (a deferral marker exists). Users already migrated at the daily window have no marker → no-op. No hardcoded per-user logic. +4. **Approach C: extract a shared safe-restart helper, reuse from both jobs.** One audited copy of the dangerous backup→restart→verify→recover logic; the new job adds only *scheduling + gating*. + +## Constraints (load-bearing) + +1. **The binary is global; migrations are forward-only and per-user-DB.** You cannot keep one user on the old version while others run the new one. A real-user forward-migration failure therefore means the build is unsafe for a real user → the only consistent recovery is the daily job's existing one (restore that user's DB + roll the **global** binary back to last-good + freeze + alert). This is a rare tail (the build was already migration-gated against a copy of wizard's real DB at install time), but the idle path must not invent a weaker recovery. +2. **Per-user secret boundary.** A user's `~/.t3/userdata/state.sqlite` is mode 600 and may not be read as another user. The job runs as root (system service) but reads each user's DB **as that user** via `runuser -u <user> -- sqlite3 …` (the pattern `backup_all` already uses), read-only (`mode=ro`) so it never locks the live WAL. +3. **Fail closed.** Any uncertainty about whether an instance is safe to restart (DB locked/busy/unreadable, query error, unparseable timestamp) → treat as *not safe*, skip this tick, retry in 20 min. Never restart on doubt. +4. **Do not change the daily job's gated-install behavior.** The step-6 extraction must be behavior-preserving; health-gate, canary, downgrade-guard, freeze, and rollback stay exactly as today. +5. **Infra-as-code via the devvm installer.** Sources live in `scripts/`; deployment is `scripts/workstation/setup-devvm.sh` (the devvm is hand-managed VM 102 — no Terraform apply). 
Shared-devvm deploy takes a presence claim. + +## Design + +### Components + +Four new files in `scripts/` + a one-line addition to the existing job: + +1. **`scripts/t3-safe-restart.sh`** — shared library, sourced (not executed). Holds the per-unit "dangerous" routine extracted from `t3-autoupdate.sh` step 6 as `safe_restart_unit <unit> <target>`: + pre-restart `VACUUM INTO` backup (as the owner) → `systemctl restart` → poll `verify_pairing` (15×2s ≈ 30s) → on failure: restore that user's DB from the pre-restart backup, `rollback_binary` to last-good, `touch $FREEZE_FILE`, log+alert. The shared helpers it needs (`LOG`, `ver`, `osusers`, `ak_for`, `verify_pairing`, `prebump_of`, `rollback_binary`, `DISPATCH`/`BACKUP_DIR`/… config) move into the lib too. Installed to `/usr/local/lib/t3-safe-restart.sh`. + **Contract:** returns `0` on verified success, **non-zero** after performing recovery+freeze on failure. This is the one non-verbatim change to step-6 logic — today it `exit 1`s inline; the extracted function `return`s instead so the *caller* decides (the daily job `exit 1`s on non-zero exactly as today; the idle job `break`s). Behavior is otherwise identical. + +2. **`scripts/t3-migrate-idle.sh`** — the new job (scheduling + gating only). Installed to `/usr/local/bin/t3-migrate-idle`. Sources the lib; per tick, drains the deferral directory (control flow below). + +3. **`scripts/t3-migrate-idle.service`** — `Type=oneshot`, `ExecStart=/usr/local/bin/t3-migrate-idle`. (No `EnvironmentFile` needed; env-overridable knobs have defaults.) + +4. **`scripts/t3-migrate-idle.timer`** — overnight window, frequent checks: + ```ini + [Timer] + # Fires 01:00, 01:20, …, 05:40; none at/after 06:00. System TZ (UTC); tune the window. + # systemd unit files do not support trailing comments, so comments sit on their own lines. + OnCalendar=*-*-* 01..05:00/20 + # Never replay a missed migrate-restart at an unpredictable time. + Persistent=false + RandomizedDelaySec=120 + ``` + +5. 
**One-line edit to `t3-autoupdate.sh`** — in the existing defer branch, *also record* the deferral: + ```sh + LOG "deferring $unit (active agent) — migrates on its next idle restart" + mkdir -p "$DEFER_DIR" 2>/dev/null; printf '%s\n' "$target" > "$DEFER_DIR/$u" # NEW + deferred=$((deferred+1)); continue + ``` + where `DEFER_DIR=/var/lib/t3-autoupdate/deferred`. This is the *only* behavioral change to the scarred script beyond the verbatim step-6 extraction. + +### Why a deferral marker (not version-introspection) + +The marker makes "which instances owe a restart" **exact** and decouples it from the binary-is-current problem — the daily job already *knows* it deferred wizard, so it records that fact. The idle job drains the directory; the version string in the marker is informational (a restart always picks up whatever binary is current). The marker is removed only after the restart's pairing is verified. + +### Control flow of `t3-migrate-idle` (per tick) + +``` +for marker in $DEFER_DIR/*: # nothing deferred → no-op + user = basename(marker); unit = t3-serve@<user>.service + [ unit is an active running service ] or { rm marker; continue } # gone + if unit ActiveEnterTimestamp > mtime(marker): rm marker; continue # already restarted (manual/other) → just clear + if not safe_to_restart(user): continue # mid-turn or not quiet → try next tick + target = contents(marker) + if safe_restart_unit(unit, target): rm marker # success: verified on new binary + else: # helper already restored DB + rolled back binary + froze + alerted + break # frozen: stop draining; a human investigates +``` + +### `safe_to_restart(user)` — the gate + +Single read-only query, run as the user: + +```sh +runuser -u "$user" -- sqlite3 "file:/home/$user/.t3/userdata/state.sqlite?mode=ro" " + SELECT + (SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL), + CAST((julianday('now') + - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT) + FROM 
projection_thread_sessions;" +``` + +- Column 1 = **active turns**; must be `0`. (`active_turn_id` is set exactly while a turn runs — verified 2026-06-21.) +- Column 2 = **idle seconds** = now − most-recent thread activity. Must be `≥ QUIET_SECONDS` (default **900** = 15 min, env-overridable). `updated_at` is ISO-8601 `…Z`; `datetime('now')`/`julianday('now')` are UTC, so normalizing `T`/`Z` away before `julianday()` keeps the arithmetic correct without depending on a newer SQLite's `Z` parsing. +- **NULL idle** (no threads at all) ⇒ safe. **Any error / non-numeric / nonzero exit** ⇒ not safe (constraint 3). + +### Failure recovery + +Delegated entirely to `safe_restart_unit` (the extracted, already-proven path): restore the user's DB from the pre-restart backup, roll the global binary back to last-good, `touch /etc/t3-autoupdate.freeze`, log+alert. The idle job then stops draining (the freeze halts both jobs until a human clears it) — see constraint 1 for why per-user divergence isn't an option. + +### Observability + +- Structured `logger -t t3-migrate-idle` lines; extend the existing `T3AutoUpdate*` Loki ruler/alerts to also match this tag. Success → one line: `migrated t3-serve@wizard → <target> (idle restart; idle 47m)`. Failure → reuses the daily job's freeze+alert. +- **Recommended (optional):** a Pushgateway gauge for **deferral-marker age** + an alert if a marker survives **> 3 days** — passive visibility into "busy every night for 3 days," *not* the auto-escalation/daytime-widening that was explicitly de-scoped. 
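The optional marker-age gauge could be derived from the marker file's mtime. The sketch below assumes GNU `stat` and a hypothetical metric name, `t3_deferral_marker_age_seconds`; the function name `marker_age_metric` is also illustrative and is not defined anywhere in the existing scripts. It only prints one line in the Prometheus text exposition format and leaves the push to the Pushgateway to the caller.

```bash
#!/usr/bin/env bash
# Sketch only. ASSUMPTIONS: GNU stat (-c %Y), a hypothetical metric name, and a
# marker file named after the OS user (as in $DEFER_DIR/<user>).

# marker_age_metric <marker-file>
# Prints: t3_deferral_marker_age_seconds{user="<user>"} <age-in-seconds>
# Returns non-zero if the marker does not exist or cannot be stat'ed.
marker_age_metric() {
  local marker="$1" now mtime user
  [ -f "$marker" ] || return 1
  now="$(date +%s)"
  mtime="$(stat -c %Y "$marker")" || return 1
  user="$(basename "$marker")"
  printf 't3_deferral_marker_age_seconds{user="%s"} %d\n' "$user" "$((now - mtime))"
}
```

An alert on "marker survives > 3 days" would then compare the gauge against `259200` seconds. Because the marker is rewritten on each new deferral, its mtime tracks the most recent deferral rather than the first one, which may or may not be the intended semantics for the alert.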
+ +### Delivery + +- Wire into `scripts/workstation/setup-devvm.sh` alongside the existing units: + - `install -m 0644 "$SCRIPTS/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh` + - `install -m 0755 "$SCRIPTS/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle` + - add `t3-migrate-idle.service t3-migrate-idle.timer` to the unit-install loop (→ `/etc/systemd/system/`) + - add `t3-migrate-idle.timer` to the `systemctl enable --now` list +- `homelab claim host:devvm --purpose "deploy t3-migrate-idle units"` before the install + enable on the shared devvm. +- No Terraform (hand-managed VM 102). + +## Testing + +- **TDD on the gating core (`bats`)** against fixture `state.sqlite` files: active turn → unsafe; idle-but-recent (< QUIET) → unsafe; idle + quiet → safe; empty DB → safe; locked/garbage DB / sqlite error → unsafe (fail-closed); marker drain: unit started after marker → clear+skip, before → eligible. +- **`T3_DRY_RUN=1`** mode logs `would migrate <unit> → <target>` without acting. Roll out in dry-run first; confirm it flags wizard's server at a real overnight idle moment; then enable live. +- **Step-6 extraction is behavior-preserving** — validate the daily job's decisions are unchanged via a dry-run diff before/after the refactor. + +## Out of scope (YAGNI) + +- Daytime restarts / "around the clock" cadence (de-scoped: overnight only). +- Auto-escalation that widens to a daytime attempt after N stale nights (de-scoped; the optional marker-age alert covers visibility). +- Per-user opt-out file (not needed — the job is self-limiting via markers). +- Any change to how `t3-autoupdate` *installs/gates* a build. + +## Open questions + +None outstanding from the brainstorm. Two items to **verify during implementation** (not blockers): (a) user-facing session resume after a `t3-serve` restart; (b) the devvm's `sqlite3` parses the normalized timestamp as expected (the `replace()` normalization is the safeguard). 
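The gate cases listed under Testing can be sketched as a pure predicate over the query's output, separate from the `sqlite3` call itself. This is an illustration under assumptions, not the plan's implementation: the name `gate_decision` is hypothetical, and the input is assumed to be the single `active|idle` row that the `sqlite3` CLI prints in its default list mode, with a NULL idle value rendered as an empty field.

```bash
#!/usr/bin/env bash
# Sketch only. ASSUMPTIONS: input is one "<active_turns>|<idle_seconds>" row from
# the sqlite3 CLI's default list mode; gate_decision is a hypothetical name.

# gate_decision <row> [quiet-seconds]
# Returns 0 (safe to restart) only when active turns == 0 AND
# (idle is empty/NULL OR idle >= quiet). Anything unparseable is unsafe.
gate_decision() {
  local row="$1" quiet="${2:-900}" active idle
  case "$row" in *'|'*) ;; *) return 1 ;; esac          # no separator: error text or empty
  active="${row%%|*}"
  idle="${row#*|}"
  case "$active" in ''|*[!0-9]*) return 1 ;; esac       # non-numeric count: fail closed
  [ "$active" -eq 0 ] || return 1                       # a turn is in flight
  [ -z "$idle" ] && return 0                            # NULL idle: no threads at all
  case "$idle" in *[!0-9]*) return 1 ;; esac            # negative or garbage: fail closed
  [ "$idle" -ge "$quiet" ]
}
```

Keeping the decision free of I/O means the fail-closed behavior can be exercised with plain strings, while the fixture `state.sqlite` files cover the query side.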
diff --git a/docs/plans/2026-06-21-t3-idle-migrate-plan.md b/docs/plans/2026-06-21-t3-idle-migrate-plan.md new file mode 100644 index 00000000..ed75e234 --- /dev/null +++ b/docs/plans/2026-06-21-t3-idle-migrate-plan.md @@ -0,0 +1,729 @@ +# t3 idle-migrate Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Add a small idle-gated overnight job that restarts a `t3-serve@<user>` deferred by the daily autoupdate, so a chronically-busy user's server migrates onto the current t3 binary during a real quiet gap instead of staying version-skewed for days. + +**Architecture:** Extract the daily job's per-unit "dangerous" restart routine (backup→restart→verify→recover) into a sourced shared library `t3-safe-restart.sh`; the daily `t3-autoupdate` and a new `t3-migrate-idle` job both call it. The daily job records each deferral as a marker file; the new job drains markers overnight, restarting only when `state.sqlite` shows no in-flight turn and a quiet buffer has elapsed. Self-limiting (only acts on a recorded deferral), fail-closed. + +**Tech Stack:** bash, systemd timers, sqlite3 (reading t3's `state.sqlite`), the existing `t3-autoupdate` machinery. Deployed via `scripts/workstation/setup-devvm.sh` on the hand-managed devvm (no Terraform). + +**Design:** `docs/plans/2026-06-21-t3-idle-migrate-design.md`. + +--- + +## File structure + +- **Create `scripts/t3-safe-restart.sh`** — sourced library: shared config defaults, `LOG`/`ver`/`osusers`/`ak_for`/`verify_pairing`/`backup_user`/`prebump_of`/`rollback_binary`, and `safe_restart_unit`. One responsibility: the audited per-unit safe restart + its recovery. +- **Modify `scripts/t3-autoupdate.sh`** — source the lib; replace the inline helpers + step-6 body with calls into it; write/clear the deferral marker. 
Behavior unchanged. +- **Create `scripts/t3-migrate-idle.sh`** — the new job: the idle gate (`gate_query`/`gate_is_safe`/`safe_to_restart`) + the marker-drain loop. Main logic behind a `main`-guard so it's source-safe for tests. +- **Create `scripts/t3-migrate-idle.service`** + **`scripts/t3-migrate-idle.timer`** — oneshot + overnight timer. +- **Create `tests/t3-migrate-idle-gate.test.sh`** — pure-bash TDD for the gate predicates against fixture SQLite DBs (no root, no bats). +- **Modify `scripts/workstation/setup-devvm.sh`** — install + enable the new files. +- **Modify `docs/runbooks/t3-version-bump.md`** + **`.claude/reference/service-catalog.md`** — document the new job. + +**Recovery semantics note (load-bearing):** `safe_restart_unit` is reused verbatim. In the *daily* path a canary failure happens when `last_good < target`, so its `rollback_binary` genuinely reverts the global binary (correct — a bad build is bad for everyone). In the *idle* path `last_good == installed == target` (the build was already accepted), so `rollback_binary` is a **harmless no-op reinstall** — recovery reduces to "restore the failing user's DB + freeze + alert" and does NOT downgrade other users. Known rare-tail limitation: if that user's forward migration genuinely fails at idle time (already gated against a copy of their real DB at install), their server may crashloop on the restored DB until a human acts on the freeze+alert. Documented, not hidden. + +--- + +## Task 1: Shared library `t3-safe-restart.sh` + +**Files:** +- Create: `scripts/t3-safe-restart.sh` + +- [ ] **Step 1: Create the library** + +```bash +#!/usr/bin/env bash +# t3-safe-restart.sh — SOURCED library (not executed). Shared by t3-autoupdate.sh +# (daily gated tracker) and t3-migrate-idle.sh (overnight deferral drainer). 
+# +# Holds the per-unit "dangerous" routine — backup -> restart -> verify pairing -> +# recover (restore DB + roll global binary back to last-good + freeze) — extracted +# verbatim from t3-autoupdate.sh step 6, plus the small helpers it depends on. +# The only change from the inline original: safe_restart_unit RETURNS non-zero on +# failure (after performing recovery+freeze) instead of `exit 1`, so the CALLER +# decides what to do (the daily job exits; the idle job stops draining). +# +# Callers must set, before calling safe_restart_unit: $target (version being moved +# TO, for log lines + the prebump filename) and $last_good (rollback target). +# Set $LOG_TAG before sourcing to tag syslog ("t3-autoupdate" / "t3-migrate-idle"). + +# ---- shared config defaults (override via env before sourcing) ------------------ +: "${LOG_TAG:=t3-safe-restart}" +: "${FREEZE_FILE:=/etc/t3-autoupdate.freeze}" +: "${STATE_DIR:=/var/lib/t3-autoupdate}" +: "${LAST_GOOD_FILE:=$STATE_DIR/last-good}" +: "${DEFER_DIR:=$STATE_DIR/deferred}" +: "${BACKUP_DIR:=/var/backups/t3-state}" +: "${DISPATCH:=127.0.0.1:3780}" +: "${USER_MAP:=/etc/ttyd-user-map}" +: "${T3_BACKUP_TIMEOUT:=900}" + +LOG() { logger -t "$LOG_TAG" "$*"; echo "$LOG_TAG: $*"; } +ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; } +# OS users owning a ~/.t3 (RHS of each non-comment "authentik=os_user" map line). +osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; } +# authentik username for an OS user (reverse map; first match) — for dispatch verify. +ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; } + +# Online consistent snapshot of ONE user's state.sqlite (run AS the owner so the +# WAL stays owned; never stops the serve). Uses global $target for the filename. +# Echoes the backup path on success; non-zero on failure. 
+backup_user() { + local u="$1" src out dst ts + src="/home/$u/.t3/userdata/state.sqlite"; [ -f "$src" ] || return 1 + ts="$(date +%Y%m%d-%H%M%S)" + out="$BACKUP_DIR/$u"; dst="$out/state-prebump-$target-$ts.sqlite" + install -d -o "$u" -g "$u" -m700 "$out" 2>/dev/null || mkdir -p "$out" + if runuser -u "$u" -- timeout "$T3_BACKUP_TIMEOUT" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [ -s "$dst" ]; then + printf '%s\n' "$dst"; return 0 + fi + rm -f "$dst"; return 1 +} + +# newest pre-bump backup for a user taken for the current $target (restore source). +prebump_of() { ls -1t "$BACKUP_DIR/$1/state-prebump-$target-"*.sqlite 2>/dev/null | head -1; } + +# roll the GLOBAL binary back to last-good. In the idle path last_good==installed, +# so this is a harmless no-op reinstall (does NOT downgrade other users). +rollback_binary() { + LOG "rolling back binary $target -> $last_good" + if npm i -g "t3@$last_good" >/dev/null 2>&1; then LOG "rolled back to $last_good"; return 0; fi + LOG "ROLLBACK FAILED — could not reinstall t3@$last_good (t3 may be broken; manual fix per runbook)"; return 1 +} + +# verify a user's pairing through the REAL dispatch (mint -> exchange -> cookie). +verify_pairing() { + local u="$1" ak out; ak="$(ak_for "$u")"; [ -n "$ak" ] || { LOG "no authentik mapping for $u — skipping dispatch verify"; return 0; } + out="$(curl -s -i --max-time 10 -H "X-authentik-username: $ak" -H 'Sec-Fetch-Dest: document' "http://$DISPATCH/" 2>/dev/null)" + printf '%s' "$out" | grep -qi '^set-cookie:[[:space:]]*t3_session=' +} + +# safe_restart_unit <unit> <user>: restart the unit, verify pairing; on failure +# restore the user's DB from its pre-restart backup, roll the binary back, freeze. +# Assumes a pre-restart backup already exists for <user> at the current $target +# (the daily job's backup_all, or the idle job's backup_user, takes it first). +# Returns 0 on verified success, non-zero after recovery+freeze on failure. 
+safe_restart_unit() { + local unit="$1" u="$2" ok=0 _ bak + systemctl restart "$unit" || LOG "WARN: systemctl restart $unit returned non-zero" + for _ in $(seq 1 15); do + if verify_pairing "$u"; then ok=1; break; fi + sleep 2 + done + if [ "$ok" = "1" ]; then + LOG "restarted $unit -> $target (pairing verified via dispatch)"; return 0 + fi + LOG "HEALTH-CHECK FAILED: $u pairing broken AFTER restart onto $target — rolling back + restoring its DB" + rollback_binary + bak="$(prebump_of "$u")" + if [ -n "$bak" ]; then + systemctl stop "$unit" 2>/dev/null + if install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite" 2>/dev/null; then + rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm" + LOG "restored $u state.sqlite from $bak" + fi + systemctl start "$unit" 2>/dev/null + fi + touch "$FREEZE_FILE" 2>/dev/null + LOG "FROZEN ($FREEZE_FILE) after $u failed on $target; last_good stays $last_good — investigate, then remove the freeze file to resume" + return 1 +} +``` + +- [ ] **Step 2: Syntax + lint check** + +Run: `bash -n scripts/t3-safe-restart.sh && (command -v shellcheck >/dev/null && shellcheck -x scripts/t3-safe-restart.sh || echo "shellcheck absent — skipped")` +Expected: no syntax errors. (shellcheck may warn on the intentional global `$target`/`$last_good` references — acceptable; they are documented caller-set globals.) + +- [ ] **Step 3: Source-and-define smoke test** + +Run: +```bash +bash -c 'LOG_TAG=test; . scripts/t3-safe-restart.sh; for f in LOG ver osusers ak_for backup_user prebump_of rollback_binary verify_pairing safe_restart_unit; do declare -F "$f" >/dev/null || { echo "MISSING $f"; exit 1; }; done; echo "all functions defined"' +``` +Expected: `all functions defined` (sourcing has no side effects — no exit, no output beyond the echo). 
+ +- [ ] **Step 4: Commit** + +```bash +GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false) +git "${GC[@]}" add scripts/t3-safe-restart.sh +git "${GC[@]}" commit -m "t3-safe-restart: extract shared safe-restart library from t3-autoupdate + +Pull the per-unit backup->restart->verify->recover routine (and the small +helpers it needs) out of t3-autoupdate.sh into a sourced library, so a second +job (the upcoming idle migrator) can reuse the exact same audited recovery path +instead of forking safety-critical code. safe_restart_unit returns non-zero on +failure (after recovery+freeze) rather than exiting, so callers control flow. + +Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>" +``` + +--- + +## Task 2: Refactor `t3-autoupdate.sh` to use the library + record deferrals + +**Files:** +- Modify: `scripts/t3-autoupdate.sh` (config block 32–42, helpers 44–165, step 6 loop 194–225) + +- [ ] **Step 1: Source the library; drop the now-shared helpers** + +Replace lines 32–52 (the `T3_*` config block through the `newer()` helper) with — keep the autoupdate-only config, source the lib for the shared bits: + +```bash +# ---- autoupdate-specific config (shared config + helpers come from the lib) ----- +T3_TRACK="${T3_TRACK:-nightly}" # npm dist-tag to follow (nightly | latest) +T3_PIN="${T3_PIN:-}" # optional HARD pin to an exact version (disables tracking) +SMOKE_PORT="${T3_SMOKE_PORT:-3799}" +DRY_RUN="${T3_DRY_RUN:-0}" +TMPROOT="${T3_TMPDIR:-/var/tmp}" # health-check scratch on DISK — /tmp is a 2G tmpfs and a populated state.sqlite (~hundreds of MB) overflows it + +LOG_TAG=t3-autoupdate +# shellcheck source=scripts/t3-safe-restart.sh +. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}" + +# is $1 a strictly-newer version than $2 (version-sort)? 
+newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; } + +mkdir -p "$STATE_DIR" 2>/dev/null || true +``` + +(The lib now provides `FREEZE_FILE`, `STATE_DIR`, `LAST_GOOD_FILE`, `DEFER_DIR`, `BACKUP_DIR`, `DISPATCH`, `USER_MAP`, `LOG`, `ver`, `osusers`, `ak_for`, `verify_pairing`, `prebump_of`, `rollback_binary`, `backup_user`, `safe_restart_unit`.) + +- [ ] **Step 2: Simplify `backup_all` to call the shared `backup_user`** + +Replace the `backup_all()` definition (lines 90–105) with: + +```bash +ADMIN_SEED="" +backup_all() { + local u dst + for u in $(osusers); do + if dst="$(backup_user "$u")"; then + LOG "pre-bump backup: $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)" + [ "$u" = "wizard" ] && ADMIN_SEED="$dst" + else + LOG "WARN: pre-bump backup FAILED for $u (/home/$u/.t3/userdata/state.sqlite)" + fi + done + [ -n "$ADMIN_SEED" ] || ADMIN_SEED="$(ls -1t "$BACKUP_DIR"/*/"state-prebump-$target-"*.sqlite 2>/dev/null | head -1)" +} +``` + +Delete the now-duplicated standalone `prebump_of`, `rollback_binary`, and `verify_pairing` definitions (lines 107–108, 146–152, 160–165) — they come from the lib. Keep `health_check` and `unit_busy` (autoupdate-only). 
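
To see how the `newer()` helper kept in `t3-autoupdate.sh` behaves, the sketch below copies it verbatim and runs it on a few version pairs. It assumes GNU `sort` (for `-V`); the `check` wrapper and the `20260607.490` build string are illustrative and do not come from the script.

```shell
#!/usr/bin/env bash
# newer() is copied from the plan; check() and some version strings are illustrative.
set -u

# is $1 a strictly-newer version than $2 (version-sort)?
newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; }

check() {   # check <candidate> <installed>: print the decision newer() would drive
  if newer "$1" "$2"; then echo "$1 > $2: upgrade"; else echo "$1 vs $2: stay"; fi
}

check 0.0.25 0.0.24                                            # strictly newer
check 0.0.24 0.0.24                                            # equal is not newer
check 0.0.24 0.0.25                                            # older is not newer
check 0.0.25-nightly.20260608.497 0.0.25-nightly.20260607.490  # later nightly build
```

The explicit inequality test matters: without it, two identical versions would sort to the same last line and be reported as an upgrade.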
+ +- [ ] **Step 3: Use `safe_restart_unit` + write/clear the deferral marker in step 6** + +Replace the step-6 loop body (lines 196–225) with: + +```bash +for unit in $(systemctl list-units --type=service --state=running --no-legend 't3-serve@*' 2>/dev/null | awk '{print $1}'); do + u="$(printf '%s' "$unit" | sed -n 's/^t3-serve@\(.*\)\.service$/\1/p')"; [ -n "$u" ] || continue + if unit_busy "$unit"; then + LOG "deferring $unit (active agent) — migrates on its next idle restart" + mkdir -p "$DEFER_DIR" 2>/dev/null && printf '%s\n' "$target" >"$DEFER_DIR/$u" # record for t3-migrate-idle + deferred=$((deferred+1)); continue + fi + if safe_restart_unit "$unit" "$u"; then + restarted=$((restarted+1)) + rm -f "$DEFER_DIR/$u" 2>/dev/null # now current — clear any stale marker + else + exit 1 # frozen by safe_restart_unit — preserve today's behavior + fi +done +``` + +- [ ] **Step 4: Syntax check + behavior-preserving dry-run diff** + +Run: +```bash +bash -n scripts/t3-autoupdate.sh +# Confirm the only remaining defer/restart decisions are unchanged vs HEAD~1 logic: +git -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false diff HEAD scripts/t3-autoupdate.sh | grep -E '^\+|^-' | grep -vE 'safe_restart_unit|backup_user|DEFER_DIR|source|\. "|LOG_TAG|^\+\+\+|^---' | head -40 +``` +Expected: no syntax errors; the diff shows only the extraction (calls replacing inline bodies) + the two marker lines — no change to install/health-gate/canary decision logic. + +- [ ] **Step 5: Commit** + +```bash +GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false) +git "${GC[@]}" add scripts/t3-autoupdate.sh +git "${GC[@]}" commit -m "t3-autoupdate: source the shared safe-restart lib + record deferrals + +Behavior-preserving refactor: the per-unit restart/recover body and small helpers +now come from t3-safe-restart.sh (one audited copy). 
Additionally, when a unit is +deferred for an active agent, write a marker under /var/lib/t3-autoupdate/deferred/ +so the new idle migrator can drain it later; clear the marker on a successful +restart. Install/health-gate/canary logic is unchanged. + +Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>" +``` + +--- + +## Task 3: The idle gate (TDD) — `gate_query` + `gate_is_safe` + +**Files:** +- Create: `tests/t3-migrate-idle-gate.test.sh` +- Create (incremental): `scripts/t3-migrate-idle.sh` (gate functions only this task) + +- [ ] **Step 1: Write the failing test** + +Create `tests/t3-migrate-idle-gate.test.sh`: + +```bash +#!/usr/bin/env bash +# Pure-bash unit tests for the t3-migrate-idle gate. No root, no bats, no Docker. +# Sources t3-migrate-idle.sh (main-guarded) with the lib path pointed at the worktree. +set -uo pipefail +HERE="$(cd "$(dirname "$0")/.." && pwd)" # repo root (tests/ is one level down) +export T3_SAFE_RESTART_LIB="$HERE/scripts/t3-safe-restart.sh" +# shellcheck source=/dev/null +. 
"$HERE/scripts/t3-migrate-idle.sh" # defines functions; main-guard prevents the drain from running + +pass=0; fail=0 +ok() { if "$@"; then pass=$((pass+1)); else fail=$((fail+1)); echo "FAIL: $*"; fi; } +notok(){ if "$@"; then fail=$((fail+1)); echo "FAIL (expected non-zero): $*"; else pass=$((pass+1)); fi; } + +# --- gate_is_safe <active> <idle_seconds> with QUIET_SECONDS=900 --- +QUIET_SECONDS=900 +ok gate_is_safe 0 1000 # idle, quiet long enough -> safe +notok gate_is_safe 1 1000 # a turn in flight -> unsafe +notok gate_is_safe 0 100 # idle but not quiet enough -> unsafe +ok gate_is_safe 0 "" # no threads at all (NULL idle) -> safe +notok gate_is_safe x 1000 # unparseable active -> unsafe +notok gate_is_safe 0 -30 # negative idle (clock skew) -> unsafe + +# --- gate_query <db> against fixture SQLite DBs --- +TMP="$(mktemp -d)"; trap 'rm -rf "$TMP"' EXIT +mkfix() { # mkfix <file> ; reads rows "active_turn_id|updated_at" on stdin + local f="$1"; sqlite3 "$f" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);" + while IFS='|' read -r a u; do sqlite3 "$f" "INSERT INTO projection_thread_sessions VALUES ($([ "$a" = NULL ] && echo NULL || echo "'$a'"), '$u');"; done +} +NOW="$(date -u +%Y-%m-%dT%H:%M:%S.000Z)" +OLD="$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S.000Z)" + +# active turn present -> "1|<small idle>" +printf '%s\n' "abc|$NOW" "NULL|$OLD" | mkfix "$TMP/active.db" +res="$(gate_query "$TMP/active.db")"; ok test "${res%%|*}" = "1" + +# all idle, last activity 1h ago -> "0|>=3500" +printf '%s\n' "NULL|$OLD" "NULL|$OLD" | mkfix "$TMP/idle.db" +res="$(gate_query "$TMP/idle.db")"; ok test "${res%%|*}" = "0"; ok test "${res##*|}" -ge 3500 + +# empty table -> "0|" (NULL idle) +sqlite3 "$TMP/empty.db" "CREATE TABLE projection_thread_sessions(active_turn_id TEXT, updated_at TEXT NOT NULL);" +res="$(gate_query "$TMP/empty.db")"; ok test "${res%%|*}" = "0" + +echo "PASS=$pass FAIL=$fail"; [ "$fail" -eq 0 ] +``` + +- [ ] **Step 2: Run 
it to verify it fails** + +Run: `bash tests/t3-migrate-idle-gate.test.sh` +Expected: FAIL — `scripts/t3-migrate-idle.sh` does not exist yet (source error). + +- [ ] **Step 3: Create `scripts/t3-migrate-idle.sh` with the gate functions + main-guard skeleton** + +```bash +#!/usr/bin/env bash +# t3-migrate-idle.sh — drains t3-autoupdate's deferral markers (via the overnight +# t3-migrate-idle.timer). For each deferred t3-serve@<user>, if nothing is actively +# working in that instance (no in-flight turn + a quiet buffer), restart it onto the +# current binary using the shared safe_restart_unit, then clear the marker. +# Why this exists: t3-autoupdate defers a user with an active agent at its single +# daily window; a user busy every night never migrates and their client shows +# "Client and server versions differ". See docs/plans/2026-06-21-t3-idle-migrate-*. +set -uo pipefail + +LOG_TAG=t3-migrate-idle +# shellcheck source=scripts/t3-safe-restart.sh +. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}" + +QUIET_SECONDS="${T3_MIGRATE_QUIET_SECONDS:-900}" # required idle before a restart (15 min) +DRY_RUN="${T3_DRY_RUN:-0}" + +# pure logic: is it safe given <active_turns> and <idle_seconds>? fail closed. +gate_is_safe() { + local active="$1" idle="$2" + case "$active" in ''|*[!0-9]*) return 1;; esac # unparseable/empty active -> unsafe + [ "$active" -eq 0 ] || return 1 # a turn is running -> unsafe + [ -z "$idle" ] && return 0 # no threads at all -> safe + case "$idle" in ''|*[!0-9-]*) return 1;; esac # non-numeric -> unsafe + [ "$idle" -ge "$QUIET_SECONDS" ] # negative or < quiet -> unsafe +} + +# query a state.sqlite (path or file: URI). Echoes "<active_turns>|<idle_seconds>". +# idle_seconds is empty when there are no rows. Normalizes ISO 'T'/'Z' for julianday. 
+gate_query() { + local db="$1" + sqlite3 -batch -noheader -separator '|' "$db" \ + "SELECT + (SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL), + CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT) + FROM projection_thread_sessions;" +} + +# safe_to_restart <user>: wire runuser + the user's DB into gate_query/gate_is_safe. +safe_to_restart() { + local u="$1" db row + db="/home/$u/.t3/userdata/state.sqlite"; [ -f "$db" ] || return 1 + row="$(runuser -u "$u" -- sqlite3 -batch -noheader -separator '|' "file:$db?mode=ro" \ + "SELECT + (SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL), + CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT) + FROM projection_thread_sessions;" 2>/dev/null)" || return 1 + gate_is_safe "${row%%|*}" "${row##*|}" +} + +main() { + : # drain loop added in Task 4 +} + +# main-guard: run only when executed, not when sourced (tests source this file). +if [ "${BASH_SOURCE[0]}" = "${0}" ]; then main "$@"; fi +``` + +- [ ] **Step 4: Run the test to verify it passes** + +Run: `bash tests/t3-migrate-idle-gate.test.sh` +Expected: `PASS=10 FAIL=0` (exit 0). + +- [ ] **Step 5: Commit** + +```bash +GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false) +git "${GC[@]}" add scripts/t3-migrate-idle.sh tests/t3-migrate-idle-gate.test.sh +git "${GC[@]}" commit -m "t3-migrate-idle: idle gate (no in-flight turn + quiet buffer), TDD + +The gate reads t3's state.sqlite: safe to restart only when zero threads have an +active_turn_id AND the most-recent thread activity is older than the quiet buffer +(default 15m). Fail-closed on any parse/query error. Pure-bash unit tests cover +the boundaries against fixture DBs (no root/bats/Docker). 
+ +Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>" +``` + +--- + +## Task 4: The marker-drain loop in `t3-migrate-idle.sh` + +**Files:** +- Modify: `scripts/t3-migrate-idle.sh` (replace the `main()` skeleton) + +- [ ] **Step 1: Implement `main()` (the drain loop)** + +Replace the `main() { : ; }` skeleton with: + +```bash +main() { + # a frozen build must not be auto-migrated (shared switch with t3-autoupdate) + if [ -e "$FREEZE_FILE" ]; then LOG "FROZEN: $FREEZE_FILE present — not draining deferrals"; exit 0; fi + [ -d "$DEFER_DIR" ] || exit 0 # nothing deferred + last_good="$(tr -d '[:space:]' <"$LAST_GOOD_FILE" 2>/dev/null)" # rollback target for the helper + + local marker u unit started mwritten migrated=0 skipped=0 + for marker in "$DEFER_DIR"/*; do + [ -e "$marker" ] || continue # empty-dir glob + u="$(basename "$marker")"; unit="t3-serve@$u.service" + if ! systemctl is-active --quiet "$unit"; then + LOG "clearing marker for $u: $unit not active"; rm -f "$marker"; continue + fi + started="$(date -d "$(systemctl show -p ActiveEnterTimestamp --value "$unit" 2>/dev/null)" +%s 2>/dev/null || echo 0)" + mwritten="$(stat -c %Y "$marker" 2>/dev/null || echo 0)" + if [ "$started" -gt "$mwritten" ]; then + LOG "clearing marker for $u: $unit already restarted $((started-mwritten))s after the deferral"; rm -f "$marker"; continue + fi + if ! safe_to_restart "$u"; then skipped=$((skipped+1)); continue; fi + + target="$(tr -d '[:space:]' <"$marker" 2>/dev/null)"; [ -n "$target" ] || target="$(ver)" + if [ "$DRY_RUN" = "1" ]; then LOG "DRY_RUN: would migrate $unit -> $target (idle gate satisfied)"; continue; fi + if ! 
backup_user "$u" >/dev/null; then + LOG "WARN: pre-restart backup failed for $u — skipping (fail closed)"; skipped=$((skipped+1)); continue + fi + if safe_restart_unit "$unit" "$u"; then + LOG "migrated $unit -> $target (idle restart)"; rm -f "$marker"; migrated=$((migrated+1)) + else + LOG "migrate FAILED for $unit — recovery+freeze handled by safe_restart_unit; stopping drain"; exit 1 + fi + done + LOG "idle-migrate pass complete (migrated=$migrated skipped=$skipped)" +} +``` + +- [ ] **Step 2: Re-run the gate tests (regression — main-guard still source-safe)** + +Run: `bash tests/t3-migrate-idle-gate.test.sh` +Expected: `PASS=10 FAIL=0` (sourcing still defines functions without running the loop). + +- [ ] **Step 3: Syntax + lint** + +Run: `bash -n scripts/t3-migrate-idle.sh && (command -v shellcheck >/dev/null && shellcheck -x scripts/t3-migrate-idle.sh || echo "shellcheck absent — skipped")` +Expected: no syntax errors. + +- [ ] **Step 4: Commit** + +```bash +GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false) +git "${GC[@]}" add scripts/t3-migrate-idle.sh +git "${GC[@]}" commit -m "t3-migrate-idle: drain deferral markers when safe + +For each /var/lib/t3-autoupdate/deferred/<user> marker: skip+clear if the unit is +gone or was already restarted after the deferral; otherwise, when the idle gate is +satisfied, take a pre-restart backup and restart via the shared safe_restart_unit, +clearing the marker on verified success. DRY_RUN logs decisions without acting. 
+ +Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>" +``` + +--- + +## Task 5: systemd units + +**Files:** +- Create: `scripts/t3-migrate-idle.service`, `scripts/t3-migrate-idle.timer` + +- [ ] **Step 1: Create the service unit** + +`scripts/t3-migrate-idle.service`: +```ini +[Unit] +Description=t3 idle migrator — restart deferred t3-serve instances onto the current binary when idle +Documentation=https://forgejo.viktorbarzin.me/viktor/infra/src/branch/master/docs/plans/2026-06-21-t3-idle-migrate-design.md +After=network.target t3-dispatch.service + +[Service] +Type=oneshot +ExecStart=/usr/local/bin/t3-migrate-idle +``` + +- [ ] **Step 2: Create the timer unit** + +`scripts/t3-migrate-idle.timer`: +```ini +[Unit] +Description=Overnight drain of t3-autoupdate deferrals (idle-gated t3-serve migration) + +[Timer] +OnCalendar=*-*-* 01..05:00/20 +RandomizedDelaySec=120 +Persistent=false + +[Install] +WantedBy=timers.target +``` + +- [ ] **Step 3: Validate unit syntax** + +Run: `systemd-analyze verify scripts/t3-migrate-idle.service scripts/t3-migrate-idle.timer 2>&1 | grep -v 'Unknown\|Cannot find' || echo "units parse OK"` +Expected: no fatal parse errors (warnings about the `[Install]` of a non-installed unit / missing exec on a non-deployed path are acceptable in the worktree). + +- [ ] **Step 4: Confirm the OnCalendar expands to the intended overnight slots** + +Run: `systemd-analyze calendar '*-*-* 01..05:00/20' --iterations=5` +Expected: next elapses at 01:00/01:20/01:40/02:00/… (every 20 min, hours 01–05). 
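
As a cross-check that does not need systemd, the slots implied by `OnCalendar=*-*-* 01..05:00/20` can be enumerated by hand: hours 01 to 05, minutes 00, 20 and 40. This is plain arithmetic under that reading of the expression, not output from `systemd-analyze`, and it ignores the up-to-120 s shift from `RandomizedDelaySec`.

```shell
#!/usr/bin/env bash
# Enumerate the nominal slots for OnCalendar=*-*-* 01..05:00/20.
# Pure arithmetic; does not call systemd and ignores RandomizedDelaySec.
set -u

slots=()
for h in 1 2 3 4 5; do          # hours 01..05
  for m in 0 20 40; do          # minute 00, then every 20 minutes
    slots+=("$(printf '%02d:%02d' "$h" "$m")")
  done
done

last_index=$(( ${#slots[@]} - 1 ))
echo "count=${#slots[@]} first=${slots[0]} last=${slots[$last_index]}"
```

That gives 15 nominal ticks per night, from 01:00 through 05:40, which matches the window named in the commit message.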
+ +- [ ] **Step 5: Commit** + +```bash +GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false) +git "${GC[@]}" add scripts/t3-migrate-idle.service scripts/t3-migrate-idle.timer +git "${GC[@]}" commit -m "t3-migrate-idle: systemd oneshot + overnight timer (01:00-05:40, /20) + +Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>" +``` + +--- + +## Task 6: Wire into `setup-devvm.sh` + +**Files:** +- Modify: `scripts/workstation/setup-devvm.sh` (9a install ~line 164; 9d unit loop ~line 200; enable ~line 218) + +- [ ] **Step 1: Install the lib + the new script (section 9a)** + +After the `install -m 0755 "$SCRIPTS/t3-autoupdate.sh" /usr/local/bin/t3-autoupdate` line, add: +```bash +install -m 0644 "$SCRIPTS/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh +install -m 0755 "$SCRIPTS/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle +``` + +- [ ] **Step 2: Install the unit files (section 9d loop)** + +Add to the `for u in …` unit list (after the `t3-autoupdate.service t3-autoupdate.timer \` line): +```bash + t3-migrate-idle.service t3-migrate-idle.timer \ +``` + +- [ ] **Step 3: Enable the timer (section 9 enable line)** + +Append `t3-migrate-idle.timer` to the `systemctl enable --now` list: +```bash +systemctl enable --now t3-dispatch.service \ + t3-autoupdate.timer t3-backup-state.timer t3-provision-users.timer t3-migrate-idle.timer >/dev/null 2>&1 || \ + log "WARN: some units failed to enable (check: systemctl status t3-dispatch t3-*.timer)" +``` + +- [ ] **Step 4: Syntax check** + +Run: `bash -n scripts/workstation/setup-devvm.sh` +Expected: no syntax errors. 
+ +- [ ] **Step 5: Commit** + +```bash +GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false) +git "${GC[@]}" add scripts/workstation/setup-devvm.sh +git "${GC[@]}" commit -m "setup-devvm: install + enable t3-migrate-idle (lib, script, units, timer) + +Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>" +``` + +--- + +## Task 7: Deploy to the devvm + validate (dry-run first) + +**Files:** none (operational). Presence-claimed, shared-host mutation. + +- [ ] **Step 1: Claim the host** + +Run: `homelab claim host:devvm --purpose "deploy t3-migrate-idle units (idle-gated t3-serve migration)"` +Expected: claim acquired (if already held by another session, defer per CLAUDE.md). + +- [ ] **Step 2: Install the artifacts (mirror setup-devvm.sh 9a/9d)** + +Run: +```bash +W=/home/wizard/code/infra/.worktrees/t3-idle-migrate/scripts +sudo install -m 0644 "$W/t3-safe-restart.sh" /usr/local/lib/t3-safe-restart.sh +sudo install -m 0755 "$W/t3-migrate-idle.sh" /usr/local/bin/t3-migrate-idle +sudo install -m 0644 "$W/t3-migrate-idle.service" /etc/systemd/system/t3-migrate-idle.service +sudo install -m 0644 "$W/t3-migrate-idle.timer" /etc/systemd/system/t3-migrate-idle.timer +sudo systemctl daemon-reload +``` +Expected: no errors. + +- [ ] **Step 2b: Re-point the live daily job at the installed lib (it now sources it)** + +The deployed `/usr/local/bin/t3-autoupdate` is the OLD inline version until setup-devvm re-runs; install the refactored one so both jobs share the lib: +```bash +sudo install -m 0755 "$W/t3-autoupdate.sh" /usr/local/bin/t3-autoupdate +sudo /usr/local/bin/t3-autoupdate # safe: same-version run exits at "already on nightly; nothing to do" +``` +Expected: log line `already on <track>=<ver>; nothing to do` (proves the refactored daily job sources the lib and runs clean). 
+ +- [ ] **Step 3: DRY-RUN the idle migrator against live state** + +Run: `sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle; echo "exit=$?"` +Expected: with wizard currently busy (mid-turn during the day), a `skipped` count — `idle-migrate pass complete (migrated=0 skipped=N)` — and NO restart. (If wizard happens to be idle+quiet, it logs `DRY_RUN: would migrate t3-serve@wizard …` and still does not act.) + +- [ ] **Step 4: Seed a deferral marker for the current skew + dry-run again** + +The live daily job already deferred wizard but the marker mechanism is new, so create it once to represent the existing `.605→.613` debt: +```bash +sudo install -d -m755 /var/lib/t3-autoupdate/deferred +printf '%s\n' "$(t3 --version | awk '{print $NF}' | sed 's/^v//')" | sudo tee /var/lib/t3-autoupdate/deferred/wizard >/dev/null +sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle; echo "exit=$?" +``` +Expected: the pass now considers `wizard` — either `DRY_RUN: would migrate t3-serve@wizard.service -> …613` (if idle) or counted in `skipped` (if mid-turn). Confirms marker drain + gate wiring end-to-end without acting. + +- [ ] **Step 5: Enable the timer (live)** + +Run: `sudo systemctl enable --now t3-migrate-idle.timer && systemctl list-timers t3-migrate-idle.timer --no-pager` +Expected: timer active, next elapse in the 01:00–05:40 window. + +- [ ] **Step 6: Release the claim** + +Run: `homelab release host:devvm` + +> **First live migration** happens overnight at the first idle+quiet tick. Verify next session: `journalctl -u t3-migrate-idle.service --since yesterday | grep -E 'migrated|skipped|DRY|FROZEN'` and `t3 --version` vs the running server's version. (The user-facing resume-after-restart is observed here — design open-question (a).) 
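
The marker seeded in Step 4 is read back by the drain loop with `tr -d '[:space:]'`, falling back to the installed version when the file is empty. A rough sketch of that round-trip in a throwaway directory follows; `read_target`, the user names, and the version strings are placeholders for illustration, and the real loop uses `$(ver)` as its fallback.

```shell
#!/usr/bin/env bash
# Sketch of the marker round-trip. DEFER_DIR is a temp dir here, and the
# user names and versions are placeholders, not live values.
set -u

DEFER_DIR="$(mktemp -d)"
trap 'rm -rf "$DEFER_DIR"' EXIT

read_target() {   # read_target <marker> <fallback>: marker content, else fallback
  local t
  t="$(tr -d '[:space:]' <"$1" 2>/dev/null)"
  [ -n "$t" ] || t="$2"
  printf '%s\n' "$t"
}

printf '%s\n' "1.2.3" >"$DEFER_DIR/alice"     # marker written at deferral time
: >"$DEFER_DIR/bob"                           # empty marker (for example a truncated write)

for marker in "$DEFER_DIR"/*; do
  [ -e "$marker" ] || continue                # guard against an unexpanded glob
  echo "$(basename "$marker") -> $(read_target "$marker" "9.9.9")"
done
```

Running this against a temp directory is one way to rehearse the marker handling without touching `/var/lib/t3-autoupdate/deferred` on the shared host.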
+ +--- + +## Task 8: Docs + +**Files:** +- Modify: `docs/runbooks/t3-version-bump.md` (add an idle-migrate section) +- Modify: `.claude/reference/service-catalog.md` (add the unit) +- Modify: `docs/plans/2026-06-21-t3-idle-migrate-design.md` (Status → implemented) + +- [ ] **Step 1: Runbook** — add a section after the autoupdate description: + +```markdown +## Idle migrator (`t3-migrate-idle.timer`) + +`t3-autoupdate` defers a user's `t3-serve` restart when they have an active agent +at the daily window, recording `/var/lib/t3-autoupdate/deferred/<user>`. +`t3-migrate-idle` (overnight, every 20 min 01:00–05:40) drains those markers: +it restarts a deferred instance onto the current binary only when that user's +`state.sqlite` shows no in-flight turn (`active_turn_id`) and ≥15 min quiet, via +the shared `safe_restart_unit` (same backup→verify→recover as the daily canary). +- **Force a migration now:** `sudo systemctl start t3-migrate-idle.service` (still idle-gated). +- **Preview without acting:** `sudo T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle`. +- **Stop it:** the shared `/etc/t3-autoupdate.freeze` halts both jobs. +- **Rare-tail failure:** a forward-migration failure at idle restart restores the + user's DB + freezes + alerts (the binary rollback is a no-op since the build was + already accepted); the user's server may crashloop on the restored DB until the + freeze is cleared. Investigate per the rollback section above. +``` + +- [ ] **Step 2: service-catalog** — add a row/line for `t3-migrate-idle.timer` (overnight idle-gated t3-serve migration; sources `t3-safe-restart.sh`). + +- [ ] **Step 3: design doc status** — change the header `Status:` to `implemented 2026-06-21 (commits on wizard/t3-idle-migrate)`. 
+ +- [ ] **Step 4: Commit** + +```bash +GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false) +git "${GC[@]}" add docs/runbooks/t3-version-bump.md .claude/reference/service-catalog.md docs/plans/2026-06-21-t3-idle-migrate-design.md +git "${GC[@]}" commit -m "docs: t3-migrate-idle runbook + service-catalog + design status + +Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>" +``` + +--- + +## Task 9: Land + +- [ ] **Step 1: Merge latest master into the branch** + +Run: +```bash +GC=(-c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false) +git "${GC[@]}" fetch forgejo +git "${GC[@]}" merge --no-edit forgejo/master +``` +Expected: clean merge (no conflicts; the files are new or autoupdate-only). Resolve if any. + +- [ ] **Step 2: Re-run the gate tests post-merge** + +Run: `bash tests/t3-migrate-idle-gate.test.sh` +Expected: `PASS=10 FAIL=0`. + +- [ ] **Step 3: Push to master** + +Run: `git -c filter.git-crypt.smudge=cat -c filter.git-crypt.clean=cat -c filter.git-crypt.required=false push forgejo HEAD:master` +Expected: accepted. Non-fast-forward → fetch/merge/retry. + +- [ ] **Step 4: Watch CI to completion** + +Run: `homelab ci watch` +Expected: green (infra apply pipeline — this change is scripts/docs only, no Terraform, so apply is a no-op for it). 
+ +- [ ] **Step 5: Clean up the worktree** + +Run (from the main checkout): +```bash +git -C /home/wizard/code/infra worktree remove .worktrees/t3-idle-migrate +git -C /home/wizard/code/infra branch -d wizard/t3-idle-migrate +``` + +--- + +## Self-review + +- **Spec coverage:** marker mechanism (T2,T4) · shared safe-restart lib / approach C (T1) · idle gate active_turn_id+quiet (T3) · overnight timer (T5) · all-users self-limiting via markers (T4 loop) · failure recovery reuse (T1, note) · observability logs (LOG_TAG throughout) · delivery via setup-devvm (T6) · presence-claimed deploy (T7) · TDD on the gate (T3) · dry-run rollout (T7) · docs (T8). Optional Pushgateway marker-age gauge from the design is **intentionally deferred** (logged here as a follow-up, not built — keeps scope to the shipping mechanism). +- **Placeholders:** none — every file has complete content; every command has expected output. +- **Type/name consistency:** `safe_restart_unit`, `backup_user`, `prebump_of`, `gate_query`, `gate_is_safe`, `safe_to_restart`, `DEFER_DIR`, `QUIET_SECONDS`, `T3_SAFE_RESTART_LIB`, `LOG_TAG` used identically across tasks. `target`/`last_good` are documented caller-set globals consumed by lib functions. diff --git a/docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md b/docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md index f3ea2b8e..0d4d82c1 100644 --- a/docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md +++ b/docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md @@ -178,6 +178,16 @@ During the recovery, a second cascade was discovered that compounded the outage: **Still the real fix (from this PM, still TODO):** the P0 import-side cap, and especially the **IO-isolation** items — move k8s-master **etcd** + node OS disks off sdc onto SSD (generalize P3), and/or give the Immich library its own spindle (P1). Concurrency caps are a band-aid; sdc remains a single shared failure domain that every storm finds. 
Tracked in beads (see Follow-up Implementation). +## Update 2026-06-16 — 6th IO-pressure incident (same `anca-elements-import` Job re-triggered) + +**Same direct trigger as 2026-05-25.** The original `kubernetes_job_v1.anca_elements_import` resource block was never removed from `stacks/immich/main.tf` after the 2026-05-25 import completed — despite the in-code comment instructing "After successful completion: REMOVE this resource block + apply again." Every subsequent `terragrunt apply` of the immich stack re-created the Job. On 2026-06-16 ~20:50 UTC it ran again with the original `--concurrent-tasks 20`, scanning all 21,643 Immich assets in pure read-scan mode (`Uploaded 0`) for ~51 min. Result mirrored 2026-06-01: 62 of 64 nfsd threads in D-state on `folio_wait_bit_common`, sdc 80–82% util, **etcd starved → kube-apiserver crash-loop with `start-service-ip-repair-controllers failed: unable to perform initial IP and Port allocation check`**. Cluster unreachable; PVE host load peaked at 102 of 44 threads. The 2026-06-01 server-side job concurrency caps (`thumbnailGeneration=2, metadataExtraction=2, library=2`) held — the storm was on the import side, not the ML side. + +**Immediate recovery**: `nfsd` throttled `64 → 8` threads on the PVE host (gave apiserver enough headroom to come back), then `kubectl delete job -n immich anca-elements-import` + force-delete the pod. Storm cleared instantly: sdc 80% → 30% util, all nfsd threads idle, apiserver `/readyz: ok`. nfsd restored to 64. + +**Permanent fix (this commit)**: Removed `kubernetes_job_v1.anca_elements_import` AND the `module "nfs_anca_elements_host"` PVC from `stacks/immich/main.tf`. The photo batch is complete; per user, the videos batch is not on the near roadmap, so the PVC + the comment scaffold around it are gone too. The on-disk dump at `/srv/nfs/anca-elements` on the PVE host is **kept** (browseable via Nextcloud's admin-only "PVE NFS Pool" mount); decision on deletion deferred to user. 
A future import would re-add the PVC + a fresh Job (or, better, a one-shot manual `kubectl create job` invocation that does not live in Terraform — see Lessons below). + +**Updated lesson — one-shot Jobs do NOT belong in `kubernetes_job_v1`.** TF treats Jobs as long-lived resources and re-creates them on every apply if state drift is detected. A truly one-shot import either (a) becomes a `kubernetes_cron_job_v1` with `suspend = true` (Viktor can un-suspend → run → re-suspend) or (b) lives outside TF entirely as a `kubectl create job --from=...` ad-hoc invocation captured in `docs/runbooks/`. The "REMOVE this resource block + apply again" comment failed as a control because nobody noticed it for 22 days. + ## Related - 2026-05-09 IO post-mortem: `docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md` diff --git a/docs/post-mortems/2026-06-09-t3-nightly-autoupdate-auth-outage.md b/docs/post-mortems/2026-06-09-t3-nightly-autoupdate-auth-outage.md new file mode 100644 index 00000000..5fbfdc79 --- /dev/null +++ b/docs/post-mortems/2026-06-09-t3-nightly-autoupdate-auth-outage.md @@ -0,0 +1,186 @@ +# Post-Mortem: t3 Nightly Auto-Update (0.0.25) Migrated `state.sqlite` Forward → mint/pairing Broke for All Devvm Users + +## Summary + +The devvm t3 auto-updater (`t3-autoupdate.timer`) pulled the `t3@nightly` +build `0.0.25-nightly.20260608.497`. That build ran two forward schema +migrations on every per-user `~/.t3/userdata/state.sqlite` (renaming +`role`→`scopes` in `auth_pairing_links` + `auth_sessions`, adding +`proof_key_thumbprint`) **and** changed the bootstrap API. The result was a +binary-vs-schema mismatch that broke `t3-mint` (pairing-credential issuance) +for **all** users — every fresh login landed on the t3 pairing prompt instead +of an authenticated session. + +## Impact + +- **Who:** every devvm t3 user — `wizard` (Viktor), `emo`, `ancamilea`. 
+- **What:** `t3 auth pairing create` failed (`AuthControlPlaneError: + Failed to create pairing link` → `PersistenceSqlError` on + `auth_pairing_links`), so `t3-dispatch` auto-pair returned 500/502 and the + browser showed the pairing prompt. Existing *already-authenticated* sessions + kept working (validated against `auth_sessions`, not the pairing path). +- **When:** ~13:56 (bad nightly installed) → ~15:16 (all users verified 302). +- **Trigger of the report:** Anca could not log in ("gets the pair prompt, + session broken"). + +## Timeline (devvm clock) + +- **13:56** — `t3-provision-users` step 5b ran `systemctl enable --now + t3-autoupdate.timer`. The timer is `OnCalendar=04:00 … Persistent=true`; + `--now` + a missed 04:00 schedule fired the daily job **immediately**. +- **13:56** — updater installed `t3@nightly` = `0.0.25-nightly.20260608.497` + (was `0.0.24`). The `GET / → 200` health-check **passed** (it never + exercises mint/bootstrap), so no auto-rollback. It restarted *idle* serves + (emo) onto 0.0.25 and deferred *active* ones (wizard, ancamilea). +- **~14:38** — `t3-mint` (now global 0.0.25) ran migrations 31 + (`AuthAuthorizationScopes`) + 32 (`AuthPairingProofKeyThumbprint`) against + each `state.sqlite` it touched → schemas moved to "level 32". +- **~14:40** — first recovery action rolled the **binary** back to `0.0.24`. + This did **not** help: the DBs were still at level 32, so the level-30 + binary's INSERT hit `no column named role` / `NOT NULL constraint failed: + scopes`. (Downgrading a binary after a forward migration is not a rollback.) +- **~15:01–15:16** — diagnosed the binary-vs-schema mismatch, confirmed + `0.0.25` *stable* is **also** dispatch-incompatible (auto-pair → 502, the + bootstrap API moved), pinned to `0.0.24`, reset the two new users' disposable + DBs, surgically reverted wizard's two auth tables to level 30. All three + users verified 302 + `Set-Cookie: t3_session`. + +## Root Cause + +Three compounding factors: + +1. 
**Auto-tracking a pre-1.0 tool's nightly.** `t3-autoupdate.sh` ran + `npm i -g t3@nightly`. t3 ships breaking schema-migration and bootstrap-API + changes between builds; our `t3-dispatch` (Go) speaks a fixed bootstrap + contract (`POST /api/auth/bootstrap {"credential":…}` → `Set-Cookie`). +2. **`enable --now` on a `Persistent=true` timer.** The provisioner's + re-assertion of the timer didn't just *arm* the schedule — it fired the + missed daily job on the spot, mid-afternoon, with users active. +3. **A health-check that proves nothing about auth.** The smoke test only + probes `GET / → 200`. The 0.0.25 server answers 200 while its pairing/mint + path is incompatible, so the "auto-rollback on bad build" never triggered. + +Forward migrations + a binary downgrade = a DB the old binary can't write. +`state.sqlite` also holds the precious projection tables (session history), so +a blanket "delete and re-pair" was only safe for the brand-new users. + +## Detection + +User report (Anca on the pairing prompt). No alert fired — the auto-updater's +own health-check is the only automated gate and it passed. **Gap:** nothing +monitors the end-to-end pairing flow. + +## Fixes & Mitigations + +### 1. Pin t3, stop tracking nightly (DONE) + +`infra/scripts/t3-autoupdate.sh` is now a **pinned-version enforcer**: +`T3_PIN="${T3_PIN:-0.0.24}"`, `npm i -g "t3@$T3_PIN"`. It re-asserts the pin +(a no-op when already correct) instead of chasing nightly. Unit `Description`s +updated. To move the pin: bump `T3_PIN` **and first** verify `t3-dispatch`'s +bootstrap flow against the new build (`curl` the dispatch → expect 302 + +`Set-Cookie: t3_session`). + +### 2. Drop `--now` from the provisioner (DONE) + +`infra/scripts/t3-provision-users.sh` step 5b now runs `systemctl enable +t3-autoupdate.timer` (no `--now`) — it arms the 04:00 schedule without firing a +missed job immediately. + +### 3. 
Pinned install at machine setup (DONE) + +`infra/scripts/workstation/setup-devvm.sh` installs `t3@$T3_PIN` directly, so a +fresh box has the pinned t3 immediately rather than depending on the enforcer's +first run. + +### 4. Recovery actions taken on the host (DONE) + +- Global `t3` rolled to `0.0.24`; enforcer redeployed + timer re-enabled + (verified the enforcer is a no-op at the pin). +- New users (`emo` 0 threads, `ancamilea` 1 trivial thread): `state.sqlite` + parked aside; serve restarted → fresh level-30 DB. +- `wizard` (96 threads, and the serve hosting the recovery session — cannot be + restarted): the two auth tables were atomically rebuilt to the level-30 + schema (copied from a fresh DB) and migration records 31/32 removed. + `auth_sessions` had 0 rows and the 0.0.24 serve never reads `scopes`, so the + live session and all projection history were untouched. Backup: + `/home/wizard/.t3/userdata/auth-backup-*.sql`. + +### 5. End-to-end pairing health-check (DONE — 2026-06-09 follow-up) + +`t3-autoupdate.sh`'s smoke test now exercises the REAL handshake — mint → +`POST` the credential (trying `browser-session` then `bootstrap`) → require +`200` + a `t3_session` cookie — not just `GET / → 200`. A build that renames or +breaks the pairing API now fails the check and **auto-rolls-back**, instead of +shipping a pairing-broken binary to everyone. + +### 6. Version-agnostic dispatch + reversible bumps (DONE — "prepare for 0.0.25") + +So the pin can move without another outage: +- **`t3-dispatch` is now version-agnostic** — `autoPair` tries + `/api/auth/browser-session` (0.0.25) and falls back to `/api/auth/bootstrap` + (0.0.24), so one binary pairs across the rename and through rolling-restart + skew. Covered by `TestAutoPairAcrossVersions`. Investigation confirmed the + 0.0.25 break was *only* this endpoint rename — the rest of the contract + (credential payload, `t3_session` cookie, `/api/auth/session`) is byte-identical. 
+- **`~/.t3` state is now backed up** — `t3-backup-state` (daily timer, online + `VACUUM INTO`, timeout-guarded) snapshots each user's `state.sqlite` (previously + the only copy, unbacked). This turns the one-way forward migration into a + *restore*, not sqlite surgery. +- **Cutover is a checklist** — `docs/runbooks/t3-version-bump.md` (pre-flight + verify, pre-bump backup, enforcer install + auto-rollback, verify, restore). + +## Lessons + +- **Don't auto-track a pre-1.0 tool's nightly.** Pin to a known-good, + contract-verified build; upgrades are a deliberate, tested act. +- **`enable --now` on a `Persistent=true` timer fires the missed job now.** + Use plain `enable` to arm a schedule without a surprise immediate run. +- **A liveness probe (`GET /`) is not a readiness/correctness probe.** If a + feature (auth/pairing) can break while `/` stays 200, the health-check must + exercise that feature or it gives false confidence. +- **A binary downgrade is not a schema rollback.** Once a forward migration + runs, the data is migrated; the old binary now mismatches its own DB. +- **Separate disposable state from precious state before resetting.** t3's + `state.sqlite` mixes ephemeral auth (`auth_pairing_links`, `auth_sessions`) + with precious history (`projection_*`); surgical table-level repair + preserved 8k+ messages that a blanket reset would have destroyed. 
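The "liveness probe is not a correctness probe" lesson can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual `t3-autoupdate.sh` smoke test or the Go `autoPair`: the `pairing_ok` helper and the injected `post` transport are hypothetical names, and the exact header casing is assumed. Only the two endpoint paths, the `{"credential": …}` payload, and the `200` + `t3_session` cookie requirement come from the text above.

```python
from typing import Callable, Mapping, Optional, Tuple

# Endpoint paths named in this post-mortem: the 0.0.25 name first,
# then the 0.0.24 fallback.
PAIRING_ENDPOINTS = ("/api/auth/browser-session", "/api/auth/bootstrap")

# Hypothetical transport signature: post(path, json_body) -> (status, headers).
# Injecting it keeps the sketch free of any real HTTP client.
Post = Callable[[str, Mapping[str, str]], Tuple[int, Mapping[str, str]]]


def pairing_ok(post: Post, credential: str) -> Optional[str]:
    """Return the endpoint that completed the handshake, or None if all failed.

    A build passes only when some endpoint answers 200 AND sets a
    t3_session cookie; a bare 200 from GET / says nothing about auth.
    """
    for path in PAIRING_ENDPOINTS:
        status, headers = post(path, {"credential": credential})
        # Assumption: the transport exposes the header under this exact key.
        cookie = headers.get("Set-Cookie", "")
        if status == 200 and cookie.startswith("t3_session="):
            return path
    return None
```

Returning the endpoint that succeeded, rather than a plain boolean, lets a caller log which path paired and so count how often the fallback is used.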
+ +## References + +- `infra/scripts/t3-autoupdate.sh` (gated nightly TRACKER since 2026-06-16; was the pinned enforcer), `.service`, `.timer` +- `infra/scripts/t3-provision-users.sh` step 5b +- `infra/scripts/workstation/setup-devvm.sh` step 2b +- `infra/.claude/reference/service-catalog.md` (t3 serving layer) +- Backup of wizard's pre-repair auth tables: `/home/wizard/.t3/userdata/auth-backup-*.sql` + +## 2026-06-16 update: gated nightly tracking deliberately re-enabled + +Viktor chose to **reverse the pin** and auto-track `t3@nightly` again — accepting +the churn risk — with the explicit requirement "make sure session auth works and +revert if the fallback/failure rate climbs." The naive nightly tracking that +caused this incident is now replaced by a GATED tracker that closes every gap the +root-cause + lessons sections named: + +- **Detection gap (was still open)** → the dispatch now logs every pairing + outcome (success endpoint + fallback) and the enforcer logs rollbacks/freezes; + Loki alerts (`T3PairingBroken`, `T3PairFallbackHigh`, `T3AutoUpdate*`) page on + real breakage. The pre-existing `t3-probe` only checks `GET /api/auth/session + == 200`, which stays 200 even when pairing is dead — it never caught this class. +- **"A liveness probe is not a correctness probe"** → the health-check now SEEDS + a throwaway serve with a COPY of a real populated `state.sqlite` and runs the + forward MIGRATION + real pairing handshake before trusting a build. +- **"A binary downgrade is not a schema rollback"** → mandatory pre-bump + `VACUUM INTO` backup; rollback restores the DB; a canary failure auto-restores + + self-freezes. +- **All-at-once blast radius** → canary rollout (idle instances one at a time, + pairing-verified through the dispatch; active-agent sessions deferred, never killed). +- **`enable --now` / boot-catchup firing a missed bump mid-day** → `Persistent=true` + dropped from the timer. 
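The pre-bump backup idea in the list above can be sketched with Python's standard `sqlite3` module. This is an illustration only, assuming SQLite 3.27 or newer (where `VACUUM INTO` exists); `snapshot_state` and its parameters are hypothetical names, not the real `t3-backup-state` implementation.

```python
import sqlite3
from pathlib import Path


def snapshot_state(db_path: Path, backup_path: Path, timeout_s: float = 30.0) -> Path:
    """Write a consistent copy of db_path to backup_path using VACUUM INTO.

    The copy is taken online through an ordinary read transaction, so the
    database may stay open elsewhere. The target must not already exist,
    so a stale backup is never silently overwritten.
    """
    if backup_path.exists():
        raise FileExistsError(backup_path)
    # isolation_level=None selects autocommit mode: VACUUM cannot run
    # inside an open transaction. timeout_s is the busy-wait guard.
    conn = sqlite3.connect(str(db_path), timeout=timeout_s, isolation_level=None)
    try:
        conn.execute("VACUUM INTO ?", (str(backup_path),))
    finally:
        conn.close()
    return backup_path
```

With such a snapshot taken before a version bump, rolling back becomes a file restore rather than table-level surgery on a migrated database.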
+ +Mechanism + freeze/revert/rollback ops: `docs/runbooks/t3-version-bump.md`. +First live cutover 2026-06-16: `0.0.26` → `0.0.28-nightly.20260616.571`, gated — +emo + ancamilea migrated + pairing-verified, wizard deferred (active session). +The headless `t3 serve` has **no in-app self-updater** (verified: no update-check +/ npm shell-out in `dist/bin.mjs`), so the npm install is the sole version +authority; the t3 UI's Stable/Nightly toggle governs the unused **desktop** app. diff --git a/docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md b/docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md new file mode 100644 index 00000000..3d5d8750 --- /dev/null +++ b/docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md @@ -0,0 +1,76 @@ +# Post-mortem: Authentik downgrade boot storm + shared-PG failover (2026-06-10) + +**Impact:** Authentik (and therefore forward-auth for all ~67 `auth="required"` +ingresses and every OIDC app) degraded/unavailable for ~50 minutes +(~22:20–23:10 UTC). The auth-proxy basicAuth fallback served Emergency Access +prompts during outpost-check failures. The shared CNPG primary failed over +(pg-cluster-2 → pg-cluster-1, 22:40:58 UTC), briefly disturbing every PG-backed +tenant. + +**Trigger:** a routine values-only `tg apply` on `stacks/authentik` (first-time +signin speedup work — env tuning, outpost config, static-asset ingress). + +## Root causes (three stacked) + +1. **Helm/Keel version split → silent downgrade.** Keel (namespace + `keel.sh/enrolled` + diun annotations) had upgraded the live authentik + image to `2026.2.4`, while the Helm release pinned chart `2026.2.2` (whose + appVersion drives the image tag). The values-only apply therefore rolled + every server/worker pod BACK to `2026.2.2` against a `2026.2.4`-migrated + database. 
Cores never came up healthy (`failed to proxy to backend`, plus + Django cross-version serialized-cache warnings), and mid-storm Keel + re-upgraded the image, adding a third ReplicaSet to the churn. + +2. **Liveness budget too small for authentik's boot.** The chart-default + liveness probe (3×10s, 3s timeout) kills a pod ~30s after the go layer + passes the startup probe — but during a rolling restart the Python core + still waits on authentik's DB **migration advisory lock** (60–120s+ under + contention). kubelet kill-looped every booting pod, and each kill increased + lock contention for the rest (thundering herd). + +3. **Ghost lock holders.** Pods killed mid-migration-check left PgBouncer + server connections `idle in transaction` still **holding the migration + advisory lock** (observed twice: `SELECT * FROM authentik_version_history` + idle 2+ min). Every subsequent boot serialized behind a dead client. + PgBouncer had no `idle_transaction_timeout`, so the ghosts never expired. + +**Aggravator:** `AUTHENTIK_POSTGRESQL__CONN_MAX_AGE=60` (newly made live) made +every Django thread hold its connection persistently; with PgBouncer in +*session* mode each one pins a server connection 1:1, so the restart churn +saturated all 3×(20+5) pool slots (58s/s client wait observed; authentik held +75 of 108 connections on the new primary). The shared primary's +restart/failover at 22:40 fits this storm window. + +## Resolution + +- Scaled workers to 0 (transient) to free pool capacity; rollout converged + once, then re-degraded when workers returned. +- Emergency `kubectl patch` of the server liveness probe (3×10s/3s → + 6×10s/5s) — final state codified in Helm values in the same session. +- `pg_terminate_backend()` on the ghost `idle in transaction` lock holders + (twice). +- Scaled servers to 1 so a single `2026.2.4` pod booted uncontended, then back + to 3 — converged cleanly (51s boots, zero restarts). 
+- Final `tg apply` reconciled everything (image tag pinned, conn_max_age + removed, liveness in values, pgbouncer reaper config). + +## Prevention (all landed in this change) + +| Cause | Fix | +|---|---| +| Helm/Keel version split | `global.image.tag` pinned in `values.yaml` to the Keel-managed live tag, with a comment requiring the pin be refreshed whenever the chart is touched. Long-term: bump the chart pin when Keel moves the image (diun notifies). | +| Liveness kill loop | `server.livenessProbe` 6×10s / 5s timeout in values (startup probe still bounds total boot at 60×10s). | +| Ghost advisory-lock holders | `idle_transaction_timeout = 300` in `pgbouncer.ini` + config-checksum annotation so ini changes actually roll pgbouncer pods. | +| Pool saturation | `CONN_MAX_AGE` removed (per-request connections are ~1–2ms through local PgBouncer; not worth pinning server connections in session mode). values.yaml carries a do-not-set warning. | + +## Lessons + +- **Check the live image tag against the chart pin before ANY helm-managed + apply on a Keel-enrolled namespace.** `kubectl get deploy <x> -o + jsonpath='{..image}'` vs the chart's appVersion — a mismatch means the apply + is a version change, not a config change. +- A "stuck rollout" of authentik is usually the migration advisory lock: + check `pg_locks` joined to `pg_stat_activity` for `idle in transaction` + holders before blaming probes or resources. +- The auth-proxy basicAuth fallback worked as designed throughout (Emergency + Access path); without it every protected app would have hard-failed. 
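The first lesson (compare the live image tag with the chart pin before applying) can be sketched as a small pre-flight check. This is a hedged illustration: `image_tag` and `apply_changes_version` are hypothetical helpers, and the sketch assumes the chart's appVersion is used verbatim as the image tag, as described above for this chart.

```python
def image_tag(image_ref: str) -> str:
    """Extract the tag from an image reference such as 'registry/org/app:1.2.3'."""
    ref = image_ref.split("@", 1)[0]   # drop any '@sha256:...' digest suffix
    name_start = ref.rfind("/") + 1    # a ':' before the last '/' is a registry port
    _, sep, tag = ref[name_start:].partition(":")
    return tag if sep else "latest"    # untagged references default to 'latest'


def apply_changes_version(live_image: str, chart_app_version: str) -> bool:
    """True when applying the chart would move the running image to another version."""
    return image_tag(live_image) != chart_app_version
```

Fed the values from this incident (live tag `2026.2.4`, chart appVersion `2026.2.2`), the check returns `True`, flagging the "values-only" apply as a version change.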
diff --git a/docs/post-mortems/2026-06-10-forgejo-retention-orphaned-indexes.md b/docs/post-mortems/2026-06-10-forgejo-retention-orphaned-indexes.md new file mode 100644 index 00000000..88378ae5 --- /dev/null +++ b/docs/post-mortems/2026-06-10-forgejo-retention-orphaned-indexes.md @@ -0,0 +1,67 @@ +# 2026-06-10 — forgejo retention orphaned OCI index children (kms-website) + +## Impact + +- `viktor/kms-website:latest` and `:dfc83fb` unpullable (index children + HTTP 404). No runtime impact — the deployed tag `:a794d1a` was intact + and `imagePullPolicy: IfNotPresent` kept running pods unaffected. +- `RegistryManifestIntegrityFailure` firing from ~08:30 EEST; + `forgejo-integrity-probe` reported 4 failures across 60 indexes. + +## Root cause + +The `forgejo-cleanup` retention CronJob (live since 2026-06-09, first +deleting run 2026-06-10 04:00) computes its keep-set over package +**versions**: newest `KEEP_LAST_N=10` + tag `latest` + `*cache*` tags. +Forgejo's container registry stores multi-arch / buildx-attestation +**index children as separate untagged sha256 versions**. For images not +rebuilt recently, those children sort *older* than the newest-10 window +and were deleted while their parent index (a kept tag) survived → +orphaned indexes, children 404. + +The 2026-06-09 go-live verification ("0 running images on the delete +set") checked running **pods** against the delete list — it could not +see index→child references, so the corruption class passed review. + +Detection worked as designed: `forgejo-integrity-probe` (15-min catalog +walk + manifest HEAD) caught it the same morning. Two probe-run quirks +slowed diagnosis: runs occasionally die at startup (`apk add` during +transient DNS blips at cron ticks, `set -eu`), so the alert's +active-since (08:29:52) lagged the 04:00 corruption. + +## Fix applied (2026-06-10) + +1. 
`forgejo_cleanup_dry_run = true` (stacks/forgejo/cleanup.tf, applied) + — retention logs but deletes nothing until the keep-set is + container-aware. +2. `:latest` re-pointed at the intact `:a794d1a` index (registry + manifest PUT — `a794d1a` is also the newest commit of the repo, so + content is correct). +3. Corrupt, obsolete `:dfc83fb` package version deleted. +4. Probe re-run: **0 failures across 22 repos / 63 tags / 59 indexes**. + +## Follow-up (required before re-enabling deletes) + +Pick one: +- (a) keep-set expansion: for every kept tagged version, resolve the + manifest via the registry API; if it is an index, add all child + digests to the keep set; +- (b) never delete untagged sha256 versions (simpler, but untagged + garbage accumulates and the PVC pressure that motivated retention + returns — registry PVC sits at its 50Gi ceiling on the HDD, + see beads code-oflt); +- (c) replace the custom script with Forgejo's native per-owner package + cleanup rules, which are container-aware. + +Also worth probing beyond `TAGS_PER_REPO=5`: older tags of any +multi-arch image may already be orphaned (only newest-5 per repo are +verified). Harmless until someone pulls an old tag. + +## Lessons + +- "No running pod uses it" is not a safe deletion predicate for OCI + artifacts — reference graphs (index → child manifests) must be + resolved at the registry level. +- A `set -eu` probe whose first statement is a network package install + conflates "registry broken" with "apk blip"; pre-bake the image or + tolerate install retries. 
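Follow-up option (a), keep-set expansion, can be sketched as below. This is a minimal illustration, not the `forgejo-cleanup` script: `expand_keep_set` and the injected `fetch_manifest` callable are hypothetical names, and the real registry call (authentication, `Accept` headers) is deliberately left out. The two media types are the standard OCI index and Docker manifest-list types.

```python
from typing import Any, Callable, Iterable, Mapping, Set

# Media types that mark a manifest as a multi-arch index / manifest list.
INDEX_MEDIA_TYPES = {
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
}

# Hypothetical signature: fetch_manifest(reference) -> parsed manifest JSON.
FetchManifest = Callable[[str], Mapping[str, Any]]


def expand_keep_set(kept_refs: Iterable[str], fetch_manifest: FetchManifest) -> Set[str]:
    """Return kept_refs plus the digest of every child an index points to."""
    keep: Set[str] = set()
    for ref in kept_refs:
        keep.add(ref)
        manifest = fetch_manifest(ref)
        # Some indexes omit mediaType, so a 'manifests' list also counts.
        is_index = manifest.get("mediaType") in INDEX_MEDIA_TYPES or "manifests" in manifest
        if is_index:
            for child in manifest.get("manifests", []):
                keep.add(child["digest"])
    return keep
```

The retention job would then delete only versions absent from the expanded set, so an untagged child is never removed while a kept index still references it.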
diff --git a/docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md b/docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md new file mode 100644 index 00000000..a9cd8c96 --- /dev/null +++ b/docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md @@ -0,0 +1,176 @@ +# 2026-06-10 — tuya-bridge down 7.5h: forgejo image pulls ride the public-IP hairpin + +## Impact + +- `tuya-bridge` (Flask/tinytuya bridge feeding HA-Sofia's ATS, fuse-main, + fuse-garage and 4 thermostat REST sensors) unavailable ~02:15–09:50 EEST. + HA REST sensors 503'd; the official-tuya integration devices were + unaffected (hybrid architecture limited the blast radius to the 3 power + devices' advanced telemetry + thermostats extras). +- Third incident from the same root cause class: + Woodpecker buildkit pushes (2026-06-04, code-yh33), tripit + ImagePullBackOff on node2/node3 + devvm git timeouts (2026-06-09), + tuya-bridge (this one). + +## Timeline (EEST) + +- **02:15** — tuya-bridge pod rescheduled onto `k8s-node3` (its previous + node5/6-era home was rebuilt 14d ago; the forgejo-path image was never + cached on node3 — only stale `docker.io/*` copies). Kubelet must pull + `forgejo.viktorbarzin.me/viktor/tuya_bridge:3216c87a`. +- **02:15→09:30** — 51 consecutive pull failures: + `dial tcp 176.12.22.76:443: i/o timeout` → ImagePullBackOff. HA shows + 503s (emo observed at 02:20). +- **09:40** — investigation: forgejo healthy via internal Traefik + (`10.0.20.203`), manifest exists; node3's hosts.toml mirror present and + correct; bare-IP request to the mirror returns **404 from Traefik**; + registry auth realm is the **absolute** public URL. +- **09:48** — `/etc/hosts` pin `10.0.20.203 forgejo.viktorbarzin.me` added + on node3; `crictl pull` succeeds immediately; pod replaced → Running; + `/health` ok; all 27 device `getstatus()` calls succeed; all 7 + `*_tuya_cloud_up` Prometheus gauges = 1. 
+- **10:05** — pin rolled to all 7 nodes; provisioning scripts + docs updated. + +## Root cause + +Fresh kubelet pulls of `forgejo.viktorbarzin.me` images depend on pfSense +NAT reflection of the public IP `176.12.22.76`, which is intermittently +broken from the `10.0.20.0/24` network. The containerd +`certs.d/.../hosts.toml` mirror that was *believed* to keep pulls internal +cannot do so, for two independent reasons: + +1. **Traefik routes by Host/SNI.** The mirror entry + `[host."https://10.0.20.203"]` makes containerd dial the bare IP (no + SNI, `Host: 10.0.20.203`) — no Traefik router matches → **404** → + containerd treats the mirror as a miss and falls back to + `server = "https://forgejo.viktorbarzin.me"` → public DNS → hairpin. +2. **The Bearer auth realm is absolute.** `/v2/` challenges with + `realm="https://forgejo.viktorbarzin.me/v2/token"`; containerd fetches + that URL verbatim — this leg never goes through the mirror at all. + +So every fresh pull silently depended on hairpin luck. Cached images masked +the problem; it only fired when a pod landed on a node without the image +(node rebuilds, new nodes, evictions, new tags). + +Why DNS-side fixes don't reach this path: nodes resolve via systemd-resolved +→ pfSense (10.0.20.1) + public fallback (94.140.14.14), so Technitium +split-horizon (scoped to `192.168.1.0/24` clients) never applies; the +CoreDNS forgejo rewrite (2026-06-04) covers pods only, not kubelet. + +## Fix + +**Initial mitigation (same morning):** `/etc/hosts` pin +`10.0.20.203 forgejo.viktorbarzin.me` on every node — restored service +immediately (resolve + token + blob legs all internal with correct SNI).
+ +**Superseded same day (Viktor: "no hardcoded IPs in nodes") by a DNS-based +fix.** Discovery: Technitium's split-horizon zone *already* resolves +`forgejo.viktorbarzin.me → CNAME viktorbarzin.me → A <live Traefik IP>` — +the `technitium-ingress-dns-sync` CronJob auto-CNAMEs every ingress host +hourly, the apex A record tracks the live Traefik LB IP, and the +`viktorbarzin-apex-probe` canary alerts on drift. The nodes simply never +queried Technitium (resolv chain: pfSense + public AdGuard fallback). The +devvm already solved this with a systemd-resolved **routing domain** +drop-in; the same was rolled to all 7 nodes: + +``` +# /etc/systemd/resolved.conf.d/viktorbarzin.conf +[Resolve] +DNS=10.0.20.201 +Domains=~viktorbarzin.me +``` + +The `/etc/hosts` pins were then removed (verified `getent` still returns +the Traefik IP via DNS, and `crictl pull` succeeds). On node5/6 the +cloud-init `global-dns.conf` (`DNS=8.8.8.8 1.1.1.1`) was demoted to +`FallbackDNS=` only — public servers in the global set merge with and +race the routing domain. That file's original justification ("Technitium +NXDOMAINs forgejo.viktorbarzin.me") was obsolete: the ingress-dns-sync +has since added forgejo to the zone — a stale comment that actively +pointed new nodes at the hairpin. + +**Final architecture (same day, round 3 — Viktor: "no customization, +everything handled by the DNS infra"):** the routing-domain drop-ins were +ALSO removed; nodes are now completely stock. Two resolver-side changes +replaced them: + +1. **pfSense Unbound domain override** `viktorbarzin.me → 10.0.20.201` + (forward-zone to Technitium). Every Unbound client on every VLAN gets + the internal split-horizon answers with zero per-host config. No + DNSSEC complications (zone unsigned), private-IP answers pass, mail's + non-Traefik record (`→ 10.0.20.1`) verified working. Runbook: + `docs/runbooks/pfsense-unbound.md`; on-box backup + `config.xml.bak-2026-06-10-pre-me-forward`. +2. 
**CoreDNS pod carve-out** (TF, `stacks/technitium`): a dedicated + `viktorbarzin.me:53` server block pins forgejo to Traefik's + **ClusterIP** (interpolated from the live Service — pods cannot reach + the ETP=Local LB IP that pfSense now returns) and forwards all other + `.me` names to `8.8.8.8/1.1.1.1`, preserving pods' pre-existing + public-IP behavior. Replaces the old forgejo rewrite in `.:53`. + + **Addendum (same day, evening):** the "pods cannot reach the + ETP=Local LB IP" premise was re-tested and is FALSE on k8s 1.34 + (kube-proxy short-circuits in-cluster traffic to LB IPs via the + cluster path; verified from pods on three non-Traefik nodes). The + public-answer carve-out had meanwhile left pods as the only client + class still riding the TP-Link NAT loopback, which hard-died + 2026-06-09 — 27 non-proxied `[External]` uptime-kuma monitors dark. + Fix: the block now forwards to the Technitium ClusterIP + (`10.96.0.53`) — pods are ordinary internal clients; forgejo pin + kept for Technitium-outage resilience. In-cluster `[External]` + monitors now test the internal path for all names; genuine + edge-path fidelity belongs to a true external vantage (ha-london). + +node5/6 were also re-pointed from link-DNS=Technitium to +`10.0.20.1 94.140.14.14` (netplan + `qm set --nameserver` on PVE VMs +205/206) for fleet parity, and their `global-dns.conf` was deleted. + +**Renumber hazard: resolved.** A future Traefik LB renumber propagates +via the apex A record automatically (drift probe alerts if it doesn't); +only the vestigial hosts.toml literal goes stale. **Trade-offs:** +`viktorbarzin.me` resolution via pfSense depends on in-cluster Technitium +(3 replicas) — SERVFAIL during a full cluster outage (services down +anyway; bootstrap images pull via the IP-addressed `10.0.20.10` mirrors). +Nodes keep `94.140.14.14` as secondary DNS: a resolved failover during a +pfSense blip briefly re-exposes public answers — rare, self-healing, +accepted. 
+ +## Verification (final architecture) + +- All 7 nodes stock (no pins, no drop-ins); `getent hosts + forgejo.viktorbarzin.me` → `10.0.20.203` via pfSense → Technitium; + general resolution intact; `crictl pull` succeeds end-to-end. +- pfSense: forgejo/immich/vault → apex CNAME → `.203`; mail → + `10.0.20.1` (`:993` verified); `google.com` public; `.lan` auth-zone + unaffected. +- Pods: forgejo → `10.111.111.95` (Traefik ClusterIP), + immich → `176.12.22.76` (public, status quo) — verified in-pod after + CoreDNS reload. +- tuya-bridge pod Running; `/health` `ok=true`; 27/27 devices + `success=true`; 7/7 `*_tuya_cloud_up` gauges = 1; no tuya-related alerts. + +## Lessons + +- A mirror that *can* fall back to a broken path is not a fix — it's a + latency bomb with the blast delayed until the cache misses. +- Registry token realms are absolute URLs: any "redirect the registry" + scheme must also redirect the *name*, not just the endpoint. +- Before inventing a redirect mechanism, check what the DNS authority + already serves: the Technitium split-horizon zone had the correct, + auto-maintained answer all along — the clients just weren't asking it. +- Stale config comments are load-bearing: the obsolete "Technitium + NXDOMAINs forgejo" comment in cloud-init steered new nodes onto public + DNS, recreating the hairpin exposure on every node added after it. +- All `10.0.x` legs are now DNS-routed (nodes + devvm via routing domain, + pods via CoreDNS rewrite). pfSense Unbound host overrides remain an + option for other LAN segments if a non-Technitium client ever needs + internal answers (live network device — deliberate, separate change). + +## Related + +- Beads `code-2or8` (Tuya Cloud subscription) — verified resolved during + this incident: subscription is active again, all gauges green; closed. 
+- 2026-06-09 tripit ImagePullBackOff — same cause, self-recovered when the + hairpin flapped back; the two `ScrapeTargetDown[tripit]` alerts firing + during this investigation were scrapes of *Completed* cronjob pod + endpoints (separate monitoring wart, not this outage). diff --git a/docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md b/docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md new file mode 100644 index 00000000..d62ed5ff --- /dev/null +++ b/docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md @@ -0,0 +1,116 @@ +# 2026-06-11 — devvm dead ~90 min: QEMU-internal I/O stall on the legacy LSI disk path + +## Impact + +- devvm (VM 102, the shared multi-user Claude Code workstation) effectively + dead 15:21–16:48 UTC (18:21–19:48 EEST): all ssh/tmux and t3 sessions for + wizard/emo/anca lost, every in-flight agent killed. +- Detection was human (~90 min) — no `up{instance="devvm"} == 0` alert + exists (follow-up below). +- Recovery was manual: kill of the wedged QEMU process + `qm start` (the + kill left no autopsy — see "What we could not prove"). + +## Timeline (UTC; host journal runs EEST = UTC+3) + +- **15:01** — hourly `apply-mbps-caps` run live-rewrites VM 102's scsi0 + throttle via `qm set` (as it had done every hour for weeks — see Root + cause #4). +- **15:18–15:20** — guest healthy by every metric: CPU 7–16% of 16 vCPUs, + load 1.4, 17 GiB MemAvailable, swap flat at 2.0 GiB, host `sdc` 2–8% + utilized. Heavy claude/bwrap sandbox activity (normal workload). +- **15:19:08** — last journal line the guest ever writes (mid normal + traffic, zero kernel distress — not even a hung-task warning). +- **15:21** — host RRD (pvestatd polling QEMU over QMP once a minute) shows + `diskwrite` drop to **exactly 0 and stay 0 for 87 minutes** — not even + journal flushes. netout collapses 380K→7K/s. **QEMU keeps answering QMP + the whole time** — the process and its main loop are alive; only the + block path is dead. 
+- **15:21→15:39** — guest CPU (host's view) ramps 11% → ~50% and plateaus: + processes progressively piling up behind dead storage (dirty-page + writeback stuck → direct reclaim spins). Classic starvation cascade, not + a panic (a panic halts or spins flat from t=0). +- **16:47:42** — QMP socket resets: the wedged QEMU is killed out-of-band + (root shell; no PVE task, no snoopy line — shell-builtin `kill`). +- **16:48:31** — `qmstart` task; guest boots clean on kernel 6.8.0-124 + (wedged boot ran 6.8.0-117). + +## Ruled out (evidence, not vibes) + +- **Guest CPU/memory/swap pressure** — healthy at last scrape (Prometheus) + and per-minute host RRD. +- **Host storage** — `pve` thin pool 68% data / 15.5% meta; zero kernel + I/O errors on the host all day; `sdc` quiet through the window. +- **Host-side kill/OOM** — no OOM-killer lines, no segfault, no QEMU crash + log; 113 of 114 monitored targets stayed up. Only the devvm died. +- **Guest kernel panic** — would not keep QMP-visible blockstats frozen at + 0 while netout ACKs trickle; and the guest kernel logged nothing. + +## Root cause + +**Class pinned, exact line unprovable** (see below): the devvm's disk I/O +stalled *inside the QEMU process* — below the guest kernel (all guest I/O +froze simultaneously with nothing logged) and above host storage (host +clean, neighbors fine, QEMU main loop responsive). Contributing stack, +unique to this VM: + +1. **`scsihw: lsi`** — the emulated LSI 53C895A (1997 chip, QEMU's legacy + default for OSes without virtio drivers). The devvm was the **only VM + on the host** running its disk through this path; every healthy + neighbor uses `virtio-scsi-pci`. The LSI model is documented as + hang-prone under intensive I/O. +2. **No `iothread`** — all disk emulation ran on QEMU's single main event + loop, sharing it with timers and QMP. +3. **QEMU-level mbps throttle (60/60)** — a token bucket inside QEMU whose + queued I/O completes only when its re-arm timer fires. +4. 
**Hourly live throttle rewrites** — `apply-mbps-caps.sh`'s idempotency + check compared raw config strings, but `qm config` prints keys in its + own canonical order, so the check **never matched** and the script + re-issued `qm set` (→ live QMP `block_set_io_throttle` against the + running QEMU) every hour, 24×/day, for weeks — each poke a chance to + race the throttle machinery while queued I/O is in flight. The wedge + came 20 min after the 15:01 poke. + +## What we could not prove + +Whether the stuck queue was the LSI device model, the throttle-group +timer, or their interaction. The discriminating evidence (QMP +`query-block`, a stack trace of the QEMU process) existed in RAM at 16:47 +and was destroyed by the recovery kill. If a wedge recurs **autopsy before +shooting**: `qm guest exec` will fail but `qm monitor`/QMP `query-block`, +`query-status`, and `gdb -p <pid> -batch -ex 'thread apply all bt'` on the +kvm process pin it to the line. + +## Fixes + +| Status | Fix | +|---|---| +| shipped (this commit) | `apply-mbps-caps.sh` compares **normalized option sets** — hourly runs are now true no-ops; running VMs' throttle state is no longer rewritten 24×/day. Verified: reordered-key configs compare equal, real drift still triggers `qm set`, post-restart iothread configs compare equal. | +| staged, awaiting Viktor's cold stop→start | VM 102: `scsihw: virtio-scsi-single` + `scsi0 …,iothread=1,aio=threads` — replaces the LSI path with the paravirt controller all healthy VMs use, moves disk emulation off the main loop, swaps io_uring for boring thread-pool AIO. Guest pre-flight passed (`CONFIG_SCSI_VIRTIO=y` built-in; fstab on LVM dm-uuid/UUID). Must be a **full stop→start** — a guest reboot reuses the old QEMU process. | + +## Open follow-ups (discussed 2026-06-11, not yet built) + +- `DevvmDown` alert (`up{job="devvm"} == 0 for 3m` → Slack) — closes the + 90-min detection gap. 
+- Freeze forensics: netconsole → pve listener, serial console, + `kernel.panic=60`, and a capture-before-kill runbook (above) so any + recurrence is pinned, not mourned. +- The recurring *crawl* class (agent storms → swap-thrash; journald + watchdog-killed 3× on 2026-06-10) is a separate failure mode — + ssh/tmux sessions remain memory-uncontained by explicit decision + (swap-only, 2026-06-10). + +## Lessons + +- **A VM can die of QEMU-userspace causes that no guest or host kernel log + will ever show.** The host's per-VM RRD (pvestatd's QMP polls) is the + only witness — `diskwrite=0` with a live QMP socket is the signature. +- **"Idempotent" reconcilers must prove idempotency against the system's + canonical output format**, not against the string they themselves + constructed. A compare that never matches turns a safety net into a + 24×/day fault injector — and its own journal said `updating scsi0` + every hour, in plain sight, for weeks. +- The May-26 mbps caps fixed the sdc-saturation freeze class and + introduced this one's trigger surface. Layered mitigations fail in + layers — audit what a fix *adds*, not only what it removes. +- pve host logs are **EEST (UTC+3)**; guest logs are UTC. Every + cross-machine correlation in this incident initially looked 3h off. diff --git a/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md b/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md new file mode 100644 index 00000000..664869fa --- /dev/null +++ b/docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md @@ -0,0 +1,131 @@ +# 2026-06-22 — devvm memory/IO overload: per-user containment + OOM backstop + +## Impact + +- devvm (VM 102, the shared multi-user Claude Code workstation) became + unresponsive under combined memory + IO pressure and had to be **hard-killed + + rebooted** by the admin on 2026-06-22 (morning). All ssh/tmux + t3 sessions for + wizard/emo/anca lost, in-flight agents killed. 
+- Signature on the last htop before the kill: **load avg ~60** on 32 vCPU, **RAM + 22.5/23.5G**, **swap 13.9/14.0G (full)**, a wall of **D-state** (uninterruptible + IO-wait) processes, and a single `ugrep` in emo's tmux holding **~10G RES / + 64% CPU**. Many `claude --effort max/xhigh` sessions + playwright-chrome MCP + instances across three users on top. + +## This is the "crawl" class, not the QEMU-stall class + +The 2026-06-11 post-mortem (`2026-06-11-devvm-qemu-io-stall.md`) fixed a +*different* failure mode — a QEMU-userspace block-path wedge on the legacy LSI +controller. That fix shipped (verified 2026-06-22: the guest now boots on +`virtio_scsi`, `scsihw: virtio-scsi-single + iothread`). Its post-mortem +explicitly deferred **this** class: + +> The recurring *crawl* class (agent storms → swap-thrash; journald +> watchdog-killed 3× on 2026-06-10) is a separate failure mode — ssh/tmux +> sessions remain memory-uncontained by **explicit decision (swap-only, +> 2026-06-10)**. + +That explicit decision is the root cause closed here. + +## Root cause + +Work on the devvm lives in **two independent cgroup-v2 trees per user**, and only +one was capped: + +| Tree | cgroup | Cap before today | +|---|---|---| +| t3 web sessions | `system.slice/system-t3\x2dserve.slice/t3-serve@<user>` | `MemoryHigh=12G MemoryMax=16G MemorySwapMax=0 OOMPolicy=continue` ✓ | +| **ssh/tmux sessions** | `user.slice/user-<uid>.slice` | **`MemoryMax=infinity`, swap unlimited** ✗ | + +The uncapped `user-<uid>.slice` was the hole. A runaway there (the 10G `ugrep`; +stacked max-effort agents) grew unbounded, spilled into the **14G disk swap**, and +swap-thrashed the **host-mbps-throttled (60/60 MB/s) virtual disk**. That is the +overload chain: + +``` +uncapped tmux growth → disk-swap thrash on a throttled spindle + → IO storm (D-state pileup) → load ~60 → box unresponsive → hard kill +``` + +i.e. **memory pressure becomes the IO storm**. 
There was also **no global OOM +backstop** (no systemd-oomd / earlyoom) to shed the worst offender before the +kernel OOM or the thrash-wedge. And even the existing t3 caps don't sum safely +(3 users × 16G = 48G > 32G RAM) — nothing reasoned about the *whole box*. + +## Fix (`setup-devvm.sh` §10, applied live 2026-06-22) + +Design decisions (interviewed with the admin via `/grill-me`): **soft-generous +per-user caps + a hard ceiling + a kill-the-worst backstop**, maximising +single-user utilisation while making a box-wide wedge impossible. (The backstop +was first built on systemd-oomd, then switched to earlyoom mid-rollout when oomd +proved inert with `swap=0` — see Verification + Lessons.) + +| Layer | What | +|---|---| +| **Per-user caps, BOTH trees** | `user-.slice.d` drop-in gives every `user-<uid>.slice` the same `MemoryHigh=12G / MemoryMax=16G / MemorySwapMax=0` the t3 tree already had. A user is now bounded in whichever surface they work in. | +| **No disk swap for work** | `MemorySwapMax=0` on every work cgroup → a spike OOMs **locally** at the ceiling instead of thrashing the throttled disk. Kills the IO-storm-via-swap mechanism at the source. The 14G swapfile stays for system cold pages only. | +| **earlyoom backstop (free-RAM threshold)** | New package — used **instead of systemd-oomd** (which is inert with `swap=0`; see Lessons). Watches `MemAvailable%` and SIGTERMs the biggest task at **5%**, SIGKILL at **3%**, swap ignored (`-s 100`). `--avoid` keeps sshd/systemd/dockerd/containerd/t3-dispatch/tmux off the victim list (**the admin's way in always survives**); `--prefer` targets the agent/browser hogs (python3/node/chrome/…). Swap-independent and reliable, where oomd's pressure-kill was not. | +| **Fair-share CPU/IO** | `CPUWeight`/`IOWeight` per slice (system.slice 200, users + docker 100 each). Work-conserving — a lone user still gets all 32 cores + the full IO budget when others idle; weights only bite under contention. No hard CPU/IO caps. 
| +| **Docker containment** | Containers previously landed in `system.slice` — uncapped. Now `cgroup-parent: docker.slice` in `daemon.json` routes every container into a capped (`MemoryMax=8G`, swap 0) slice, so a runaway container is cgroup-OOM'd locally instead of escaping into the uncapped `system.slice`. | + +Durable in `setup-devvm.sh` (survives a VM rebuild); `earlyoom` added to +`packages.txt`. The numbers are tunable — `MemoryHigh=12G` will throttle a *lone* +heavy user between 12–16G even with RAM free; bump to 16/20 if that bites. + +## Verification (live, 2026-06-22) + +- **Caps live on running cgroups**: all three `user-<uid>.slice` report + `memory.high=12G memory.max=16G memory.swap.max=0`; `docker.slice` `memory.max=8G`; + daemon.json kept buildkit/nvidia/insecure-registries; paperless-mcp recovered + under `docker.slice`. +- **Stress test A (hard cap)** — the PRIMARY guard: a 2G-capped, swap=0 balloon was + killed at exactly 2G by the cgroup-local OOM (`constraint=CONSTRAINT_MEMCG`) with + **swap flat at 0MB throughout** — no thrash. Same mechanism protects every user + slice (16G) and `docker.slice` (8G). +- **Soft cap observed**: a balloon pushed past `MemoryHigh` sat at ~220M / 99% + memory.pressure, throttled to a crawl, making no progress and harming nothing — + a runaway is throttled, not just killed. +- **systemd-oomd disproven, then dropped**: a self-policed balloon held + `memory.pressure full avg10 = 96–99%` (≫ its 20% limit) for >70s but oomd never + killed it — `Pgscan: 0`. oomd's pressure-kill only acts on cgroups doing active + reclaim, which a `swap=0` anon workload never does. oomd was purged. +- **earlyoom backstop** — verified via `--dryrun`: at the threshold it logs + `low memory! … mem 90% swap 100%` (fires on RAM alone, swap ignored) and selects + `SIGTERM … "chrome"` (a `--prefer` hog), never an `--avoid`'d daemon. Live + earlyoom v1.7 confirms `SIGTERM mem<=5% / SIGKILL mem<=3%, swap<=100%`. 
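
The earlyoom thresholds verified above can be restated as a small shell sketch. This is not earlyoom's own logic, only the documented 5% / 3% cut-offs applied to `MemAvailable`; the function names `mem_available_pct` and `backstop_action` are hypothetical.

```bash
#!/usr/bin/env bash
# Illustrative only: mirrors the documented thresholds (SIGTERM at <=5%,
# SIGKILL at <=3% MemAvailable). Function names are hypothetical.

# Print MemAvailable as an integer percentage of MemTotal.
# $1: path to a meminfo-format file (defaults to /proc/meminfo).
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

# Map a free-RAM percentage to the action the backstop would take.
backstop_action() {
  local pct=$1
  if   [ "$pct" -le 3 ]; then echo SIGKILL
  elif [ "$pct" -le 5 ]; then echo SIGTERM
  else echo none
  fi
}
```

Running `backstop_action "$(mem_available_pct)"` on the devvm would show which band the box is in right now. Swap is deliberately absent from the calculation, matching the `-s 100` setting described above.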
+ +## Out of scope / follow-ups + +- **Alerting** (tracked, fast-follow bead): `DevvmDown` (closes the 90-min + detection gap the 2026-06-11 PM flagged), sustained-memory-PSI/swap pressure + early-warning, and an "earlyoom-killed-something" alert (earlyoom logs each kill; + `-N /script` can push a metric). devvm node-exporter is already scraped + (`job=devvm`, `10.0.10.10:9100`), so only alert *rules* are new (a + monitoring-stack Terraform change). +- **zram cushion**: considered, deferred. Could let work cgroups absorb spikes in + compressed RAM instead of OOMing at the ceiling; not needed for the wedge fix. +- **Per-user docker isolation**: containers share one `docker.slice` budget, not + per-user. Fine for current usage (krr + short-lived tools). +- **Host-side IO**: the 60/60 mbps cap + the shared `sdc` HDD IO domain are + host-level (bead `code-oflt`); unchanged here. + +## Lessons + +- **"Swap as the safety valve" is an IO-storm amplifier on a throttled disk.** + Leaving ssh/tmux memory-uncontained (the 2026-06-10 decision) traded a clean + local OOM for a box-wide swap-thrash wedge. `MemorySwapMax=0` + a hard cap turns + the failure back into a contained, local kill. +- **Cap the box, not one surface.** t3 sessions were capped for months while the + same user's tmux was unbounded — and the caps that existed didn't sum to < RAM. + Containment has to reason about every tree and the aggregate. +- **A backstop must protect the operator's way in.** earlyoom `--avoid`s + sshd/systemd/dockerd/containerd/t3-dispatch/tmux, so the box always stays + reachable to recover; only the agent/browser hogs are eligible victims. +- **systemd-oomd is the wrong backstop for a no-swap box — verify, don't assume.** + oomd's memory-pressure killer only fires on cgroups doing active reclaim + (`pgscan` rising). 
With `MemorySwapMax=0` + anonymous memory there is nothing to
+  reclaim, so a cgroup sat at 99% `memory.pressure` indefinitely and oomd never
+  acted (proven with `oomctl` + a balloon). The very `swap=0` that kills the IO
+  storm also neuters oomd. earlyoom (free-RAM threshold, swap-independent) is the
+  correct pairing. A famous tool that "does OOM" still has to be proven to fire
+  under *your* configuration.
diff --git a/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
new file mode 100644
index 00000000..e6b11816
--- /dev/null
+++ b/docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md
@@ -0,0 +1,97 @@
+# Post-mortem: k8s 1.34→1.35 upgrade stalled — etcd IO starvation (2026-06-24)
+
+> Filename kept for inbound links. The originally-suspected cause (kubeadm-config
+> OIDC drift) turned out **not** to be the crash — see "Correction" below. The OIDC
+> drift was a real *separate* latent bug fixed in the same change.
+
+**Impact:** The autonomous k8s-version-upgrade chain (23:00 UTC nightly) reached
+the master control-plane phase for the first time — preflight passed, etcd
+snapshot taken, master cordoned + drained, etcd upgraded 3.6.5→3.6.6 — then the
+kube-apiserver upgrade to v1.35.6 **crash-looped**. kubeadm waited its 5-minute
+static-pod-hash window across all internal retries, then auto-rolled-back to
+v1.34.9. The cluster stayed healthy on 1.34.9 (apiserver, all 7 nodes Ready), but
+the run left **k8s-master cordoned** and the chain **wedged on `in_flight=1`**.
+No data loss; no user-facing outage (the master carries control-plane taints, so
+no workloads were displaced).
+
+**Trigger:** the first *minor* upgrade the chain ever attempted (1.34→1.35) — the
+first time kubeadm upgrades etcd (3.6.5→3.6.6) and regenerates the control-plane
+static pods, i.e. the first time the upgrade pushes real write-IO at etcd.
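
Because the trigger was the chain's first minor-version bump, a preflight could classify the bump before it starts and treat minor upgrades more cautiously. A minimal sketch, assuming plain `vMAJOR.MINOR.PATCH` version strings; `upgrade_kind` is a hypothetical helper and not part of the chain described here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify a Kubernetes version bump.
# Prints one of: major, minor, patch, none.
upgrade_kind() {
  local from=${1#v} to=${2#v}          # strip an optional leading "v"
  local from_major from_minor to_major to_minor
  IFS=. read -r from_major from_minor _ <<<"$from"
  IFS=. read -r to_major to_minor _ <<<"$to"
  if   [ "$from_major" != "$to_major" ]; then echo major
  elif [ "$from_minor" != "$to_minor" ]; then echo minor
  elif [ "$from" != "$to" ];             then echo patch
  else echo none
  fi
}
```

For the upgrade in this post-mortem, `upgrade_kind v1.34.9 v1.35.6` classifies the bump as `minor`, the case in which kubeadm also upgrades etcd and regenerates the static pods.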
+ +## Root cause — etcd IO starvation on the shared HDD + +The new kube-apiserver could not establish/keep a working connection to etcd +during the upgrade because **etcd was IO-starved**. etcd's surviving container log +from the crash window (`/var/log/pods/.../etcd/0.log`, 23:04–23:20 UTC) shows: + +- **1,180** `apply request took too long` warnings in 16 minutes; +- individual applies of **4.3s / 2.9s / 2.7s / 1.8s** (healthy is <100ms), + clustered at **23:18:51 UTC** — exactly when kubeadm's final attempt was trying + to bring the new apiserver up. + +A reproduced 1.35.6 apiserver with no etcd dies with +`F instance.go:233 Error creating leases: error creating storage factory: context +deadline exceeded` — the same failure mode a multi-second etcd produces. etcd +lives on the contended `sdc` HDD (**beads code-oflt**: "etcd/critical VM disks on +shared sdc HDD — recurring IO-storm root cause"). The upgrade itself piled IO onto +that spindle: + +1. etcd's own upgrade-restart + WAL/db re-read (it restarted ~23:04, re-elected); +2. kubeadm dumping a full **~400MB etcd DB backup** to + `/etc/kubernetes/tmp/kubeadm-backup-etcd-<ts>/` (on the same HDD) before the + etcd upgrade — and **145 of these had accumulated to 28GB** (kubeadm never + cleans them up), pushing master root fs to **73%**, above the 70% kubelet + image-GC threshold, so image GC churned during the drain too; +3. master-drain pod evictions. + +### Correction — it was NOT the OIDC flag swap + +`kubeadm upgrade diff v1.35.6` showed the regenerated manifest also swaps +`--authentication-config` (structured multi-issuer OIDC) back to legacy +single-issuer `--oidc-*` flags (kubeadm-config drift, see secondary finding). That +was the *first* hypothesis — but an isolated repro of the 1.35.6 apiserver with +those exact `--oidc-*` flags **and authentik reachable** initialised OIDC cleanly +(`oidc.go:313`, no error) and ran fine until it hit the (deliberately dead) test +etcd. 
So the auth swap does **not** crash the apiserver; it was a red herring for +the crash. Image pull (all v1.35.6 images pre-pulled), OOM (none), and disk-full +were also ruled out. + +## Secondary finding (real, fixed separately) — kubeadm-config OIDC drift + +apiserver auth is configured in three places that must agree: +(1) `/etc/kubernetes/pki/auth-config.yaml` (structured, two issuers: `kubernetes` ++ `k8s-dashboard`, added 2026-06-19); (2) the live static-pod manifest +(`--authentication-config`); (3) the kubeadm-config `ClusterConfiguration` CM — +which still carried the legacy `--oidc-*` extraArgs. `kubeadm upgrade` regenerates +the manifest from (3), so it would have reverted structured auth → **dashboard + +kubectl SSO break after a successful upgrade** (recoverable: the chain's +post-master `restore.sh` re-adds the flag). This is a real bug, just not the crash. + +## Resolution + +1. **Reclaimed the 28GB kubeadm scratch** on master (`/etc/kubernetes/tmp/kubeadm-backup-*`) — root fs 73% → 23%. +2. **Reconciled kubeadm-config live** (zero cluster impact — CM only read at upgrade time): dropped `--oidc-*`, added `--authentication-config` via `kubeadm init phase upload-config kubeadm`. `kubeadm upgrade diff` then shows only the control-plane image bumps. +3. **Recovered:** uncordoned k8s-master, cleared the stuck `in_flight` gauge + annotation, deleted last night's Complete/Failed `1-35-6` phase jobs (a Complete preflight would otherwise make the detector idempotent-skip the re-run). + +## Prevention (landed in this change) + +| Gap | Fix | +|-----|-----| +| kubeadm leaks ~400MB etcd-DB backups into `/etc/kubernetes/tmp` forever (→ disk fills, image-GC churn, write-IO on etcd's spindle) | **`upgrade-step.sh` preflight now prunes** `/etc/kubernetes/tmp/kubeadm-backup-*` + `kubeadm-upgraded-manifests*` older than 3 days on master, every run. Best-effort, never aborts. 
|
+| kubeadm-config drift would silently break SSO after an upgrade | `apiserver-oidc.tf`'s remote script now **also reconciles kubeadm-config** (`kubeadm init phase upload-config`), delivered via the `apiserver-oidc-restore` ConfigMap the chain re-runs (CI needs no ssh) or a local `-replace` apply. Preflight **alerts** (not blocks — SSO drift is recoverable) if `kubeadm upgrade diff` would still drop `--authentication-config`. |
+| etcd on the contended `sdc` HDD starves under upgrade IO | **Durable fix is beads code-oflt** (move etcd/critical VM disks off `sdc`). Not in this change. Mitigations above reduce the upgrade's own IO; reclaimed disk removes the image-GC variable. |
+
+## Lessons
+
+- **Capture the failing component's own logs before concluding.** The `kubeadm
+  upgrade diff` made the OIDC swap look like the cause; only etcd's log (multi-second
+  applies) + an isolated apiserver repro showed the truth (etcd IO). A clean diff is
+  "what config changes," not "why it crashed."
+- **etcd on shared HDD is the cluster's recurring fragility** (immich IO storm
+  2026-05-25, this stall). Upgrades concentrate IO (etcd restart + kubeadm's 400MB
+  backup copy + drain) onto that spindle. code-oflt is the real fix.
+- **Tools that leave per-operation scratch must be reaped.** kubeadm's
+  `/etc/kubernetes/tmp` etcd backups are throwaway (real backups → NFS) but never
+  GC'd; 28GB had silently accumulated.
+- **Out-of-band control-plane edits must be written back to kubeadm-config** — else
+  `kubeadm upgrade` silently reverts them (here: SSO; could be admission/audit/API flags).
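
The scratch-reaping lesson can be illustrated with a dry-run listing of what a prune would touch. This is a sketch, not the code in `upgrade-step.sh`: `list_prune_candidates` is a hypothetical name, it only prints paths, and `find -mtime +3` rounds age down to whole days, so it matches entries roughly four or more days old.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run helper: list (never delete) kubeadm scratch entries
# older than a cut-off.
# $1: directory to scan (e.g. /etc/kubernetes/tmp); $2: age in days (default 3).
list_prune_candidates() {
  local dir=$1 days=${2:-3}
  find "$dir" -mindepth 1 -maxdepth 1 \
    \( -name 'kubeadm-backup-*' -o -name 'kubeadm-upgraded-manifests*' \) \
    -mtime "+$days" -print | sort
}
```

Listing first and deleting in a separate, reviewed step keeps the prune best-effort, in the spirit of the "never aborts" behaviour described in the Prevention table.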
diff --git a/docs/runbooks/breakglass-ssh.md b/docs/runbooks/breakglass-ssh.md new file mode 100644 index 00000000..348586f8 --- /dev/null +++ b/docs/runbooks/breakglass-ssh.md @@ -0,0 +1,158 @@ +# Runbook: Break-glass SSH + +Cold-survivable, brute-force-proof SSH onto the home LAN for when the Kubernetes +cluster and its remote-access tunnels (Headscale, cloudflared) are down but the +**Proxmox host + edge router are up**. Redesigned 2026-06-11 — the previous +port-knock design is decommissioned (see "History" below). + +## Model (as built) + +``` +your laptop (anywhere) ── ssh -p 52222 ──▶ edge router 192.168.1.1 + │ WAN tcp/52222 ─▶ 192.168.1.127:52222 + ▼ + Proxmox host 192.168.1.127 + sshd :52222 (key-only, break-glass key ONLY) + → full LAN via ssh -J / ssh -D +``` + +- **No port-knock.** Plain `ssh -p 52222`. The SSH key is the only gate. +- **Key-only**, brute-force-proof. The exposed `:52222` trusts **only** the + dedicated break-glass key (`/root/.ssh/authorized_keys.breakglass`), separate + from root's normal LAN-admin keys, so it is independently revocable and a leak + of any other root key does not grant internet access. +- **Rate-limited** per source IP (iptables hashlimit) + **fail2ban**. These trim + scanner noise only; key-only auth is the real protection. +- **Exposed, not hidden.** `:52222` answers on the WAN (Shodan-visible). This is + a deliberate, documented exception to the Wave-1 "no public-IP access" policy + (see `docs/architecture/security.md`), chosen for self-containment: it has **no + dependency on the cluster** (unlike Headscale/cloudflared) and nothing to + remember (unlike the old knock, whose sequence lived only in in-cluster Vault). 
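
The key isolation in the model above can be pictured as an sshd drop-in that binds a separate `authorized_keys` file to the exposed port. The sketch below is a guess at the general shape only: the real `scripts/sshd-10-breakglass.conf` is not shown here and may differ, `render_breakglass_dropin` is a hypothetical helper, and port 22 as the normal LAN port is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical: print an sshd drop-in that isolates the break-glass port.
# The directives are standard sshd_config keywords; the layout is assumed.
render_breakglass_dropin() {
  local port=${1:-52222}
  cat <<EOF
# Keep the normal LAN port (assumed 22) and add the break-glass port.
Port 22
Port ${port}

# Connections arriving on the break-glass port trust only the dedicated key file.
Match LocalPort ${port}
    AuthorizedKeysFile /root/.ssh/authorized_keys.breakglass
    PasswordAuthentication no
    AuthenticationMethods publickey
EOF
}
```

A rendered file should still be validated with `sshd -t` before any reload, as the deploy steps in this runbook do.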
+ +## Secrets (Vault `secret/viktor`) + +| Key | Use | +|---|---| +| `breakglass_ssh_pubkey` | authorized on the host (`authorized_keys.breakglass`) | +| `breakglass_ssh_privkey` | the private key (also on your laptop at `~/.ssh/breakglass_ed25519`) | + +The key has **no passphrase** (so it works in a true cold event without anything +to recall). Treat the private key as the sole credential — guard the laptop copy. + +> Leftover: `breakglass_knock_sequence` is dead (knock decommissioned). It is +> inert; remove it when you have a Vault token with the `patch` capability +> (`vault kv patch` / merge-patch — the everyday token lacks it). + +## Connect + +Client `~/.ssh/config`: + +``` +Host breakglass + HostName viktorbarzin.ddns.net # follows the dynamic WAN IP + Port 52222 + User root + IdentityFile ~/.ssh/breakglass_ed25519 + IdentitiesOnly yes +``` + +Then: + +```bash +ssh breakglass # shell on the Proxmox host +ssh -J breakglass root@10.0.20.1 # jump to pfSense (or any LAN host) +ssh -D 1080 breakglass # SOCKS5 → reach any internal IP +``` + +There is **no `bg()` knock function** anymore — delete it from your shell rc if +you added it under the old design. + +## Cold-event IP cheat sheet (cluster DNS is down) + +| Host | IP | +|---|---| +| Proxmox host | `192.168.1.127` | +| pfSense | `10.0.20.1` (WAN `192.168.1.2`) | +| k8s API | `10.0.20.100` | +| Synology NAS | `192.168.1.13` (reach via `ssh -J breakglass`) | +| edge router | `192.168.1.1` | + +## Deploy / re-provision the host config + +Source of truth lives in `infra/scripts/`. To (re)deploy: + +```bash +# 1. break-glass key authorized for the exposed port +PUB="$(vault kv get -field=breakglass_ssh_pubkey secret/viktor)" +ssh root@192.168.1.127 "printf '%s\n' '$PUB' > /root/.ssh/authorized_keys.breakglass && chmod 600 /root/.ssh/authorized_keys.breakglass" + +# 2. 
sshd drop-in (dual-port, Match-isolated) — validate before reload (anti-lockout) +scp scripts/sshd-10-breakglass.conf root@192.168.1.127:/etc/ssh/sshd_config.d/10-breakglass.conf +ssh root@192.168.1.127 'sshd -t && systemctl reload ssh' + +# 3. firewall (rate-limit) + boot unit +scp scripts/breakglass-firewall.sh root@192.168.1.127:/usr/local/sbin/breakglass-firewall.sh +ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh && systemctl enable --now breakglass-firewall.service' + +# 4. fail2ban jail +scp scripts/fail2ban-breakglass-sshd.local root@192.168.1.127:/etc/fail2ban/jail.d/breakglass-sshd.local +ssh root@192.168.1.127 'systemctl restart fail2ban && fail2ban-client status sshd' +``` + +The `breakglass-firewall.service` unit (oneshot, `RemainAfterExit=yes`, +`Before=network-online`-ish ordering) is a manual host unit — recreate it if the +host is rebuilt: + +```ini +[Unit] +Description=Break-glass base firewall (key-only SSH on :52222) +After=network-pre.target +Wants=network-pre.target +[Service] +Type=oneshot +ExecStart=/usr/local/sbin/breakglass-firewall.sh +RemainAfterExit=yes +[Install] +WantedBy=multi-user.target +``` + +## Edge-router forward (manual — live device, not Terraform) + +TP-Link Archer AX6000 (`192.168.1.1`) → Advanced → NAT Forwarding → Port +Forwarding. The break-glass rule: + +| Service Name | Device IP | External Port | Internal Port | Protocol | +|---|---|---|---|---| +| `breakglass-ssh` | `192.168.1.127` | `52222` | `52222` | TCP | + +**AX6000 quirks (learned 2026-06-11 — do not relearn the hard way):** +- **External port must equal internal port.** The firmware rejects any remap + (e.g. `22 → 52222`) with *"External Port: This item conflicts with existed + ones."* Hence ext==int 52222. +- **Port 22 is reserved** — even `22 → 22` is refused. Break-glass cannot use 22. +- **Row delete is immediate** (no confirm dialog) — clicking the trash icon + removes the rule and toasts "Operation succeeded". 
+- Automation: `~/wizard/tools/insecure-browse/add-forward.{sh,js}` (dockerized + Playwright; double-gated save `DRY_RUN=0 CONFIRM_SAVE=1`; supports + `RULES_JSON` add, `EDIT_RULES_JSON` protocol-edit, `DELETE_RULES_JSON` + identity-guarded delete). Router password: Vault + `secret/viktor/edge_router_192_168_1_1_password`. + +## Rotate / revoke + +- **Revoke instantly:** remove the line from `/root/.ssh/authorized_keys.breakglass`. +- **Rotate the key:** `ssh-keygen -t ed25519 -a 100 -f ~/.ssh/breakglass_ed25519`, + `vault kv patch secret/viktor breakglass_ssh_privkey=@... breakglass_ssh_pubkey=...`, + redeploy step 1 above. +- **Router reset wipes forwards:** re-add the `breakglass-ssh` rule above. + +## History + +- **2026-05-30:** original design — key-only SSH on `:52222` gated behind a + **UDP port-knock** (knockd). Decommissioned 2026-06-11: the knock added no real + security (the SSH key already makes the port brute-force-proof) and its only + benefit — hiding the port — came at the cost of a **circular dependency**: the + knock sequence lived only in in-cluster Vault, unreachable in the exact + cold/away scenario break-glass exists for. That caused a real lockout. The + knockd package + config + the legacy Synology SSH forward (ext 3333 → .13:22) + were removed. diff --git a/docs/runbooks/breakglass-ui.md b/docs/runbooks/breakglass-ui.md new file mode 100644 index 00000000..a79c9f14 --- /dev/null +++ b/docs/runbooks/breakglass-ui.md @@ -0,0 +1,114 @@ +# Runbook: devvm breakglass UI (claude-breakglass) + +Last updated: 2026-06-12 + +## What this is + +`breakglass.viktorbarzin.me` — an in-cluster Claude-driven web UI for recovering +the **devvm** (Proxmox VM 102) when it is wedged but the cluster is healthy (the +**warm** case). You chat with a Claude agent that SSHes into the devvm to +diagnose/repair it, and there are manual buttons that power-cycle the VM via the +Proxmox host even if the Anthropic API is down. + +This is NOT the cold breakglass. 
If the **cluster or PVE host** is down, this UI +is down too (it's a cluster workload). For that case use the cold path: +- `ssh -p 52222 root@<wan>` → `qm stop 102 && qm start 102` (`docs/runbooks/breakglass-ssh.md`) +- `server-lifecycle` iDRAC CLI (192.168.1.4) to power-cycle the whole host. + +## Architecture + +``` +browser ─► Cloudflare ─► Traefik ─► auth-proxy (Authentik, basic-auth fallback) + └─► claude-breakglass Service (in-cluster) +claude-breakglass pod (ns claude-breakglass, own SA, NO Vault role): + • app.breakglass.server (FastAPI) serves the Svelte UI + /api + • chat → claude -p --agent breakglass (stream-json → SSE) + • ssh-agent holds the breakglass key (synced by ESO, never on disk) + • ssh devvm → breakglass@10.0.10.10 (full sudo) [diagnose/repair] + • ssh pve <verb> → root@192.168.1.127 forced-command [VM 102 power verbs] +``` + +Image: `forgejo.viktorbarzin.me/viktor/claude-agent-service:latest` (shared with +claude-agent-service; the deployment overrides the command with +`/srv/docker-entrypoint-breakglass.sh`). Code: `claude-agent-service/app/breakglass/`. +Stack: `stacks/claude-breakglass/`. ADR: `claude-agent-service/docs/adr/0001-*`. + +## Auth (how to get in) + +- **Normal:** Authentik SSO (you're already logged in to the SSO). +- **Authentik down:** the auth-proxy falls back to HTTP basic-auth ("Emergency + Access"). Username `admin`; password is the shared `auth_fallback_htpasswd` + (Vault `secret/platform`). This same credential gates every `auth="required"` + app. Rotate: regenerate the htpasswd, `vault kv patch secret/platform + auth_fallback_htpasswd=...`, apply the `traefik` stack (the auth-proxy rolls + on the `checksum/auth-proxy-htpasswd` annotation). + +## The PVE forced-command (the reset path) + +The breakglass SSH key's entry in PVE `/root/.ssh/authorized_keys` is pinned to +`command="/usr/local/bin/breakglass-pve",restrict,from="192.168.1.2"`. 
It only +accepts the bare verbs **`status | forensics | reset | stop | start | cycle`** +against VM 102 — anything else is rejected and logged to +`/var/log/breakglass-pve.log`. Every mutating verb captures forensics first. + +- **cycle** = stop→start (fresh QEMU, applies staged config) — the fix for a + QEMU I/O stall (2026-06-11). If a clean stop fails, it kills the wedged QEMU + PID then starts. **Prefer `cycle` over `reset` for a wedged VM.** +- `reset` is a warm reset (reuses QEMU) — only for a normal guest hang. + +Script source: `stacks/claude-breakglass/files/breakglass-pve` (deploy via +`scp … root@192.168.1.127:/usr/local/bin/breakglass-pve`). + +## NAT quirks (why `from=` differs per host) + +Discovered during bring-up — both verified from a real in-cluster pod: +- **pod → PVE (192.168.1.127):** pfSense SNATs inter-VLAN traffic to its + `192.168.1.2` interface, so PVE sees `192.168.1.2` for ALL cluster (and devvm) + SSH. Hence the PVE key uses `from="192.168.1.2"`. The devvm itself is NOT a + permitted source (it's the box being recovered). +- **pod → devvm (10.0.10.10):** the devvm sees the Calico-SNAT **node IP** + (10.0.20.0/24). Hence the devvm key uses `from="10.0.20.0/24"`. + +## Host bootstrap (one-time; redo on devvm rebuild / key rotation) + +The keypair lives in Vault `secret/claude-breakglass/ssh_key` +(`private_key`/`public_key`). 
To re-provision after a rebuild: + +```bash +PUB=$(vault kv get -field=public_key secret/claude-breakglass/ssh_key) + +# devvm (full-sudo recovery user): +sudo useradd -m -s /bin/bash breakglass 2>/dev/null || true +sudo install -d -m700 -o breakglass -g breakglass /home/breakglass/.ssh +printf 'from="10.0.20.0/24" %s\n' "$PUB" | sudo tee /home/breakglass/.ssh/authorized_keys +sudo chown breakglass:breakglass /home/breakglass/.ssh/authorized_keys +sudo chmod 600 /home/breakglass/.ssh/authorized_keys +echo 'breakglass ALL=(ALL) NOPASSWD:ALL' | sudo tee /etc/sudoers.d/breakglass && sudo chmod 440 /etc/sudoers.d/breakglass + +# PVE (forced-command power verbs): +scp stacks/claude-breakglass/files/breakglass-pve root@192.168.1.127:/usr/local/bin/breakglass-pve +ssh root@192.168.1.127 chmod 0755 /usr/local/bin/breakglass-pve +# then append to /root/.ssh/authorized_keys on PVE: +# command="/usr/local/bin/breakglass-pve",restrict,from="192.168.1.2" <PUB> +``` + +Host-key checking is OFF in the pod's ssh config (a devvm rebuild rotates the +host key; strict checking would lock the breakglass out mid-incident — trusted +internal LAN, key auth stands). + +## Verify + +```bash +kubectl -n claude-breakglass get pods # Running +kubectl -n claude-breakglass logs deploy/claude-breakglass | grep -i ssh-add +curl -sk https://breakglass.viktorbarzin.me/health # (through the edge) +# from a pod, the PVE path: ssh pve status → "status: running" +``` + +## Isolation (why a separate deployment) + +The shared `claude-agent` pod runs agents that ingest untrusted input +(recruiter emails, nextcloud todos) with Bash. Co-locating the root-on-devvm key +there would let a prompt injection exfiltrate it. The breakglass runs in its own +namespace with its own SA and **no Vault role** (ESO syncs only its key); the +`terraform-state` Vault policy is explicitly DENIED `secret/claude-breakglass/*`. 
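
The verb allow-list described under "The PVE forced-command" can be sketched as follows. This is not the contents of `breakglass-pve` (that script is not reproduced in this runbook); the function names are hypothetical, and the split into mutating and read-only verbs is inferred from the text rather than stated by it.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a forced-command verb check. A real forced command
# would read the requested verb from SSH_ORIGINAL_COMMAND.

# Return 0 only if $1 is exactly one of the documented bare verbs.
is_allowed_verb() {
  case "$1" in
    status|forensics|reset|stop|start|cycle) return 0 ;;
    *) return 1 ;;
  esac
}

# Assumed: every verb except status/forensics changes VM state and would
# therefore capture forensics first.
is_mutating_verb() {
  case "$1" in
    reset|stop|start|cycle) return 0 ;;
    *) return 1 ;;
  esac
}
```

Matching the whole string against a fixed list means a request such as `cycle; rm -rf /` fails as a unit instead of being partially honoured.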
diff --git a/docs/runbooks/chrome-service-snapshot.md b/docs/runbooks/chrome-service-snapshot.md index ab065503..2ebc565f 100644 --- a/docs/runbooks/chrome-service-snapshot.md +++ b/docs/runbooks/chrome-service-snapshot.md @@ -11,8 +11,36 @@ external Claude Code sessions on the dev box. Architecture in | chrome-service Deployment | `chrome-service` ns | always-on | headed chromium, CDP :9222, persistent /profile/chromium-data | | snapshot-server sidecar | same pod | always-on | serves `/api/snapshot`, bearer-gated, port 8088 | | snapshot-harvester CronJob | `chrome-service` ns | `23 * * * *` | dumps `storage_state()` via CDP → `/profile/snapshots/storage-state.json` | -| dev-box refresh timer | each dev box | hourly | curls `chrome.viktorbarzin.me/api/snapshot` → `~/.cache/playwright-shared-storage-state.json` | -| dev-box `playwright-mcp.service` | each dev box | always-on | `@playwright/mcp --isolated --storage-state=…` per-MCP-connection contexts | +| dev-box refresh timer | each dev box, per OS user | hourly (`*:28`) | `playwright-snapshot-refresh@<user>.timer` curls `chrome.viktorbarzin.me/api/snapshot` → `~/.cache/playwright-shared-storage-state.json` | +| dev-box `playwright-mcp@<user>.service` | each dev box, per OS user | always-on | pinned `@playwright/mcp@<ver> --isolated --storage-state=…` on the user's `PLAYWRIGHT_PORT`; per-MCP-connection (per-session) contexts | + +## Provisioning (reproducible from git) + +The dev-box side is **per-OS-user** and fully reproducible — no hand-setup. +Each user gets their own isolated `@playwright/mcp` server (multiple concurrent +Claude sessions per user, isolated by `--isolated`), wired into their Claude in +**every directory** via a user-scope `~/.claude.json` entry +(`playwright → http://localhost:<PLAYWRIGHT_PORT>/mcp`). 
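
To check which URL a given user's Claude should be pointed at, the port can be read back from that user's env file. A minimal sketch, assuming the file at `/etc/t3-serve/playwright-<user>.env` contains a line of the form `PLAYWRIGHT_PORT=<number>`; `playwright_mcp_url` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the per-user MCP URL from an env file.
# Assumes a line of the form PLAYWRIGHT_PORT=<number> in the file at $1.
playwright_mcp_url() {
  local env_file=$1 port
  port=$(sed -n 's/^PLAYWRIGHT_PORT=\([0-9][0-9]*\)$/\1/p' "$env_file" | head -n 1)
  if [ -z "$port" ]; then
    echo "no PLAYWRIGHT_PORT in $env_file" >&2
    return 1
  fi
  printf 'http://localhost:%s/mcp\n' "$port"
}
```

Comparing this output with the `playwright` entry in the user's `~/.claude.json` is a quick way to spot a port mismatch.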
+ +- **System-level template units** (NOT `systemd --user`, so no linger needed): + `playwright-mcp@.service` + `playwright-snapshot-refresh@.{service,timer}`, + sourced from `infra/scripts/workstation/playwright/`, installed to + `/etc/systemd/system/` by `setup-devvm.sh` (§9e). `User=%i`; per-user + `PLAYWRIGHT_PORT` from `/etc/t3-serve/playwright-<user>.env`. +- **Port allocation**: `roster_engine.py` (`PLAYWRIGHT_BASE_PORT=8931`, sticky) + — emitted in the derive JSON, written per-user by `t3-provision-users.sh` (§5c). +- **Snapshot token**: `setup-devvm.sh` (§8c) stages Vault + `secret/chrome-service` `api_bearer_token` → root file + `/etc/t3-serve/chrome-service-token`; the provisioner copies it (if-absent, + 0600) to each user's `~/.config/playwright/token` (the hourly root reconcile + has no Vault token, hence the staging — mirrors the Claude OAuth token in §8a). +- **MCP wiring + enablement**: `t3-provision-users.sh` `install_playwright()` runs + `claude mcp add --scope user … playwright` AS the user (clobber-proof, if-absent) + and `systemctl enable --now` the system instances. Idempotent; never restarts a + running instance or rewrites an existing `~/.claude.json` entry. +- **Pinned version**: bump `@playwright/mcp@<ver>` in + `scripts/workstation/playwright/playwright-mcp@.service` (the `@latest` → + silent-fleet-roll footgun is why; see the `T3_PIN` rationale in `setup-devvm.sh`). ## Day-to-day @@ -43,14 +71,14 @@ Expected: `wrote snapshot (… bytes) to /profile/snapshots/storage-state.json`. 
### Trigger dev-box refresh manually ```bash -# On the dev box, as the user whose Claude Code sessions need the new state: -systemctl --user start playwright-snapshot-refresh.service +# On the dev box, refresh a specific user's snapshot (system template instance): +sudo systemctl start playwright-snapshot-refresh@<user>.service -# Or directly: -/usr/local/bin/playwright-snapshot-refresh +# Or run the script directly AS that user: +sudo -u <user> /usr/local/bin/playwright-snapshot-refresh # Verify -ls -la ~/.cache/playwright-shared-storage-state.json +sudo ls -la /home/<user>/.cache/playwright-shared-storage-state.json ``` ### Inspect the current snapshot @@ -108,12 +136,14 @@ The bearer token in `~/.config/playwright/token` doesn't match the server's. Almost always means the Vault secret was rotated and the local cache is stale. -**Fix**: +**Fix** (re-stage centrally so a rebuild stays correct, then re-copy to the user): ```bash vault login -method=oidc # if needed -vault kv get -field=api_bearer_token secret/chrome-service > ~/.config/playwright/token -chmod 600 ~/.config/playwright/token -systemctl --user start playwright-snapshot-refresh.service +sudo install -m 0600 <(vault kv get -field=api_bearer_token secret/chrome-service) \ + /etc/t3-serve/chrome-service-token +sudo install -o <user> -g <user> -m 0600 \ + /etc/t3-serve/chrome-service-token /home/<user>/.config/playwright/token +sudo systemctl start playwright-snapshot-refresh@<user>.service ``` ### Dev-box `playwright-snapshot-refresh` returns 404 with "snapshot not yet available" @@ -129,9 +159,9 @@ new context with it. **Existing MCP sessions don't hot-reload** — they keep the cookies they were seeded with at session start. New sessions get the fresh snapshot. 
-**Fix**: restart the MCP server on the dev box to pick up the new file: +**Fix**: restart the user's MCP server on the dev box to pick up the new file: ```bash -systemctl --user restart playwright-mcp.service +sudo systemctl restart playwright-mcp@<user>.service ``` ### Snapshot file is suspiciously small or empty cookies array @@ -158,13 +188,18 @@ vault kv put secret/chrome-service \ # Reloader auto-restarts chrome-service pod (snapshot-server picks up new token). -# On EVERY dev box that pulls the snapshot: -vault kv get -field=api_bearer_token secret/chrome-service > ~/.config/playwright/token -chmod 600 ~/.config/playwright/token +# On EVERY dev box: re-stage the root file, then overwrite each user's copy +# (the provisioner's per-user copy is if-absent, so a ROTATION must overwrite). +sudo install -m 0600 <(vault kv get -field=api_bearer_token secret/chrome-service) \ + /etc/t3-serve/chrome-service-token +for u in $(ls /etc/t3-serve/playwright-*.env 2>/dev/null | sed 's#.*/playwright-##;s#\.env##'); do + sudo install -o "$u" -g "$u" -m 0600 \ + /etc/t3-serve/chrome-service-token /home/"$u"/.config/playwright/token +done -# Verify the next refresh succeeds: -systemctl --user start playwright-snapshot-refresh.service -journalctl --user -u playwright-snapshot-refresh.service -n 20 +# Verify the next refresh succeeds for a user: +sudo systemctl start playwright-snapshot-refresh@<user>.service +sudo journalctl -u playwright-snapshot-refresh@<user>.service -n 20 ``` ## Restore from a backup tarball diff --git a/docs/runbooks/claude-auth-renew-workstation.md b/docs/runbooks/claude-auth-renew-workstation.md new file mode 100644 index 00000000..f5ce6625 --- /dev/null +++ b/docs/runbooks/claude-auth-renew-workstation.md @@ -0,0 +1,95 @@ +# Workstation Claude authentication renewal + +## Scope + +Every roster user authenticates Claude Code with their own Enterprise identity. +Credentials are never shared between OS users. 
Claude refreshes its normal OAuth +access token; `claude-auth-sync@<user>.timer` verifies that refresh using real +inference every six hours and backs up only the `claudeAiOauth` object to: + +```text +secret/workstation/claude-users/<os-user> +``` + +The user's unrelated `mcpOAuth` credentials never leave their home directory. +Each renewal service has a distinct 32-day periodic Vault token, mode `0600`, at +`~/.config/claude-auth-sync/vault-token`. Its policy can access only that user's +path. The service renews the Vault token on every run. + +## Normal lifecycle + +1. Add the user to `scripts/workstation/roster.yaml` and apply the Vault stack. +2. Run `scripts/workstation/setup-devvm.sh` as root with the admin Vault token. + Its foreground provisioner mints the isolated periodic token and enables the + user's timer. Routine hourly provisioning never needs an admin token. +3. The user completes one initial Enterprise login: + + ```bash + claude auth login --claudeai --sso --email <enterprise-email> + ``` + +4. Start the first sync immediately instead of waiting for the timer: + + ```bash + systemctl start claude-auth-sync@<os-user>.service + systemctl status claude-auth-sync@<os-user>.service + ``` + +Success writes no secrets to the journal. The user's private log records `OK` in +`~/.local/state/claude-auth-sync/sync.log`; journald receives the same status with +`identifier=claude-auth-sync` for Loki alerting. + +## Automatic recovery + +`claude auth status` is not a sufficient health check: it can report logged in +while inference returns HTTP 401. The service therefore runs a minimal Haiku +inference with no session persistence. On failure it: + +1. reads the user's latest OAuth object from Vault; +2. atomically merges it into `.credentials.json`, preserving MCP OAuth state; +3. retries inference once; +4. stores the newly refreshed OAuth object back in Vault on success. 
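Recovery steps 1 and 2 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the service's actual code: `merge_oauth` is a hypothetical name, and the sketch assumes `.credentials.json` is a JSON object whose top-level keys include `claudeAiOauth` and `mcpOAuth`, as the Scope section implies.

```python
import json
import os
import tempfile


def merge_oauth(credentials_path: str, vault_oauth: dict) -> None:
    """Replace only the claudeAiOauth object; keep every other key (e.g. mcpOAuth).

    Hypothetical helper: the real service may be structured differently.
    """
    try:
        with open(credentials_path, encoding="utf-8") as fh:
            current = json.load(fh)
    except FileNotFoundError:
        current = {}

    current["claudeAiOauth"] = vault_oauth

    # Write a temp file in the same directory, then rename it over the target,
    # so a reader never observes a half-written credentials file.
    directory = os.path.dirname(os.path.abspath(credentials_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".credentials.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(current, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.chmod(tmp_path, 0o600)
        os.replace(tmp_path, credentials_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The rename is what makes the merge atomic: `os.replace` swaps the file in one step on the same filesystem, which is why the temp file is created next to the target rather than in `/tmp`.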
+ +Vault KV version history remains available for audit, but the service deliberately +does not cycle through old refresh tokens: providers commonly invalidate rotated +refresh tokens, so replaying old versions can make recovery less deterministic. + +## Recovery requiring a person + +If both local state and the latest Vault copy fail, the refresh token was revoked, +invalidated, or the Enterprise session requires reauthorization. Run the login as +the affected OS user, then rerun the service: + +```bash +claude auth login --claudeai --sso --email <enterprise-email> +systemctl start claude-auth-sync@$(id -un).service +``` + +If the scoped Vault token expired or drift protection rejected it, rerun the root +provisioner with an admin Vault token after confirming the matching policy exists: + +```bash +export VAULT_ADDR=https://vault.viktorbarzin.me +export VAULT_TOKEN="$(cat /home/wizard/.vault-token)" +sudo --preserve-env=VAULT_ADDR,VAULT_TOKEN /usr/local/bin/t3-provision-users +``` + +Never copy another user's `.credentials.json` or scoped Vault token. Never restore +the old shared `CLAUDE_CODE_OAUTH_TOKEN`; environment credentials outrank per-user +login and would silently collapse all users onto one identity. + +## Verification + +```bash +systemctl list-timers 'claude-auth-sync@*' +systemctl status claude-auth-sync@<os-user>.service +journalctl -t claude-auth-sync --since today +``` + +Inspect Vault metadata, not secret values: + +```bash +vault kv metadata get secret/workstation/claude-users/<os-user> +``` + +Alert `WorkstationClaudeAuthInvalid` fires when any renewal agent logs `FAIL`. diff --git a/docs/runbooks/fan-control.md b/docs/runbooks/fan-control.md index 390c349a..d243b9fc 100644 --- a/docs/runbooks/fan-control.md +++ b/docs/runbooks/fan-control.md @@ -1,122 +1,103 @@ -# Runbook — PVE R730 fan-control daemon +# Runbook — PVE R730 fan control -Presence-aware IPMI fan controller on the PVE host (192.168.1.127). 
Runs the -CPU cool when the garage is empty, quiet when someone's in the garage. Design: -`infra/docs/plans/2026-06-04-pve-fan-control-design.md`. +**The control logic lives in Home Assistant; the PVE host runs only a thin +actuator.** HA computes the fan setpoint from the CPU temperature and the +dashboard inputs and publishes ONE number, `sensor.r730_fan_command_pct`. The +host daemon reads that number each loop and applies it over IPMI — it does **no** +math. Design + history: `infra/docs/plans/2026-06-04-pve-fan-control-design.md`. + +> **History:** (1) 2026-06-04/05 presence-aware two-curve controller (COOL/QUIET +> by garage door). (2) 2026-06-07 single linear curve, presence removed. +> (3) 2026-06-08 **all control moved into HA**, host became a thin actuator, +> additive **bias** replaced the ease-down hysteresis. (4) 2026-06-15 daemon +> **anti-flap**: holds the last command through transient HA losses instead of +> dumping to Dell auto. ## What it is -- `/usr/local/bin/fan-control` — bash daemon (source: `infra/scripts/fan-control.sh`). +- **HA (brain), on ha-sofia — NOT in this repo:** the `input_number` sliders, the + command template sensor, the display/equilibrium sensors, the Lock/Override + controls, and the dashboard cards. Auto-git-tracked on ha-sofia by the + version-control add-on. +- `/usr/local/bin/fan-control` — bash **actuator** (source: `infra/scripts/fan-control.sh`). - `fan-control.service` — systemd unit (`Type=simple`, restarts on failure). - `/etc/fan-control.env` — config incl. the ha-sofia token (chmod 600, not in git). -## HA control (Home Assistant) +## HA brain — where the curve lives (dashboard-it → "Server" view → Fans) -Drive the fans from **dashboard-it → "Server" view → Fans**. 
The view is -deliberately minimal — it shows the current **fan speed** (% of capacity + -absolute RPM) and two controls: +`sensor.r730_fan_command_pct` (template) computes: +`command% = clamp( curve(temp) + bias, 0..100 )`, where `curve(temp)` is a linear +ramp from `(Temp min, Duty min)` to `(Temp max, Duty max)` over +`sensor.r730_cpu_temperature`, plus an **asymmetric output deadband** (rise +immediately; ease down only once it would drop ≥ Hysteresis). When **Lock** is +on it outputs the Override % directly. -- **Override %** (`input_number.r730_fan_manual_pct`) — the fan % to hold. While - **unlocked** it continuously mirrors the live commanded fan %, so it always - shows the actual *absolute* speed and updates as the fan moves (NOT a stale - value or a delta) — `automation.r730_fan_override_track_live_speed_while_unlocked` - syncs it to `sensor.r730_fan_control_target` (guarded to ignore - unavailable/unknown). While **locked** it stops tracking and becomes your - editable setpoint. A readout under the slider shows the live `% · rpm`. -- **Lock — freeze speed** (`input_boolean.r730_fan_lock`) — turn the algorithm - off and hold a fixed speed. Toggling it **ON** snapshots the *current* - commanded % into Override and switches the daemon to `manual` - (`automation.r730_fan_lock_freeze_current_speed_resume_algo`); toggling it - **OFF** switches back to `auto`, resuming the presence curve. Fine-tune the - held % with Override while locked. A 🔒 reminder appears on the view while - locked. +**Inputs** (`input_number` sliders): `r730_fan_temp_min`, `r730_fan_temp_max`, +`r730_fan_duty_min`, `r730_fan_duty_max`, `r730_fan_bias` (flat % added on top — +guarantees a floor), `r730_fan_hysteresis` (output deadband %). +Slope = `(Duty max − Duty min)/(Temp max − Temp min)` — steeper/higher-bias/lower-Temp-min +⇒ lower steady-state CPU temp (it's a P controller; the curve sets the equilibrium). 
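The command computation described above can be sketched in Python. This is an illustration only; the real logic is a Home Assistant template sensor. The function names are hypothetical, and the sketch assumes the ramp is flat outside the `Temp min`..`Temp max` range and that the deadband is applied after the clamp; neither detail is confirmed here.

```python
from typing import Optional


def curve(temp: float, temp_min: float, temp_max: float,
          duty_min: float, duty_max: float) -> float:
    """Linear ramp from (temp_min, duty_min) to (temp_max, duty_max)."""
    if temp <= temp_min:
        return duty_min  # assumption: flat below the lower anchor
    if temp >= temp_max:
        return duty_max  # assumption: flat above the upper anchor
    slope = (duty_max - duty_min) / (temp_max - temp_min)
    return duty_min + slope * (temp - temp_min)


def command_pct(temp: float, last_cmd: Optional[float], *,
                temp_min: float, temp_max: float,
                duty_min: float, duty_max: float,
                bias: float, hysteresis: float,
                lock: bool = False, override_pct: float = 0.0) -> float:
    """command% = clamp(curve(temp) + bias, 0..100) with an asymmetric deadband."""
    if lock:
        return override_pct  # Lock on: output the Override % directly
    target = max(0.0, min(100.0,
                          curve(temp, temp_min, temp_max, duty_min, duty_max) + bias))
    if last_cmd is None or target >= last_cmd:
        return target  # rise immediately
    if last_cmd - target >= hysteresis:
        return target  # ease down only once the drop clears the deadband
    return last_cmd  # otherwise hold the previous command
```

With illustrative anchors of 50 °C/20 % and 80 °C/80 %, a bias of 5 and a hysteresis of 5, a reading of 65 °C targets 55 %. A previous command of 57 % is held (the drop is only 2 points), while a previous command of 70 % eases down to 55 %.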
-Under the hood the daemon still reads `input_select.r730_fan_mode` -(auto/cool/quiet/manual) + `input_number.r730_fan_manual_pct` each loop; the Lock -toggle just drives `mode` between `manual` (locked) and `auto` (unlocked). -`cool`/`quiet` remain valid modes if set directly (via the entity) but are no -longer surfaced on the simplified dashboard. `CEILING` (83 °C) still overrides -everything → Dell auto, **even when locked**. A stale non-`auto` mode left while -*unlocked* still auto-reverts to `auto` after 60 min -(`automation.r730_fan_mode_auto_revert`, now a dormant safety net). An HA change -is applied within one daemon loop (~15 s). +**Manual override:** `input_boolean.r730_fan_lock` (Lock — freeze) + `input_number.r730_fan_manual_pct` (Override %). -Monitoring sensors on the same view: `sensor.r730_fan_speed` (redfish exporter), -`sensor.r730_fan_control_target` + `sensor.r730_fan_control_mode` + -`sensor.r730_fan_power_est` (Pushgateway). Fan **% and RPM are merged into one -"Fan speed" card** (the two had identical trend shapes) — the % trend comes from -the stable Pushgateway sensor, while RPM reads `sensor.r730_fan_speed` but **falls -back to a calibrated estimate (shown with a `~` prefix) whenever the Redfish -sensor is `unavailable`** (it blips out intermittently), so the readout never goes -blank. `r730_fan_power_est` is an ESTIMATE of -total fan power (the iDRAC reports no per-fan power) — modelled from RPM via the -fan affinity law (∝ RPM³), calibrated to the power sweep (~2 W floor → ~99 W full). +**Readout sensors:** `sensor.r730_fan_command_display` ("Fan set point", "X % (Y rpm)"), +`sensor.r730_expected_equilibrium_temp` (predicted equilibrium at current load), +`sensor.r730_cpu_load`, `sensor.r730_fan_speed_avg` (mean of 6 fans), +`sensor.r730_fan_power_avg` (cube-law estimate). 
The Prometheus-backed REST +sensors live in `rest_resources/idrac_redfish_exporter.yaml` on ha-sofia and have +value-template fallbacks so they don't blink `unavailable` on a transient empty. -The HA objects (helpers, the auto-revert automation, the REST sensors in -`rest_resources/{idrac_redfish_exporter,fan_control}.yaml`, and the dashboard -cards) live on **ha-sofia** and are auto-git-tracked there by the version-control -add-on — they are NOT in this repo. +## Actuator (host) — what the daemon does + +Loop every ~15 s, using only the existing IPMI + HA-REST methods: +1. read `command%` from HA (`/api/states/$COMMAND_ENTITY`), validate (numeric + not stale > `STALE_SECS`); +2. apply it via `ipmitool raw 0x30 0x30 0x02 0xff 0x<NN>` (writes only if the change clears `MIN_STEP`); +3. read CPU temp + fan rpm for safety + telemetry (Pushgateway). + +**Anti-flap:** on a missing/stale command it **holds the last applied %** for up +to `HA_GRACE_SECS` (300 s) instead of falling back; only sustained loss hands the +fans to Dell auto. + +## Safety (on the host, independent of HA) +`CPU ≥ CEILING (83 °C)`, repeated IPMI failures, sustained HA loss, or daemon +stop/crash → hand the fans back to **Dell auto** (`raw 0x30 0x30 0x01 0x01`; +EXIT trap + systemd `ExecStopPost`). The 83 °C ceiling uses the daemon's own +IPMI temp read, so it protects even if HA is wrong/unreachable. ## Quick status - ```bash ssh root@192.168.1.127 systemctl status fan-control ssh root@192.168.1.127 'journalctl -u fan-control -n 30 --no-pager' -ssh root@192.168.1.127 'ipmitool sdr type fan | grep ^Fan1; ipmitool sdr type temperature | grep "^Temp "' ``` -Log lines look like `temp=60C ha_mode=auto eff=cool fan=50% (was 70%)` -(`ha_mode` = the HA setpoint; `eff` = the effective curve applied). 
- -## Disable / roll back to stock firmware control - -```bash -ssh root@192.168.1.127 'systemctl disable --now fan-control && ipmitool raw 0x30 0x30 0x01 0x01' -``` -The unit's `ExecStopPost` already restores Dell auto on stop, so the explicit -`raw ... 0x01` is belt-and-suspenders. The box is back to its stock curve. +Log line: `temp=64C cmd=49% rpm=9380 (was -1%)` (`cmd` = the % read from HA and +applied). `HA command miss — holding 49%` = a transient HA blip being ridden out; +`HA command lost (...) — Dell auto` = sustained loss. ## Tune +The whole curve (anchors + bias + hysteresis) is tuned **live from the HA +dashboard** — no host access needed. `/etc/fan-control.env` only holds the +actuator plumbing + safety knobs (`COMMAND_ENTITY`, `STALE_SECS`, `HA_GRACE_SECS`, +`MIN_STEP`, `CEILING`); edit it then `systemctl restart fan-control`. -Edit `/etc/fan-control.env` on the host, then `systemctl restart fan-control`. -Common knobs: -- `HOLD_SECS` — how long to stay quiet after the garage door last moved (default 900 = 15 min). -- `CEILING` — temp at which we abandon manual control and let the firmware take over (default 83). -- Curve shape: **linear anchors** near the top of the script — `COOL_T_LO/COOL_P_LO/COOL_T_HI/COOL_P_HI` (default 50°C/30% → 83°C/100%) and `QUIET_*` (68°C/20% → 83°C/100%); fan% interpolates linearly between them (replaced the old discrete step-bands). `MIN_STEP` (default 3%) = smallest fan-% change worth an IPMI write (anti-jitter); `DEADBAND` (3°C) = ease-down hysteresis. Lower `COOL_P_HI` or raise `COOL_T_HI` to run the top end quieter; steepen by raising `COOL_P_LO` / lowering `COOL_T_LO`. 
- -## Deploy / update - +## Deploy / update (daemon source) ```bash -cd infra -scp scripts/fan-control.sh root@192.168.1.127:/usr/local/bin/fan-control -ssh root@192.168.1.127 chmod +x /usr/local/bin/fan-control -scp scripts/fan-control.service root@192.168.1.127:/etc/systemd/system/fan-control.service -# first install only — create /etc/fan-control.env from fan-control.env.example with the HA token -ssh root@192.168.1.127 'systemctl daemon-reload && systemctl restart fan-control' +scp -i ~/.ssh/pve_root scripts/fan-control.sh root@192.168.1.127:/tmp/fan-control.new +ssh -i ~/.ssh/pve_root root@192.168.1.127 'install -m0755 /tmp/fan-control.new /usr/local/bin/fan-control && systemctl restart fan-control' ``` - -## HA token - -`/etc/fan-control.env` holds a long-lived ha-sofia token used to read -`sensor.garage_door_state_bg`. Mint via Home Assistant → Profile → Security → -Long-lived access tokens, or reuse the existing ha-sofia token. If the token is -missing/empty, the daemon still runs but **COOL-only** (no quiet mode) and logs -`ha_reachable=0`. +(`fan-control.service` only on a unit change → also `systemctl daemon-reload`.) ## Symptoms & checks - | Symptom | Check | |---------|-------| -| Fans stuck loud | `journalctl -u fan-control` — is `mode=fallback`? (ceiling breach or IPMI fail). Check CPU temp. | -| Never goes quiet | Token valid? `curl -H "Authorization: Bearer $TOKEN" http://192.168.1.8:8123/api/states/sensor.garage_door_state_bg`. Garage door reporting? | -| Fans flapping | Increase `DEADBAND`. | -| Service won't start | `systemctl status fan-control`; check `ipmitool` works: `ipmitool sdr type temperature`. | +| Fans surge then crash to ~7100 then surge | flapping to Dell auto — `journalctl -u fan-control \| grep -E 'holding\|Dell auto'`; pre-2026-06-15 this was the stale-command bug (now fixed). | +| Fans stuck loud | `journalctl` — `CEILING` breach or `HA command lost`? Check CPU temp + HA reachability. 
| +| A readout blinks `unavailable` | the REST value-template fallback should hold it; a 1×/8h blip at ~02:00 (backup window) is a benign fetch hiccup. | +| Slider changes ignored | does `sensor.r730_fan_command_pct` change in HA? token valid? | | Box left in manual after crash | `ipmitool raw 0x30 0x30 0x01 0x01` to force Dell auto. | -## Verify presence wiring - +## Verify wiring ```bash -# one iteration, real IPMI + HA, no daemon loop: -ssh root@192.168.1.127 'set -a; . /etc/fan-control.env; set +a; RUN_ONCE=1 /usr/local/bin/fan-control' +ssh -i ~/.ssh/pve_root root@192.168.1.127 'set -a; . /etc/fan-control.env; set +a; RUN_ONCE=1 /usr/local/bin/fan-control' ``` -With the garage closed for >15 min you should see `mode=cool`; within 15 min of -the door moving, `mode=quiet`. +The log `cmd=%` should equal `sensor.r730_fan_command_pct`. Move a slider so the +HA sensor changes, re-run, and the applied `cmd=%` should follow. diff --git a/docs/runbooks/forgejo-open-signups.md b/docs/runbooks/forgejo-open-signups.md new file mode 100644 index 00000000..5a00d15a --- /dev/null +++ b/docs/runbooks/forgejo-open-signups.md @@ -0,0 +1,168 @@ +# Runbook: Forgejo open self-service signups + +Last updated: 2026-06-19 + +`forgejo.viktorbarzin.me` allows **open native self-registration** (anyone can +create a local Forgejo account from the web form), gated against bots by two +layers: + +1. **Cloudflare Turnstile** captcha on the registration form. +2. **Mandatory email confirmation** — a new account stays inactive until the + user clicks an activation link emailed to the address they registered with. + +One external login source also works alongside local accounts: the pre-existing +**Sign in with GitHub** OAuth2 login (the **Authentik OAuth2 source is now DISABLED**; see the GitHub +section below). Opening local signups was additive — it did not touch SSO. + +Most of this is Terraform-managed in `stacks/forgejo/`.
The one exception is the +OAuth2 login *sources* (Authentik, GitHub), which live in Forgejo's own DB and +are added via `forgejo admin auth` — there is no clean Terraform resource for +them (their secrets are mirrored to Vault for recovery). + +## What is configured (and where) + +All on the `kubernetes_deployment.forgejo` container env in +`stacks/forgejo/main.tf` (Forgejo reads `app.ini` keys from `FORGEJO__<section>__<KEY>` +env vars): + +| Setting | Value | Effect | +|---|---|---| +| `service.DISABLE_REGISTRATION` | `false` | Registration is enabled | +| `service.ALLOW_ONLY_EXTERNAL_REGISTRATION` | `false` | Native local sign-up allowed (was `true` = OAuth-only) | +| `service.ENABLE_CAPTCHA` | `true` | Captcha required on the signup form | +| `service.CAPTCHA_TYPE` | `cfturnstile` | Cloudflare Turnstile | +| `service.CF_TURNSTILE_SITEKEY` | widget id | Public; rendered in the page | +| `service.CF_TURNSTILE_SECRET` | from `forgejo-turnstile` Secret | Server-side verification | +| `service.REGISTER_EMAIL_CONFIRM` | `true` | Account inactive until email is confirmed | +| `mailer.*` | see below | Sends the activation email | +| `oauth2_client.ENABLE_AUTO_REGISTRATION` | `true` | First GitHub (OAuth2) sign-in auto-creates the account | + +Captcha guards **registration only** — `REQUIRE_CAPTCHA_FOR_LOGIN` is left at the +default `false`, so existing users are not captcha'd on every login. + +## Cloudflare Turnstile widget — `turnstile.tf` + +- The widget is a Terraform resource: `cloudflare_turnstile_widget.forgejo_signup` + (mode `managed`, domain `forgejo.viktorbarzin.me`), created with the CF Global + API Key already wired in `cloudflare_provider.tf`. The account id is resolved + via `data.cloudflare_accounts`. +- `.id` is the **public sitekey** (passed as a plain env value). `.secret` is the + **secret key**, stored in the `forgejo-turnstile` K8s Secret and injected via + `secret_key_ref`. 
The secret also lives in TF state (Tier-1 PG, encrypted at + rest) — same trust level as the CF API key already in state. +- Forgejo is **non-proxied** (direct A record to Traefik), but Turnstile is a + client-side JS widget served from `challenges.cloudflare.com`, so proxy status + is irrelevant — the widget works regardless. + +**Rotate the widget secret** (e.g. if it leaks): +``` +cd stacks/forgejo && vault login -method=oidc +../../scripts/tg apply --non-interactive -replace=cloudflare_turnstile_widget.forgejo_signup +``` +This mints a new sitekey+secret, updates the `forgejo-turnstile` Secret, and (via +the Reloader annotation) rolls the Forgejo pod. Verify the new sitekey appears in +the `/user/sign_up` HTML afterwards. + +## Mailer — `email-secret.tf` + `[mailer]` env + +- Forgejo sends as **`noreply@viktorbarzin.me`** via **`mail.viktorbarzin.me:587`** + with `PROTOCOL=smtp+starttls`. This reuses the same mailserver SASL account + Authentik uses (`stacks/authentik/email-secret.tf`) — one credential, one + rotation point. +- **The host MUST be `mail.viktorbarzin.me`, not `mailserver.mailserver.svc`**: + the mailserver serves the `*.viktorbarzin.me` wildcard cert, which does not + cover the `.svc` DNS name, so STARTTLS cert verification would fail. + `mail.viktorbarzin.me` resolves in-cluster (→ `10.0.20.1`) and matches the cert. +- The password is synced from Vault `secret/authentik` → `smtp_password` by the + `forgejo-email` ExternalSecret (ESO `ClusterSecretStore vault-kv`) into the + `forgejo-email` K8s Secret (key `PASSWD`), referenced by `FORGEJO__mailer__PASSWD`. +- The deployment carries `reloader.stakater.com/auto: "true"`, so a rotation of + either secret rolls the pod automatically. + +## GitHub sign-in (OAuth2 source) + +People can **sign up / sign in with GitHub** — the active Forgejo OAuth2 source. GitHub sign-up is **zero-click** (auto-registration creates the account on first login). + +> **Authentik is DISABLED on purpose** (2026-06-19). 
`ENABLE_AUTO_REGISTRATION` is GLOBAL across OAuth sources, and Authentik's `preferred_username` claim is the user's **email** — invalid as a Forgejo username, which 500'd auto-create. Viktor's Forgejo email (`me@viktorbarzin.me`) does not match his Authentik email (`vbarzin@gmail.com`), so account-linking can't bridge it. Per his directive GitHub was prioritised; the Authentik source was deactivated via `UPDATE login_source SET is_active=0 WHERE name='Authentik'` in the forgejo MySQL DB. **Re-enable** with `is_active=1` after fixing Authentik's username claim. + +- **Source** (Forgejo DB, *not* Terraform — added via CLI, same as Authentik): + ``` + forgejo admin auth add-oauth --name github --provider github --key <client-id> --secret <client-secret> + ``` + The source **name must stay `github`** — it is part of the callback URL + (`/user/oauth2/github/callback`) registered on the GitHub side, so renaming it + breaks the callback. `forgejo admin auth list` shows it (ID 2). +- **GitHub OAuth App**: a classic OAuth App under the ViktorBarzin GitHub account + (Settings → Developer settings → OAuth Apps). Homepage + `https://forgejo.viktorbarzin.me`, callback + `https://forgejo.viktorbarzin.me/user/oauth2/github/callback`. GitHub has **no + API to create OAuth Apps** — creating it is a browser-only step. +- **Credentials**: Vault `secret/viktor` → `forgejo_github_oauth_client_id` / + `forgejo_github_oauth_client_secret` (kept for recovery; the live values are in + Forgejo's DB). +- **Auto-registration**: `FORGEJO__oauth2_client__ENABLE_AUTO_REGISTRATION=true` + (`main.tf`) makes a first GitHub login create the account directly. The GitHub + identity is the trust gate for this path — the Turnstile captcha + email + confirmation only apply to the **native** signup form, not OAuth. 
+ +**Rotate the GitHub client secret** — generate a new one in the GitHub OAuth App, then: +``` +vault kv patch secret/viktor forgejo_github_oauth_client_secret=<new> +POD=$(kubectl -n forgejo get pod -l app=forgejo -o jsonpath='{.items[0].metadata.name}') +kubectl -n forgejo exec "$POD" -- su-exec git forgejo admin auth update-oauth --id 2 --secret <new> +``` +(Source id from `forgejo admin auth list`.) + +**Recreate after a Forgejo DB loss**: the source is not in Terraform, so after a +from-scratch restore, re-run the `add-oauth` command above with the Vault creds. + +## Re-closing / tightening signups + +Edit `stacks/forgejo/main.tf` and `scripts/tg apply` (or commit + push — CI +applies): + +- **OAuth-only again** (revert this change): set + `FORGEJO__service__ALLOW_ONLY_EXTERNAL_REGISTRATION` back to `"true"`. +- **No new accounts at all** (admins create them): set + `FORGEJO__service__DISABLE_REGISTRATION` to `"true"`. +- **Require admin approval per signup** (strongest, instead of email confirm): + set `REGISTER_MANUAL_CONFIRM=true` **and** `REGISTER_EMAIL_CONFIRM=false` + (Forgejo makes the two mutually exclusive). New accounts then queue under Site + Administration → Identity & Access → Accounts until an admin activates them. + +## Handling spam / abuse accounts + +A signup that clears Turnstile + email confirmation is still a real, low-privilege +Forgejo user. To deal with abuse: +- **Ban/delete** via Site Administration → Identity & Access → Accounts, or + `forgejo admin user delete --username <name>` inside the pod + (`kubectl -n forgejo exec deploy/forgejo -- forgejo admin user ...`). +- New users get Forgejo defaults (they can create repos/orgs). If abuse warrants, + tighten with `[service].DEFAULT_ALLOW_CREATE_ORGANIZATION=false` and/or + `[repository].MAX_CREATION_LIMIT` (add as env vars; out of scope for the initial + open-signups change). 
+ +## Operational notes + +- The Forgejo deployment is **single-replica with `Recreate` strategy**, so any + config apply briefly restarts the pod (git remote + OCI registry unavailable for + a few seconds). Expected, not an incident. +- The signup page is **not** behind Cloudflare's bot-fight (Forgejo is + non-proxied) — Turnstile + email confirmation are the bot gate. CrowdSec + + Traefik rate limiting still front the host. + +## Verify it's working + +``` +POD=$(kubectl -n forgejo get pod -l app=forgejo -o jsonpath='{.items[0].metadata.name}') +# Env present: +kubectl -n forgejo exec "$POD" -- env | grep -E 'ALLOW_ONLY_EXTERNAL|ENABLE_CAPTCHA|CAPTCHA_TYPE|CF_TURNSTILE_SITEKEY|REGISTER_EMAIL_CONFIRM|mailer__ENABLED' +# Turnstile widget rendered on the form: +kubectl -n forgejo exec "$POD" -- wget -qO- http://localhost:3000/user/sign_up | grep -oE 'cf-turnstile|data-sitekey="[^"]*"' +# Secrets healthy: +kubectl -n forgejo get externalsecret forgejo-email +kubectl -n forgejo get secret forgejo-email forgejo-turnstile +``` +A full real-world check is to register a throwaway account and confirm the +activation email arrives. The mailer transport (server/port/cert/cred) is shared +with Authentik, which is already in production for external user enrollment. diff --git a/docs/runbooks/forgejo-registry-setup.md b/docs/runbooks/forgejo-registry-setup.md index 16637c6d..aace9e76 100644 --- a/docs/runbooks/forgejo-registry-setup.md +++ b/docs/runbooks/forgejo-registry-setup.md @@ -119,8 +119,11 @@ cd infra/stacks/kyverno && scripts/tg apply cd infra/stacks/monitoring && scripts/tg apply cd infra/stacks/forgejo && scripts/tg apply -# Containerd hosts.toml on each existing k8s node — VM cloud-init -# only fires on first boot. +# Resolved routing domain (+ vestigial containerd hosts.toml) on each +# existing k8s node — VM cloud-init only fires on first boot. 
The routing +# domain (~viktorbarzin.me -> Technitium) is what makes pulls hairpin-proof: +# the hosts.toml mirror alone falls back to public DNS (Traefik 404s its +# bare-IP requests, and the registry auth realm is an absolute public URL). infra/scripts/setup-forgejo-containerd-mirror.sh ``` @@ -135,7 +138,11 @@ docker pull alpine:3.20 docker tag alpine:3.20 forgejo.viktorbarzin.me/viktor/smoketest:1 docker push forgejo.viktorbarzin.me/viktor/smoketest:1 -# Pull from a k8s node. +# Per-node pull path: routing domain active + name resolves to the live +# Traefik LB (via Technitium split-horizon zone) + pull works. +ssh wizard@<node> 'resolvectl status | grep -A2 "~viktorbarzin.me"; getent hosts forgejo.viktorbarzin.me' +# Expect: DNS Domain ~viktorbarzin.me on server 10.0.20.201, and +# getent -> the current Traefik LB IP (10.0.20.203 today) ssh wizard@<node> sudo crictl pull forgejo.viktorbarzin.me/viktor/smoketest:1 # Confirm the cluster-wide Secret was synced into a fresh namespace. diff --git a/docs/runbooks/k8s-version-upgrade.md b/docs/runbooks/k8s-version-upgrade.md index 847d2462..021c588f 100644 --- a/docs/runbooks/k8s-version-upgrade.md +++ b/docs/runbooks/k8s-version-upgrade.md @@ -2,21 +2,21 @@ ## Overview -Kubernetes component versions (`kubeadm`/`kubelet`/`kubectl`) on the 5 K8s -VMs are upgraded automatically by a weekly detection CronJob that seeds a -chain of small phase Jobs. Each Job is **pinned to a node that is NOT its +Kubernetes component versions (`kubeadm`/`kubelet`/`kubectl`) on the 7 K8s +nodes (k8s-master + k8s-node1..6) are upgraded automatically by a nightly +detection CronJob that seeds a chain of small phase Jobs. Each Job is **pinned to a node that is NOT its drain target** — so no pod in the chain can preempt itself. 
-The chain (Sun 12:00 UTC weekly): +The chain (23:00 UTC nightly): ``` -detection CronJob → preflight Job → master Job → worker × 4 Jobs → postflight Job +detection CronJob → preflight Job → master Job → one worker Job per worker (enumerated live) → postflight Job ``` This is **independent** of the OS-side `unattended-upgrades + kured` pipeline (see `k8s-node-auto-upgrades.md`). They do not share rollouts. Schedules can overlap (kured runs daily 02:00-06:00 London; detection -here runs Sun 12:00 UTC) — when a kured reboot lands within 24h of the +here runs 23:00 UTC nightly) — when a kured reboot lands within 24h of the Sunday detection, the `RecentNodeReboot` alert in the Upgrade Gates group blocks the version-upgrade preflight, so the chain self-defers to the next Sunday rather than rolling on top of a half-fresh node. @@ -24,7 +24,7 @@ to the next Sunday rather than rolling on top of a half-fresh node. ## Architecture ``` -k8s-version-check CronJob (Sun 12:00 UTC, k8s-upgrade ns, SA: k8s-upgrade-job) +k8s-version-check CronJob (23:00 UTC nightly, k8s-upgrade ns, SA: k8s-upgrade-job) │ kubectl get nodes → running version │ ssh master 'apt-cache madison kubeadm' → latest patch (within current minor) │ HEAD pkgs.k8s.io/.../v<NEXT_MINOR>/deb/Release → next minor available? @@ -36,14 +36,17 @@ envsubst on /template/job-template.yaml | kubectl apply -f - ▼ Job 0 — preflight (pinned: k8s-node1) + ├── compat-gate: addon/API/containerd support for target (else BLOCK+alert) ├── All nodes Ready + no Mem/Disk pressure ├── halt-on-alert (kured-style ignore-list) ├── 24h-quiet baseline (no Ready transitions <24h ago) - ├── kubeadm upgrade plan matches target + ├── kubeadm upgrade plan matches target (skipped when master already at target — partial-resume) + ├── apiserver-OIDC drift check: kubeadm upgrade diff drops --authentication-config? 
→ Slack WARN (recoverable; not a block) + ├── reclaim kubeadm scratch: prune /etc/kubernetes/tmp/kubeadm-backup-* >3d on master (kubeadm leaks ~400MB etcd-db backups) ├── Push k8s_upgrade_in_flight=1, k8s_upgrade_started_timestamp=$(date +%s) ├── Trigger backup-etcd Job, wait, verify snapshot byte count ├── SSH master: containerd skew fix (if master < workers) - ├── SSH all 5 nodes: apt repo URL rewrite (only kind=minor) + ├── SSH all 7 nodes: apt repo URL rewrite (only kind=minor) └── spawn_next → k8s-upgrade-master-<target_version> ▼ @@ -76,14 +79,69 @@ Job 6 — postflight (no pinning) └── Slack: ✅ K8s upgrade complete ``` -**Pin choices summarised:** -- k8s-node1 hosts every Job that drains master or another worker. k8s-node1 - itself is upgraded **last**. -- k8s-master hosts Job 5 (which drains k8s-node1). Job 5's spec includes a - toleration for `node-role.kubernetes.io/control-plane:NoSchedule`. -- If anyone reorders the worker sequence, the pin for Job 5 needs to track - whatever worker is upgraded last. The mapping is in `scripts/upgrade-step.sh` - → the `case "${PHASE}:${TARGET_NODE:-}"` block. +**Pin choices summarised** (dynamic since 2026-06-17 — no hardcoded node list): +- The **master-drain Job** runs on the first worker (the "runner"); since it + drains the control-plane node, it must not run there. +- **Every worker-drain Job** runs on **k8s-master** (already upgraded by then), + with a `node-role.kubernetes.io/control-plane:NoSchedule` toleration — so a + Job never runs on the node it drains (self-preemption invariant). +- The worker set + order come from `kubectl get nodes` at runtime + (`worker_nodes` / `next_pending_worker` in `scripts/upgrade-step.sh`), so + **adding a node needs no change** — the chain upgrades every worker still + off-target, then runs postflight. SSH uses node InternalIPs (no DNS needed). 
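The pinning rule above can be sketched as a small planner. This is an illustration, not the chain's real code: the actual logic is shell in `scripts/upgrade-step.sh`, `pin_plan` is a hypothetical name, and the order of `workers` is simply taken as given (how the real script picks the "first worker" is not shown here).

```python
from typing import List, Tuple


def pin_plan(master: str, workers: List[str]) -> List[Tuple[str, str]]:
    """Return (node_to_drain, node_the_job_runs_on) pairs in chain order."""
    if not workers:
        raise ValueError("need at least one worker to host the master-drain Job")

    # The master-drain Job runs on the first worker (the "runner").
    plan = [(master, workers[0])]
    # Every worker-drain Job runs on the already-upgraded master.
    plan += [(worker, master) for worker in workers]

    # Self-preemption invariant: a Job never runs on the node it drains.
    for drained, runs_on in plan:
        assert drained != runs_on, f"{drained} would drain itself"
    return plan
```

Because the worker list is an input rather than a constant, adding a node only lengthens the plan; the invariant check still holds for every entry.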
+ +### Auto-upgrade compat gate + +The chain now attempts **patch AND minor** upgrades autonomously, but before any +mutation, `phase_preflight` runs `compat-gate.py` **FIRST** and **REFUSES (blocks) +the upgrade** if any of these hold for the detected target: + +- a **critical addon's running version doesn't support the target k8s minor** + (target minor > the highest k8s minor the compat matrix lists for that addon), +- an **in-use deprecated API is removed at/before the target**, measured live + from `apiserver_requested_deprecated_apis` (something is still calling a + group/version that the target k8s drops), or +- a **node's containerd is below the target's floor** (the minimum containerd the + target k8s requires). + +The addon check is **scoped to minor jumps**: a target **at or below the running +k8s minor** (a patch) crosses into no new minor, so the running cluster is itself +proof the installed addons work there; `compat-gate.py` skips the addon ceilings +when `target_minor <= running_minor`. (Without this a conservative ceiling such as +ESO 0.12 → 1.31 would false-block a 1.34.x **patch** on a cluster already running +1.34; fixed 2026-06-20.) The deprecated-API and containerd checks are naturally +inert for a patch (no API removal or containerd floor change occurs inside a minor). + +This is the **"auto-upgrade when we can, halt + alert when we can't"** contract. + +**On a block**, the gate: +- pushes `k8s_upgrade_blocked=1` to Pushgateway (→ the `K8sUpgradeBlocked` + Prometheus alert), +- Slacks the **specific reasons** (which addon/API/node, current vs required), and +- **halts the chain**: it exits **non-fatal** (the upgrade simply isn't safe yet, + this is not a failure). Because the block happens **before any mutation, no + rollback is involved**; nothing was changed. 
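
A minimal sketch of the three block conditions described above, written to show the decision shape rather than to mirror `compat-gate.py`. The function names, argument shapes, and the example API group and containerd floor are all assumptions; only the ESO ceiling of 1.31 comes from this document.

```python
import re
from typing import Dict, List, Tuple


def parse(version: str) -> Tuple[int, int, int]:
    """'v1.34.9' -> (1, 34, 9); missing parts are padded with zeros."""
    nums = [int(n) for n in re.findall(r"\d+", version)]
    major, minor, patch = (nums + [0, 0, 0])[:3]
    return major, minor, patch


def minor_of(version: str) -> Tuple[int, int]:
    """(major, minor) of a version string, e.g. '1.35.0' -> (1, 35)."""
    return parse(version)[:2]


def blockers(
    target: str,
    running: str,
    addon_ceilings: Dict[str, str],     # addon -> highest supported k8s minor
    deprecated_in_use: Dict[str, str],  # in-use API -> k8s minor that removes it
    node_containerd: Dict[str, str],    # node -> running containerd version
    containerd_floor: str,              # minimum containerd for the target
) -> List[str]:
    """Return human-readable block reasons; an empty list means 'proceed'."""
    reasons = []
    t = minor_of(target)
    # Addon ceilings only apply when the target crosses into a new minor.
    if t > minor_of(running):
        for addon, ceiling in sorted(addon_ceilings.items()):
            if t > minor_of(ceiling):
                reasons.append(f"addon {addon}: supports up to {ceiling}, target {target}")
    for api, removed_in in sorted(deprecated_in_use.items()):
        if minor_of(removed_in) <= t:
            reasons.append(f"API {api}: removed in {removed_in}, still in use")
    for node, version in sorted(node_containerd.items()):
        if parse(version) < parse(containerd_floor):
            reasons.append(f"node {node}: containerd {version} < floor {containerd_floor}")
    return reasons
```

With this shape, a 1.34.x patch on a cluster already running 1.34 returns no addon reasons even when a ceiling such as ESO 1.31 is well below the running minor, which is the false-block case the minor-jump scoping avoids.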
+ +**To clear a block**: upgrade the named addon (or migrate the API caller off the +deprecated group/version, or bump containerd on the named node) so the offending +condition no longer holds. The **next nightly run then proceeds automatically** — +no manual chain restart needed. + +The **compat matrix** lives in +`stacks/k8s-version-upgrade/scripts/addon-compat.json` — a map of `addon → highest +supported k8s minor`, populated from each addon's own compatibility docs. **Keep +it current**; the gate reads it on every run. Gate logic: +`stacks/k8s-version-upgrade/scripts/compat-gate.py`. + +> **Both** detector probes against `pkgs.k8s.io` follow the 302 redirect via `-L`: +> the next-minor *availability* probe (`HEAD .../v<NEXT_MINOR>/deb/Release`) **and** +> the next-minor *patch* probe (`GET .../v<NEXT_MINOR>/deb/Packages`, which resolves +> the exact `X.Y.Z`). The Packages probe lacked `-L` until 2026-06-20 — `pkgs.k8s.io` +> 302-redirects every request, so without it curl returned an empty body, +> `NEXT_MINOR_PATCH` came back empty, and the detector silently fell through to +> "No upgrade needed". That is why the **2026-06-19 nightly run no-op'd** instead of +> resolving the 1.35 target. With both probes on `-L`, **minor versions are detected** +> and gated behind the compat check above before the chain acts on them. ## Components @@ -95,7 +153,7 @@ Job 6 — postflight (no pinning) | **ConfigMap `k8s-upgrade-job-template`** | Mounts `/template/job-template.yaml` — universal Job manifest with envsubst placeholders. Rendered by upgrade-step.sh and the detection CronJob via `envsubst | kubectl apply`. | | **ServiceAccount `k8s-upgrade-job`** | Used by both the detection CronJob and every chain Job. ClusterRole binding grants: nodes get/list/patch, pods/eviction create, pods delete, batch/jobs CRUD, PDB list (for predrain_unstick), CronJob get (snapshot trigger), namespaces patch on `k8s-upgrade` only. 
Namespace-scoped Role binding grants secrets:get on `k8s-upgrade-creds`. | | **ExternalSecret `k8s-upgrade-creds`** | Syncs `secret/k8s-upgrade/{ssh_key, slack_webhook}` from Vault. Mounted into every Job at `/secrets/k8s-upgrade`. | -| **CronJob `k8s-version-check`** | Sun 12:00 UTC. Probes apt + pkgs.k8s.io for target. If found, renders Job 0 from `job-template.yaml` and applies it. | +| **CronJob `k8s-version-check`** | 23:00 UTC nightly. Probes apt + pkgs.k8s.io for target. If found, renders Job 0 from `job-template.yaml` and applies it. | ### Pushgateway metrics @@ -115,7 +173,46 @@ Pushed by upgrade-step.sh during phase execution; observed by the - **`K8sVersionSkew`** — distinct kubelet/apiserver `gitVersion` count > 1 for 30m. Catches a half-done rollout. - **`EtcdPreUpgradeSnapshotMissing`** — `k8s_upgrade_in_flight==1 && k8s_upgrade_snapshot_taken==0` for 10m. Catches preflight Stage 2 failing silently. - **`K8sUpgradeStalled`** — `k8s_upgrade_in_flight==1 && time()-k8s_upgrade_started_timestamp > 5400` for 5m. Catches a Job in the chain dying without spawning its successor. -- All three alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade. +- **`K8sUpgradeChainJobFailed`** — `(kube_job_status_failed{namespace="k8s-upgrade",job_name=~"k8s-upgrade-(preflight|master|worker|postflight)-.*",reason=~"BackoffLimitExceeded|DeadlineExceeded"} > 0) unless on() (k8s_upgrade_blocked == 1)` for 15m (warning). Catches a phase Job that **terminally failed before `k8s_upgrade_in_flight` was set** — the preflight gates exit pre-metric, so the two `in_flight`-based alerts above are blind to a failed preflight (this is what hid the 5-day 1.34.9 wedge on 2026-06-12). Reason-scoped to terminal job conditions so a retry-success doesn't false-positive (a bare failed-pod-count would otherwise also block kured for the Job's 7d TTL). 
The `unless k8s_upgrade_blocked == 1` clause (added 2026-06-21) excludes a preflight that failed because the **compat gate deliberately refused** the target — that's owned by `K8sUpgradeBlocked` and was double-firing here; a genuine wedge exits without setting the blocked gauge, so it still fires. +- **`K8sUpgradeBlocked`** — `k8s_upgrade_blocked == 1` (warning). A k8s **auto-upgrade was refused** by the compat gate because a critical addon, an in-use deprecated API, or a node's containerd is too old for the detected target. The **specific reasons are in Slack**; clear it by upgrading the named addon / migrating the API caller / bumping containerd, after which the next nightly run proceeds (see "Auto-upgrade compat gate"). No upgrade was attempted, so this is not a half-done-rollout alert. +- The first four alerts ALSO block kured (same `--prometheus-url` halt-on-alert mechanism) so the OS-reboot pipeline can't run on top of a half-done version upgrade. + +### Nightly upgrade report (Slack) + +CronJob `k8s-upgrade-nightly-report` (k8s-upgrade ns, `var.report_schedule`, +default `7 6 * * *` = 06:07 UTC — after the 23:00 chain, before the 08:00 London +alert-digest) posts ONE Slack summary each morning of the previous night's run: +running version, detector freshness, detected target + kind, the outcome +(⚪ no upgrade needed / 🔴 blocked + live blocker reasons / 🟢 upgraded / +🟡 in progress / ⚠️ detector stale), and recent chain jobs. Read-only — it reads +the Pushgateway gauges + live nodes/jobs and re-runs `compat-gate.py` for fresh +blocker reasons; reuses the chain's SA + `slack_webhook` + scripts ConfigMap. +Logic + unit tests: `scripts/nightly-report.py`, `scripts/test_nightly_report.py`. +This is the day-to-day visibility layer (it does NOT replace the alerts above — +those fire on problems; this reports the outcome every night). 
Manual run: +`kubectl -n k8s-upgrade create job --from=cronjob/k8s-upgrade-nightly-report nightly-report-test` +(name it WITHOUT a `k8s-upgrade-{phase}-` prefix so a failure can't trip +`K8sUpgradeChainJobFailed`). + +### CoreDNS is NOT upgraded by kubeadm here + +CoreDNS runs a **custom split-horizon Corefile** (owned by the technitium stack) +and its image is tracked separately — it must NOT be touched by kubeadm. The +master `kubeadm upgrade apply` therefore runs with +`--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins +--skip-phases=addon/coredns` (in `scripts/update_k8s.sh`), so kubeadm upgrades +the control plane but leaves CoreDNS 100% untouched (image + Corefile). Without +the `--skip-phases`, forcing past the preflight makes kubeadm overwrite the +Corefile with its default and downgrade the image (verified via +`kubeadm upgrade apply --dry-run`). + +**Keep CoreDNS off Keel.** On 2026-06-12 Keel had auto-bumped CoreDNS +v1.12.1 → v1.12.4 (kube-system out-of-band annotation from the 2026-05-26 Keel +cascade), and 1.12.4 is ahead of kubeadm 1.34.9's corefile-migration table — +which is what blocked the 1.34.9 upgrade. CoreDNS is now `keel.sh/policy=never` +(`kubectl -n kube-system annotate deploy/coredns keel.sh/policy=never`). If a +future kubeadm minor ships a CoreDNS that DOES know the running version, drop the +`--skip-phases` for that run to let kubeadm re-take ownership. ### Vault secrets @@ -127,27 +224,54 @@ Exposed in K8s via ExternalSecret `k8s-upgrade-creds` in the `k8s-upgrade` names ## Common Operations -### Post-upgrade: restore apiserver OIDC (REQUIRED after any control-plane bump) +### apiserver OIDC + kubeadm upgrades (kubeadm-config reconciliation since 2026-06-24) `kubeadm upgrade apply` **regenerates `/etc/kubernetes/manifests/kube-apiserver.yaml` -and drops the `--authentication-config` flag**, silently disabling apiserver -OIDC (kubectl/kubelogin CLI **and** the web dashboard SSO break — tokens get -401). 
This is not auto-detected (the `rbac` stack's `null_resource` trigger is a -content hash that doesn't change). After any control-plane upgrade, re-apply: +from kubeadm-config**. apiserver auth uses a structured multi-issuer +`--authentication-config` (kubectl + dashboard SSO), but kubeadm-config used to +still carry the legacy single-issuer `--oidc-*` extraArgs — so every upgrade +reverted the flag, **silently breaking SSO after the upgrade** (the apiserver does +NOT crash on this — verified by isolated repro; it's recoverable via the restore +script below). NB: the **1.34→1.35 stall on 2026-06-24 was a *separate* issue — +etcd IO starvation**, not this drift; post-mortem: +`docs/post-mortems/2026-06-24-kubeadm-oidc-drift-apiserver-upgrade-stall.md`. + +**Primary fix (2026-06-24):** `stacks/rbac/modules/rbac/apiserver-oidc.tf` now +**reconciles kubeadm-config** (`kubeadm init phase upload-config kubeadm`, rewriting +`apiServer.extraArgs`: drop `--oidc-*`, add `--authentication-config`) as part of +its remote script. So kubeadm regenerates a **correct** manifest and the apiserver +upgrades with a pure image bump — `kubeadm upgrade diff <target>` shows only the +image change. Zero live impact (the CM is read only during an upgrade). + +**Backstops:** +- **Preflight check 4b** runs `kubeadm upgrade diff` and **alerts** (Slack WARN, does + NOT block — the drift only breaks SSO, which is recoverable) if + `--authentication-config` would still be dropped. +- The `rbac` stack still publishes its restore script to the + `kube-system/apiserver-oidc-restore` ConfigMap, and `phase_master` re-runs it on + master right after `kubeadm upgrade apply` (idempotent, `/livez`-gated with + auto-rollback, non-fatal) — now redundant belt-and-suspenders that *also* + re-reconciles kubeadm-config. Self-skips when master is already at target. 
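
The drift check behind the preflight backstop amounts to asking whether the diff removes the flag without adding it back. The sketch below assumes `kubeadm upgrade diff` emits a unified diff of the static-pod manifests; the function name and parsing are hypothetical and are not the code in `upgrade-step.sh`.

```python
FLAG = "--authentication-config"


def drops_auth_config(diff_text: str) -> bool:
    """True if the unified diff removes FLAG and does not add it back."""
    removed = added = False
    for line in diff_text.splitlines():
        if line.startswith(("---", "+++")):
            continue  # file headers, not content changes
        if FLAG not in line:
            continue
        if line.startswith("-"):
            removed = True
        elif line.startswith("+"):
            added = True
        # lines starting with a space are unchanged context: flag is kept
    return removed and not added
```

A `True` result maps to the Slack WARN path rather than a block, since the drift only breaks SSO and the restore script can recover it.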
+ +**Manual fallback** — only for an out-of-band/manual `kubeadm` upgrade, or if the +chain logged `WARN: --authentication-config absent after re-apply`: ```bash cd stacks/rbac TF_VAR_ssh_private_key="$(cat ~/.ssh/id_ed25519)" \ VAULT_ADDR=https://vault.viktorbarzin.me ../../scripts/tg apply \ - --non-interactive -target=module.rbac.null_resource.apiserver_oidc_config + --non-interactive -target=module.rbac.null_resource.apiserver_oidc_config \ + -replace=module.rbac.null_resource.apiserver_oidc_config ``` -(`ssh_private_key` must be a key authorized for `wizard@<master>`; it is not yet -wired from Vault.) The provisioner re-writes `/etc/kubernetes/pki/auth-config.yaml` -(both `kubernetes` + `k8s-dashboard` issuers), re-adds the flag, and -health-gates `/livez` with auto-rollback. Verify: `curl -sk -https://localhost:6443/livez` on the master = `ok`, and the apiserver manifest -contains `--authentication-config`. See `docs/plans/2026-06-04-k8s-dashboard-sso-design.md`. +(`-replace` is **required** — the `null_resource` trigger is a content hash that +doesn't change, so a plain `-target` apply is a no-op. `ssh_private_key` must be a +key authorized for `wizard@<master>`.) The provisioner re-writes +`/etc/kubernetes/pki/auth-config.yaml` (both `kubernetes` + `k8s-dashboard` +issuers), re-adds the flag, and health-gates `/livez` with auto-rollback. Verify: +`curl -sk https://localhost:6443/livez` on the master = `ok`, and the apiserver +manifest contains `--authentication-config`. See +`docs/plans/2026-06-04-k8s-dashboard-sso-design.md`. ### Verify the pipeline is healthy ```bash @@ -202,8 +326,18 @@ EOF ``` ### Kill a stuck Job (chain halted mid-flight) -The chain stalls if any Job dies without spawning its successor. `K8sUpgradeStalled` -fires after 90 min. Recovery: +A phase Job that dies without spawning its successor halts the chain. 
Two alerts +surface it: `K8sUpgradeStalled` (a mid-chain Job that died with `in_flight=1`, +after 90 min) and `K8sUpgradeChainJobFailed` (any phase that terminally failed, +after 15 min — including a **preflight** that aborted before `in_flight` was set, +which `K8sUpgradeStalled` cannot see). + +**Preflight failures now self-heal** (since 2026-06-17): the detection CronJob and +`spawn_next` delete + re-spawn a terminally-Failed Job instead of skipping it on +name-existence (retry-on-failure), so a transient preflight gate — e.g. a spurious +critical alert like the ttyd web-terminal probe that wedged 1.34.9 for 5 days — +clears on the next daily cycle. A mid-chain phase that keeps failing still needs +manual recovery: fix the root cause, then: ```bash # 1. Identify the failed Job @@ -323,6 +457,13 @@ kill %1 ## Past Incidents +### 2026-06-18 — Preflight gate-4 wedged a partial (master-ahead) chain +- A prior 1.34.9 run upgraded k8s-master + k8s-node1, then stopped; node2-6 stayed on 1.34.8. +- Every nightly preflight then aborted at the **kubeadm-plan-target gate**: `kubeadm upgrade plan` runs on k8s-master, already on 1.34.9, so it emitted no `kubeadm upgrade apply vX.Y.Z` line → empty `plan_target` → `'' != '1.34.9'` → `exit 1`. Deterministic, not transient (gates 1-3 all green; no critical alert was firing). The failed preflight self-cleaned each night (2026-06-17 retry-on-failure) but re-failed identically. +- The two `in_flight`-based alerts stayed blind (preflight aborts pre-metric); `K8sUpgradeChainJobFailed` (warning) surfaced it. +- **Collateral**: the earlier master bump had also dropped apiserver `--authentication-config` (SSO broke); restored separately via the `rbac` stack's `apiserver_oidc_config`. +- **Mitigation**: `phase_preflight` now **skips the kubeadm-plan-target gate when k8s-master is already on TARGET_VERSION** (mirrors the at-target self-skip already in `phase_master`/`phase_worker`). 
Remaining workers are validated by their own phases; the detector's apt-cache probe already confirmed the target is installable. + ### 2026-05-11 — Self-preemption (agent → Job-chain rewrite) - The v1 agent ran inside the `claude-agent-service` Deployment (replicas=1, no nodeSelector) and was scheduled to k8s-node4. - During Stage 6 (first worker drain) the agent ran `kubectl drain k8s-node4` — evicting itself. @@ -336,6 +477,8 @@ kill %1 |------|-------| | Stack (CronJob + ConfigMaps + SA/RBAC + ExternalSecret) | `infra/stacks/k8s-version-upgrade/main.tf` | | Universal phase body | `infra/stacks/k8s-version-upgrade/scripts/upgrade-step.sh` | +| Compat gate (addon/API/containerd block logic) | `infra/stacks/k8s-version-upgrade/scripts/compat-gate.py` | +| Compat matrix (addon → highest supported k8s minor) | `infra/stacks/k8s-version-upgrade/scripts/addon-compat.json` | | Job template | `infra/stacks/k8s-version-upgrade/job-template.yaml` | | Per-node upgrade script | `infra/scripts/update_k8s.sh` | | Upgrade Gates alerts | `infra/stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl` (group "Upgrade Gates") | diff --git a/docs/runbooks/mailserver-pfsense-haproxy.md b/docs/runbooks/mailserver-pfsense-haproxy.md index 329be214..f0780caf 100644 --- a/docs/runbooks/mailserver-pfsense-haproxy.md +++ b/docs/runbooks/mailserver-pfsense-haproxy.md @@ -55,7 +55,7 @@ External mail (WAN) path — PROXY v2 │ pfSense WAN:{25,465,587,993} │ │ │ NAT rdr → 10.0.20.1:{same} │ │ ▼ │ -│ pfSense HAProxy (mode tcp, 4 frontends, 4 backend pools) │ +│ pfSense HAProxy (mode tcp, 5 frontends, 6 backend pools) │ │ │ data: send-proxy-v2 → :{30125..30128} (PROXY-aware pod) │ │ │ health: TCP-check → :{30145..30147} (no-PROXY pod) │ │ │ inter 5000 │ @@ -113,6 +113,28 @@ kubectl logs -c docker-mailserver deployment/mailserver -n mailserver \ # Expect external source IPs (e.g., Brevo 77.32.148.x), NOT 10.0.20.x ``` +## SNI-routed internal :443 frontend (2026-06-10) + 
+`internal_https_443` binds `10.0.20.1:443` + `10.0.10.1:443` and completes +the internal port table of the mail front door so `mail.viktorbarzin.me` +(internal A record → 10.0.20.1) serves webmail too. Routing (Viktor's +design — route by what the client asked for): + +| Client connects with | Routed to | +|---|---| +| SNI = `pfsense.viktorbarzin.{lan,me}` | webgui backend `127.0.0.1:8443` | +| any other SNI (hostnames, e.g. `mail.…`) | Traefik `10.0.20.203:443`, send-proxy-v2 | +| no SNI (bare IP — `https://10.0.20.1`) | webgui backend `127.0.0.1:8443` | + +The **pfSense webGUI was moved to `:8443`** (config.xml +`system.webgui.port`, 2026-06-10) to free the 443 socket; admin access by +IP keeps working through the no-SNI route, and `:8443` remains a direct +fallback if HAProxy is down. The `pfsense.viktorbarzin.me` Traefik ingress +(stacks/reverse-proxy) targets `:8443` directly. Traefik leg mirrors the +IPv6 bridge: send-proxy-v2 (Traefik trusts 10.0.20.1), **no health check** +(PROXY-expecting receivers reject bare probes — gotcha above). All of this +is declared in `pfsense-haproxy-bootstrap.php` — re-run to reset. + ## Bootstrap / restore from scratch pfSense HAProxy config lives in `/cf/conf/config.xml` under diff --git a/docs/runbooks/offboard-user.md b/docs/runbooks/offboard-user.md index 104f4fcd..05c0c5a8 100644 --- a/docs/runbooks/offboard-user.md +++ b/docs/runbooks/offboard-user.md @@ -29,7 +29,26 @@ gated `userdel_archive`, which is **never** auto-applied). sudo systemctl disable --now t3-serve@<os_user>.service sudo passwd -l <os_user> ``` -4. **Verify:** they can no longer reach `t3.viktorbarzin.me` (302 → Authentik, then +4. 
**Revoke git + group access** *(manual)*: + ```bash + # legacy secret-bearing group, if they were ever in it + sudo gpasswd -d <os_user> code-shared + # drop write access to the infra repo + curl -X DELETE -H "Authorization: token <admin_pat>" \ + https://forgejo.viktorbarzin.me/api/v1/repos/viktor/infra/collaborators/<forgejo_login> + # if they were whitelisted for direct master push, remove them from the + # branch-protection whitelists (PATCH with the remaining usernames) + curl -X PATCH -H "Authorization: token <admin_pat>" -H 'Content-Type: application/json' \ + https://forgejo.viktorbarzin.me/api/v1/repos/viktor/infra/branch_protections/master \ + -d '{"push_whitelist_usernames":["viktor"],"merge_whitelist_usernames":["viktor"]}' + # revoke their devvm git PAT (token name: devvm-infra-git; admin PAT may + # manage other users' tokens — verified 2026-06-10; the CLI has no delete) + curl -X DELETE -H "Authorization: token <admin_pat>" \ + https://forgejo.viktorbarzin.me/api/v1/users/<forgejo_login>/tokens/devvm-infra-git + ``` + Note: their already-running sessions keep dropped groups until cycled — restart + `t3-serve@<os_user>` to enforce immediately. +5. **Verify:** they can no longer reach `t3.viktorbarzin.me` (302 → Authentik, then denied once removed from the `T3 Users` group — Part C) and cannot log in. Nothing is deleted; re-adding the roster entry + reconcile fully restores them. 
diff --git a/docs/runbooks/pfsense-unbound.md b/docs/runbooks/pfsense-unbound.md index 19e0e5dc..bfa2a19e 100644 --- a/docs/runbooks/pfsense-unbound.md +++ b/docs/runbooks/pfsense-unbound.md @@ -93,6 +93,55 @@ Verify via the Technitium API: curl -sk "http://127.0.0.1:5380/api/zones/options/get?token=$TOK&zone=viktorbarzin.lan" | jq .response.zoneTransfer ``` +## Domain Override: viktorbarzin.me → Technitium (2026-06-10) + +`$config['unbound']['domainoverrides']` carries one entry forwarding the +whole `viktorbarzin.me` zone to Technitium `10.0.20.201` (forward-zone, not +AXFR). Every Unbound client — all VLANs + 192.168.1.x via the WAN listener — +gets Technitium's internal split-horizon answers: ingress hosts CNAME to the +zone apex whose A record auto-tracks the live Traefik LB IP +(`technitium-ingress-dns-sync` + `viktorbarzin-apex-probe` canary). This is +what keeps kubelet forgejo image pulls (and everything else on 10.0.x) off +the broken public NAT-hairpin with zero per-host DNS config — see +`docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md`. + +Notes: + +- The domain is NOT DNSSEC-signed (no DS records), so no `domain-insecure` + needed; private-IP answers pass without `private-domain` custom options + (verified empirically — pfSense handles domain overrides correctly). +- **Cluster-outage behavior**: the zone SERVFAILs while Technitium is down + (forward-zone, no local copy). Deliberate — the services are down anyway. + Contrast with `viktorbarzin.lan`, which is AXFR-slaved to survive outages. +- **In-cluster pods must NOT see these answers** (Traefik LB is ETP=Local, + unreachable from pods). CoreDNS has a dedicated `viktorbarzin.me:53` + carve-out (stacks/technitium) — do not remove it while this override exists. 
+- Added with the standard SSH + PHP pattern (see "host override" memories / + this file's style): + +```php +require_once("config.inc"); require_once("unbound.inc"); +global $config; +$config["unbound"]["domainoverrides"][] = [ + "domain" => "viktorbarzin.me", "ip" => "10.0.20.201", + "descr" => "...", "tls_hostname" => "", +]; +write_config("add viktorbarzin.me domain override -> Technitium"); +services_unbound_configure(); +``` + +Rollback: remove the entry from the array (match on `domain`), then +`write_config()` + `services_unbound_configure()`. Pre-change backup: +`/cf/conf/config.xml.bak-2026-06-10-pre-me-forward` (on-box). + +Verify: + +``` +dig +short @10.0.20.1 forgejo.viktorbarzin.me # apex CNAME + live Traefik IP +dig +short @10.0.20.1 mail.viktorbarzin.me # 10.0.20.1 (non-Traefik record) +dig +short @10.0.20.1 google.com # public, unaffected +``` + ## Operational Checks ```bash diff --git a/docs/runbooks/t3-drop-attribution.md b/docs/runbooks/t3-drop-attribution.md new file mode 100644 index 00000000..df4cef09 --- /dev/null +++ b/docs/runbooks/t3-drop-attribution.md @@ -0,0 +1,126 @@ +# t3 drop attribution — "is it infra or my network?" + +When a t3 user reports "disconnects, then self-recovers after a few seconds", +that is the t3 **client watchdog**: the browser heartbeats every 10s and force- +reconnects after >20s without a response. Any stall or break anywhere on +browser → Cloudflare → tunnel → Traefik → t3-dispatch → `t3 serve` produces +the identical symptom. This runbook attributes a drop to a segment in minutes. + +## 1. 
Check the probe (first stop) + +The in-cluster `t3-probe` (stacks/t3code, scrape job `t3-probe`) holds three +permanent legs that differ only in path segment: + +| leg | path under test | drop means | +|---|---|---| +| `cloudflare` | WAN → CF edge → tunnel → cloudflared → Traefik → dispatch | Cloudflare/WAN segment | +| `internal` | Traefik LB (10.0.20.203) → dispatch (no Cloudflare) | Traefik / dispatch / devvm network | +| `t3serve` | HTTP straight to devvm:3773 (`t3 serve` process) | the serve process itself (event-loop stall) | + +Prometheus queries: + +```promql +increase(t3probe_disconnects_total[1h]) # drops per leg+reason +t3probe_connected # current state per leg +histogram_quantile(0.99, rate(t3probe_rtt_seconds_bucket{leg="t3serve"}[15m])) +``` + +Attribution table: + +- `cloudflare` drops, `internal` clean → Cloudflare edge / QUIC tunnel / WAN. +- both WS legs drop together → Traefik, dispatch, or devvm reachability. +- `t3serve` RTT spikes / timeouts → the user's `t3 serve` stalled (see §3). +- **all legs clean while the user dropped → their last mile / device. Infra + is exonerated, with data.** + +Alerts `T3ProbeLegDown` / `T3ProbeDropBurst` fire on sustained breakage. + +## 1b. Connection logs in Loki (passive, always-on — catch a real drop) + +Three layers of the real path log every t3 `/ws` connection to Loki, so a drop +the user actually experienced is attributable after the fact without a repro. A +drop is **a short-lived `/ws` connection** (a healthy session holds one socket +for hours); the client's 20s heartbeat watchdog reconnects on any break. 
+ +| Layer | Loki stream | What it tells you | +|---|---|---| +| Traefik | `{job="traefik"}` ⟶ filter `t3code-t3` + `GET /ws` | per-connection **duration** (trailing `…ms`) + edge (cloudflared pod) IP | +| cloudflared | `{job="cloudflared"}` ⟶ filter `t3.viktorbarzin.me/ws` | CF-tunnel-side close (`ended abruptly: context canceled` = browser/CF side hung up) | +| t3-dispatch | `{job="devvm-journal",unit="t3-dispatch.service"} \|= "ws close"` | **`dur_ms` + `cause`** — the discriminator below | + +`cause` on the dispatch `ws close` line: +- **`downstream_closed`** — client / Cloudflare / Traefik tore the socket down + (`context canceled`). Short `dur_ms` = client watchdog firing → a **last-mile / + network-quality** drop (or CF/tunnel blip); t3-serve was fine. +- **`upstream_closed`** — the user's `t3 serve` closed/reset (reset by peer / EOF + / refused) → t3-serve stall/restart/OOM. +- **`graceful`** — clean close from either side (e.g. the client watchdog's + `disconnect()` after a >20s heartbeat gap). Cross-check `dur_ms`: a ~20s+ + graceful close with no devvm pressure spike (§3) is a heartbeat-timeout whose + stall was NOT on devvm → last-mile. + +Triage query (Grafana Explore → Loki) — every short t3 socket in a window: + +```logql +{job="devvm-journal", unit="t3-dispatch.service"} |= "ws close" + | regexp `dur_ms=(?P<dur>[0-9]+) cause=(?P<cause>\S+)` | dur < 120000 +``` + +Line the timestamp up against `{job="traefik"}` (duration + edge IP) and +`{job="cloudflared"}` (CF-side close) for the same second to localise the layer. +devvm journald (incl. `t3-serve@<user>`) ships via `scripts/devvm-promtail.*`. + +## 2. 
Server-side log recipe (per-event forensics) + +On devvm (timestamps in UTC): + +```bash +# dispatch view — error class identifies which side died: +# "context canceled" = front/client side tore down +# "connection reset by peer 127.0.0.1:PORT" = that user's serve closed +# "connection refused" = that user's serve was down +journalctl -u t3-dispatch --since "1 hour ago" | grep "proxy error" + +# mass-cancel bursts (many same-second cancels = shared-segment break): +journalctl -u t3-dispatch --since "6 hours ago" \ + | grep -oE '^.* [0-9:]+ http: proxy error: context canceled' \ + | awk '{print $6}' | sort | uniq -c | awk '$1>=5' + +# serve-side starvation markers (git taking >5s = devvm frozen): +journalctl -u t3-serve@<user> --since "6 hours ago" | grep "timed out" + +# tunnel-side: cloudflared pod restarts + per-connection events +kubectl -n cloudflared get pods +kubectl -n cloudflared logs <pod> --since=6h | grep -E "ERR|reconnect" +``` + +## 3. devvm pressure correlation + +devvm node_exporter is scraped as job `devvm` (since 2026-06-10). The known +high-frequency drop mechanism is **memory+IO pressure on devvm**: agent +processes live inside `t3-serve@<user>`'s cgroup; a runaway agent swap-thrashes +the spinning root disk and freezes the box in multi-10s windows — every +connected client's watchdog fires at once (2026-06-10: a 10.8G agent → global +OOM → 8.5min hard outage). + +```promql +rate(node_pressure_io_stalled_seconds_total{instance="devvm"}[5m]) +rate(node_pressure_memory_stalled_seconds_total{instance="devvm"}[5m]) +node_memory_SwapFree_bytes{instance="devvm"} +``` + +Guardrails in place (2026-06-10, `scripts/t3-serve@.service`): per-unit +`MemoryHigh=12G`, `MemoryMax=16G`, `MemorySwapMax=0`, `OOMPolicy=continue` — +a runaway agent now OOMs alone inside the cgroup instead of taking the box +(and the WS server) with it. + +## 4. Known root causes (2026-06-10 investigation) + +1. **devvm memory/IO storms** (high-frequency mechanism) — §3. +2. 
**cloudflared in-place autoupdate** — fixed: `--no-autoupdate` + (stacks/cloudflared). Before the fix every CF release exited all 3 pods + (code 11), severing all tunnel WebSockets. +3. **QUIC tunnel churn** (~1–2/day, "no recent network activity") — inherent; + visible as `cloudflare`-leg-only blips. +4. **t3 nightly autoupdate** — pinned after the 2026-06-09 outage, see + `docs/post-mortems/2026-06-09-t3-nightly-autoupdate-auth-outage.md`. diff --git a/docs/runbooks/t3-version-bump.md b/docs/runbooks/t3-version-bump.md new file mode 100644 index 00000000..cf8359e5 --- /dev/null +++ b/docs/runbooks/t3-version-bump.md @@ -0,0 +1,117 @@ +# Runbook: t3 version — gated nightly tracker (freeze / revert / roll back) + +t3 on the devvm **auto-tracks the `nightly` npm dist-tag** (Viktor, 2026-06-16, +risk explicitly accepted), via the daily `t3-autoupdate` timer. Every bump is +GATED so a bad nightly self-heals instead of repeating 2026-06-09. This reverses +the post-incident pin decision — read `2026-06-09-t3-nightly-autoupdate-auth-outage.md` +for why every guard below exists. t3 is still pre-1.0 and ships breaking changes +between builds; the gate is what makes auto-tracking safe. + +## How the tracker gates each bump (`scripts/t3-autoupdate.sh`) + +1. **Freeze gate** — `/etc/t3-autoupdate.freeze` present (or `T3_PIN=<ver>` set) → + hold at current, do nothing. +2. **Resolve + downgrade-guard** — `npm view t3@nightly version`; proceed only if + the target is strictly newer than installed AND a `-nightly.` build (the tag is + mutable and can point backward). +3. **Pre-bump backup** — online `VACUUM INTO` of every user's `state.sqlite` to + `/var/backups/t3-state/<u>/state-prebump-<ver>-<ts>.sqlite` (runs AS the owner; + never stops a serve). Rollback is then a RESTORE, not sqlite surgery. +4. 
**Install + health-check** — `npm i -g t3@<ver>`, then start a throwaway serve + SEEDED WITH A COPY of wizard's real populated `state.sqlite` (scratch on + `/var/tmp`, not the 2 GB tmpfs `/tmp`) so it exercises the forward MIGRATION + (the 2026-06-09 failure class) + the real mint→exchange→`t3_session` pairing + handshake. Fail → roll back binary to last-good, exit (no serve migrated yet → + clean). +5. **Canary rollout** — restart IDLE instances one at a time, verifying pairing + through the real dispatch after each. First failure → roll back binary + + restore that user's DB from the pre-bump backup + **self-freeze** (touch the + freeze file) so it cannot re-flap onto bad builds. Active-agent instances are + DEFERRED (never killed) and migrate on their next idle restart. +6. **Last-good** — advanced to the new version only on full success + (`/var/lib/t3-autoupdate/last-good`); it is the rollback target. + +Detection backstop (real-user pairing failures / endpoint fallback): the dispatch +logs every outcome (`paired user=.. endpoint=.. fallback=..`, plus `mint/pairing +... failed`) → Loki alerts `T3PairingBroken` / `T3PairFallbackHigh` / +`T3AutoUpdateRolledBack` / `T3AutoUpdateRollbackFailed` / `T3AutoUpdateFrozen` → +Alertmanager → Slack. + +## Idle migrator — draining deferrals (`scripts/t3-migrate-idle.sh`) + +Step 5 DEFERS any instance with an active agent, recording `/var/lib/t3-autoupdate/deferred/<user>` (= the target version). Without a drainer, a user busy at every 04:00 window never migrates and their client shows *"Client and server versions differ"* for days. 
`t3-migrate-idle.timer` (overnight, every 20 min 01:00–05:40) drains those markers: + +- Per marker: skip + clear if the unit is gone or was already restarted *after* the deferral; otherwise restart the still-stale `t3-serve@<u>` onto the current binary **only when that user is idle** — `state.sqlite` shows zero `active_turn_id` (no in-flight turn) AND ≥ `T3_MIGRATE_QUIET_SECONDS` (default 900 = 15 min) since the last thread activity — then verify pairing and clear the marker. **Fail-closed:** any query/parse doubt → skip, retry next tick. +- It restarts via the SAME `safe_restart_unit` the daily canary uses (sourced `t3-safe-restart.sh`: backup → restart → verify → recover). The shared `/etc/t3-autoupdate.freeze` halts it too. +- **Force / preview:** + ```bash + sudo systemctl start t3-migrate-idle.service # run a drain pass now (still idle-gated) + sudo env T3_DRY_RUN=1 /usr/local/bin/t3-migrate-idle # log decisions, act on nothing + ``` +- **Rare-tail failure:** if a deferred user's forward migration fails at idle restart (already gated against a copy of their real DB at install), `safe_restart_unit` restores their DB + freezes + alerts. The binary rollback is a no-op (the build was already accepted, so other users are unaffected), but that user's serve may crashloop on the restored DB until the freeze is cleared and the build investigated (manual rollback below). + +## Operations + +**Freeze / revert (stop tracking right now — the fast "make it stop"):** +```bash +sudo touch /etc/t3-autoupdate.freeze # holds at the current build; next run is a no-op + fires T3AutoUpdateFrozen +sudo rm -f /etc/t3-autoupdate.freeze # resume tracking +``` + +**Pin to an exact version (instead of tracking nightly):** set `T3_PIN=<ver>` in +the unit environment (or the `scripts/t3-autoupdate.sh` default) — the tracker +enforces it and stops following nightly. Keep in sync with `setup-devvm.sh`. 
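As a rough illustration of the freeze, pin, and downgrade-guard gates described above, the decision order might look like the sketch below. It is not the contents of `scripts/t3-autoupdate.sh`: the function name `gate_decision`, its arguments, its output words, and any version strings passed to it are hypothetical, and `sort -V` (GNU coreutils) only approximates the "strictly newer" comparison.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the gate order; NOT the real tracker script.

# gate_decision <freeze_file> <pin> <installed_ver> <candidate_ver>
# Prints exactly one word: frozen | pinned | skip | bump
gate_decision() {
  local freeze_file="$1" pin="$2" installed="$3" candidate="$4" newest

  # 1. Freeze gate: the freeze file wins over everything else.
  if [[ -e "$freeze_file" ]]; then echo frozen; return 0; fi

  # 2. Pin: an explicit version stops nightly tracking.
  if [[ -n "$pin" ]]; then echo pinned; return 0; fi

  # 3. Downgrade guard, part one: only `-nightly.` builds are eligible.
  if [[ "$candidate" != *-nightly.* ]]; then echo skip; return 0; fi

  # 4. Downgrade guard, part two: the candidate must be strictly newer.
  #    Equal versions are rejected first; then version-sort picks the larger.
  if [[ "$candidate" == "$installed" ]]; then echo skip; return 0; fi
  newest="$(printf '%s\n%s\n' "$installed" "$candidate" | sort -V | tail -n 1)"
  if [[ "$newest" == "$candidate" ]]; then echo bump; else echo skip; fi
}
```

Checking the freeze file before anything else keeps the "make it stop" path independent of the registry lookup, so a frozen host stays frozen even if the `nightly` tag cannot be resolved.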
+ +**Preview the current nightly without touching anything (no global change, no restarts):** +```bash +sudo T3_DRY_RUN=1 /usr/local/bin/t3-autoupdate # installs candidate to a temp prefix, runs the full gate, reports PASS/FAIL +``` + +**Force a run now (instead of waiting for 04:00):** +```bash +sudo systemctl start t3-autoupdate.service # runs in its own cgroup, isolated from the t3-serve@ instances it manages +``` + +## What a bump touches (still true) + +1. **Pairing API** — t3 renamed `POST /api/auth/bootstrap` → `/api/auth/browser-session` + in 0.0.25. `t3-dispatch` is version-agnostic (`pairEndpoints` in + `scripts/t3-dispatch/main.go` tries browser-session, falls back to bootstrap). + If a future build renames it AGAIN, the health-check + canary fail the bump and + self-freeze — then add the new path to `pairEndpoints`, rebuild + redeploy the + dispatch, and clear the freeze. +2. **Schema** — 0.0.25+ migrate every `~/.t3/userdata/state.sqlite` FORWARD — a + **one-way door**. A binary downgrade alone does NOT roll it back; you must + restore the DB. The tracker does this automatically on a canary failure; do it + by hand (below) if a problem surfaces *after* a successful bump. + +## Manual rollback (problem surfaces after a bump the gate let through) + +```bash +GOOD=$(cat /var/lib/t3-autoupdate/last-good) # or the known-good version you want +sudo touch /etc/t3-autoupdate.freeze # stop the tracker FIRST +sudo npm i -g "t3@$GOOD" +# Restore + restart each user's serve. The wizard/admin instance: run this from +# OUTSIDE its own t3 session (stopping the serve you're running inside kills you); +# or just let it pick up $GOOD on its next natural restart. 
+for u in $(awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/ /,"",$2);print $2}' /etc/ttyd-user-map | sort -u); do + bak=$(sudo ls -1t /var/backups/t3-state/$u/state-prebump-* 2>/dev/null | head -1) + [ -n "$bak" ] || continue + sudo systemctl stop t3-serve@$u + sudo install -o "$u" -g "$u" -m600 "$bak" /home/$u/.t3/userdata/state.sqlite + sudo rm -f /home/$u/.t3/userdata/state.sqlite-wal /home/$u/.t3/userdata/state.sqlite-shm + sudo systemctl start t3-serve@$u +done +``` + +## Verify (any user pairs cleanly through the dispatch) + +```bash +for u in vbarzin emil.barzin ancaelena98; do + curl -sI -H "X-authentik-username: $u" http://10.0.10.10:3780/ | grep -iE 'HTTP/|set-cookie: t3_session' +done # each must be 302 + t3_session +t3 --version +``` + +(The 2026-06-09 incident had no pre-bump backup, so rollback meant per-user sqlite +surgery. The tracker now takes a guaranteed pre-bump snapshot — rollback is a restore.) diff --git a/modules/create-template-vm/cloud_init.yaml b/modules/create-template-vm/cloud_init.yaml index fcd634e3..11a86b6e 100644 --- a/modules/create-template-vm/cloud_init.yaml +++ b/modules/create-template-vm/cloud_init.yaml @@ -8,6 +8,13 @@ users: sudo: ALL=(ALL) NOPASSWD:ALL ssh_authorized_keys: - ${authorized_ssh_key} + # k8s-upgrade pipeline key (matches Vault secret/k8s-upgrade/ssh_key_pub). + # The automated k8s-version-upgrade chain SSHes in as `wizard` to drain + + # upgrade each node; WITHOUT this a freshly-provisioned node is invisible + # to the upgrade pipeline (node4/5/6 hit exactly this — Permission denied — + # 2026-06-17). Hardcoded: it's a public key and the keypair is stable; if + # it's ever rotated, update this line and Vault together. 
+ - ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIElH9x76UNA8UNxrxTjREYz4hz1fbCdRwAXbOkJ5FnSM k8s-upgrade-pipeline passwd: ${passwd} lock_passwd: false # enable passwd login shell: /bin/bash @@ -90,20 +97,24 @@ runcmd: - sed -i 's/#Compress=yes/Compress=yes/' /etc/systemd/journald.conf - systemctl restart systemd-journald %{if is_k8s_template} - # systemd-resolved global DNS fallback. Without this, only the - # link-level DNS from Proxmox's `qm set --nameserver` (Technitium, - # 10.0.20.201) is consulted — and Technitium returns NXDOMAIN for - # forgejo.viktorbarzin.me, so kubelet image pulls from the Forgejo - # registry break. Public DNS upstream + Technitium fallback matches - # the pre-existing manual setup on k8s-node1..4. - - mkdir -p /etc/systemd/resolved.conf.d - - | - cat > /etc/systemd/resolved.conf.d/global-dns.conf <<'EOF' - [Resolve] - DNS=8.8.8.8 1.1.1.1 - FallbackDNS=10.0.20.201 - EOF - - systemctl restart systemd-resolved + # Node DNS is intentionally STOCK — no resolved drop-ins, no /etc/hosts + # pins. Link nameservers come from Proxmox `qm set --nameserver + # "10.0.20.1 94.140.14.14"` (pfSense + public secondary; set this when + # cloning a new node VM). Internal split-horizon for *.viktorbarzin.me + # happens at pfSense Unbound: a domain override forwards the zone to + # Technitium (10.0.20.201), whose split-horizon zone CNAMEs every ingress + # host to the apex A record that auto-tracks the live Traefik LB IP — so + # every VLAN client, nodes included, gets internal answers with zero + # per-host config (2026-06-10; runbook: docs/runbooks/pfsense-unbound.md). + # Pods get the SAME internal answers via CoreDNS's `viktorbarzin.me:53` + # block forwarding to the Technitium ClusterIP (+ forgejo pinned to + # Traefik's ClusterIP for Technitium-outage resilience; stacks/technitium. + # Pods reach the ETP=Local LB IP fine on k8s 1.34 — verified 2026-06-10). + # History: a global-dns.conf drop-in (public DNS primary) lived here until + # 2026-06-10. 
Its rationale ("Technitium NXDOMAINs forgejo.viktorbarzin.me") + # had long been obsolete, and it steered fresh forgejo pulls onto the broken + # public NAT-hairpin (7.5h tuya-bridge outage — see + # docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md). # Re-enabled 2026-05-10: unattended-upgrades is back on, but with a tight # Allowed-Origins list, a Package-Blacklist for k8s/containerd/runc/calico, # and Automatic-Reboot disabled (kured + sentinel-gate handles reboots in a diff --git a/modules/create-template-vm/k8s-node-containerd-setup.sh b/modules/create-template-vm/k8s-node-containerd-setup.sh index 20054876..98a71739 100755 --- a/modules/create-template-vm/k8s-node-containerd-setup.sh +++ b/modules/create-template-vm/k8s-node-containerd-setup.sh @@ -49,10 +49,16 @@ server = "https://ghcr.io" capabilities = ["pull", "resolve"] GHCR -# Forgejo OCI registry: prefer in-cluster Traefik LB (10.0.20.200) to -# avoid hairpin NAT. Traefik serves the *.viktorbarzin.me wildcard so -# SNI verification succeeds. If the mirror is unreachable, fall back to -# public DNS resolution (needs the global DNS fallback set up below). +# Forgejo OCI registry. NOTE: this hosts.toml mirror is VESTIGIAL — it +# cannot keep pulls off the public hairpin on its own (Traefik routes by +# Host/SNI and 404s the mirror's bare-IP requests, and the registry's +# Bearer auth realm is the absolute https://forgejo.viktorbarzin.me/v2/token +# URL fetched outside the mirror). What actually keeps forgejo pulls +# internal is the pfSense Unbound domain override forwarding +# viktorbarzin.me -> Technitium, whose split-horizon zone serves the live +# Traefik LB IP (no node-side DNS config at all). +# Kept for config uniformity; harmless. See +# docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md. 
mkdir -p /etc/containerd/certs.d/forgejo.viktorbarzin.me cat > /etc/containerd/certs.d/forgejo.viktorbarzin.me/hosts.toml <<'FORGEJO' server = "https://forgejo.viktorbarzin.me" diff --git a/modules/kubernetes/ingress_factory/main.tf b/modules/kubernetes/ingress_factory/main.tf index 0f239fb4..fc9bc9f5 100644 --- a/modules/kubernetes/ingress_factory/main.tf +++ b/modules/kubernetes/ingress_factory/main.tf @@ -107,10 +107,6 @@ variable "custom_content_security_policy" { type = string default = null } -variable "exclude_crowdsec" { - type = bool - default = false -} variable "full_host" { type = string default = null @@ -310,7 +306,6 @@ resource "kubernetes_ingress_v1" "proxied-ingress" { "traefik-error-pages@kubernetescrd", var.skip_default_rate_limit ? null : "traefik-rate-limit@kubernetescrd", var.custom_content_security_policy == null ? "traefik-csp-headers@kubernetescrd" : null, - var.exclude_crowdsec ? null : "traefik-crowdsec@kubernetescrd", local.effective_anti_ai ? "traefik-ai-bot-block@kubernetescrd" : null, local.effective_anti_ai ? "traefik-anti-ai-headers@kubernetescrd" : null, local.auth_middleware, diff --git a/scripts/apply-mbps-caps.sh b/scripts/apply-mbps-caps.sh index 39bfb36b..5358f7db 100755 --- a/scripts/apply-mbps-caps.sh +++ b/scripts/apply-mbps-caps.sh @@ -27,6 +27,12 @@ TARGETS=( "220:scsi0:40:40" # docker-registry ) +# Sort a disk spec's comma-separated options so two specs with the same +# option set but different key order compare equal. +normalized() { + tr ',' '\n' <<<"$1" | LC_ALL=C sort | paste -sd, - +} + apply_one() { local spec="$1" local vmid slot rd wr @@ -49,8 +55,13 @@ apply_one() { newvalue="${cleaned},mbps_rd=${rd},mbps_wr=${wr}" # Skip the qm-set call entirely when state already matches — keeps - # journal noise low under the hourly timer. - if [[ "$current" == "$newvalue" ]]; then + # journal noise low under the hourly timer. 
Compare option SETS, not raw + # strings: `qm config` prints keys in its own canonical order, so a raw + # compare never matched and every hourly run re-issued `qm set`, which + # live-rewrites the running VM's QEMU throttle state via QMP (implicated + # in the 2026-06-11 devvm I/O stall — see + # docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md). + if [[ "$(normalized "$current")" == "$(normalized "$newvalue")" ]]; then echo "vmid $vmid: $slot already at mbps_rd=${rd},mbps_wr=${wr} — no-op" return 0 fi diff --git a/scripts/breakglass-firewall.sh b/scripts/breakglass-firewall.sh new file mode 100644 index 00000000..51260cb9 --- /dev/null +++ b/scripts/breakglass-firewall.sh @@ -0,0 +1,26 @@ +#!/usr/bin/env bash +set -euo pipefail +# Break-glass base firewall (redesigned 2026-06-11; replaced the port-knock gate). +# +# Source of truth. Deploy to the PVE host with: +# scp scripts/breakglass-firewall.sh root@192.168.1.127:/usr/local/sbin/breakglass-firewall.sh +# ssh root@192.168.1.127 'chmod 0755 /usr/local/sbin/breakglass-firewall.sh && systemctl restart breakglass-firewall.service' +# The breakglass-firewall.service oneshot runs this at boot (RemainAfterExit). +# +# Model: key-only SSH break-glass on :52222, openly reachable from the WAN, NO +# port-knock. The SSH key is the gate (brute-force-proof); the rate-limit below +# only trims scanner noise / slows a hypothetical sshd 0-day. +# :22 -> LAN admin (all of root's keys), always allowed. +# :52222 -> WAN break-glass. LAN/VLAN sources bypass the limit; external NEW +# connections are rate-limited per source IP, then accepted. 
+iptables -N BREAKGLASS 2>/dev/null || iptables -F BREAKGLASS +iptables -C INPUT -j BREAKGLASS 2>/dev/null || iptables -I INPUT 1 -j BREAKGLASS + +iptables -A BREAKGLASS -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT +iptables -A BREAKGLASS -p tcp --dport 22 -j ACCEPT +iptables -A BREAKGLASS -p tcp --dport 52222 -s 192.168.1.0/24 -j ACCEPT +iptables -A BREAKGLASS -p tcp --dport 52222 -s 10.0.0.0/8 -j ACCEPT +iptables -A BREAKGLASS -p tcp --dport 52222 -m conntrack --ctstate NEW \ + -m hashlimit --hashlimit-name bg_ssh --hashlimit-mode srcip \ + --hashlimit-above 6/min --hashlimit-burst 3 -j DROP +iptables -A BREAKGLASS -p tcp --dport 52222 -j ACCEPT diff --git a/scripts/claude-auth-sync@.service b/scripts/claude-auth-sync@.service new file mode 100644 index 00000000..3750f295 --- /dev/null +++ b/scripts/claude-auth-sync@.service @@ -0,0 +1,20 @@ +[Unit] +Description=Validate and back up Claude OAuth credentials for %i +Documentation=https://github.com/ViktorBarzin/infra/blob/master/docs/runbooks/claude-auth-renew-workstation.md +Wants=network-online.target +After=network-online.target + +[Service] +Type=oneshot +User=%i +Group=%i +Environment=HOME=/home/%i +Environment=PATH=/usr/local/bin:/usr/bin:/bin:/home/%i/.local/bin +ExecStart=/usr/local/bin/claude-auth-sync + +# Credential and Vault access are required; keep the remaining host surface narrow. 
+NoNewPrivileges=true +PrivateTmp=true +ProtectSystem=strict +ProtectHome=read-only +ReadWritePaths=-/home/%i/.claude -/home/%i/.claude.json -/home/%i/.config/claude-auth-sync -/home/%i/.local/state/claude-auth-sync diff --git a/scripts/claude-auth-sync@.timer b/scripts/claude-auth-sync@.timer new file mode 100644 index 00000000..b25f2ecd --- /dev/null +++ b/scripts/claude-auth-sync@.timer @@ -0,0 +1,12 @@ +[Unit] +Description=Keep Claude OAuth credentials valid and recoverable for %i + +[Timer] +OnBootSec=10m +OnUnitActiveSec=6h +Persistent=true +RandomizedDelaySec=10m +Unit=claude-auth-sync@%i.service + +[Install] +WantedBy=timers.target diff --git a/scripts/cluster_healthcheck.sh b/scripts/cluster_healthcheck.sh index 49fbc61f..51a13b5d 100755 --- a/scripts/cluster_healthcheck.sh +++ b/scripts/cluster_healthcheck.sh @@ -1093,7 +1093,7 @@ check_helm_releases() { section 18 "Helm Release Health" local releases detail="" had_issue=false status="PASS" - releases=$(helm list -A --kubeconfig "$KUBECONFIG_PATH" -o json 2>/dev/null) || { + releases=$(helm list -A -a --kubeconfig "$KUBECONFIG_PATH" -o json 2>/dev/null) || { [[ "$QUIET" == true ]] && section_always 18 "Helm Release Health" warn "Cannot list Helm releases" json_add "helm_releases" "WARN" "Cannot list" @@ -1108,9 +1108,14 @@ for r in data: name = r.get("name", "?") ns = r.get("namespace", "?") st = r.get("status", "unknown") - if st != "deployed": - level = "FAIL" if st.startswith("pending") else "WARN" - print(f"{level}:{ns}/{name}:{st}") + # helm list -a (above) surfaces pending-*/failed releases that plain + # `helm list` HIDES; a stuck pending-upgrade silently blocks every + # terragrunt apply of the stack (2026-06-16 prometheus incident, ~4 days + # of frozen monitoring config). Ignore deployed/uninstalled/superseded. 
+ if st.startswith("pending"): + print(f"FAIL:{ns}/{name}:{st}") + elif st == "failed": + print(f"WARN:{ns}/{name}:{st}") ' 2>/dev/null) || true if [[ -z "$bad_releases" ]]; then @@ -2026,11 +2031,16 @@ check_hardware_exporters() { fi done - # Check Prometheus scrape targets for hardware exporters + # Check Prometheus scrape targets for hardware exporters. + # last_over_time(up[15m]) instead of instant `up`: the redfish-idrac + # remnant scrapes every 10m (> the 5m staleness window), so an instant + # query returns it EMPTY ~half the time -> intermittent false "missing" + # (observed 2026-06-10). 15m covers the slowest job; identical answers + # for the 1-2m jobs. local prom_jobs=("snmp-idrac" "snmp-ups" "redfish-idrac" "proxmox-host") local up_result up_result=$($KUBECTL exec -n monitoring deploy/prometheus-server -- \ - wget -q -O- 'http://localhost:9090/api/v1/query?query=up' 2>/dev/null || true) + wget -q -O- 'http://localhost:9090/api/v1/query?query=last_over_time(up%5B15m%5D)' 2>/dev/null || true) if [[ -n "$up_result" ]]; then for job in "${prom_jobs[@]}"; do diff --git a/scripts/devvm-promtail.service b/scripts/devvm-promtail.service new file mode 100644 index 00000000..c3bc5c79 --- /dev/null +++ b/scripts/devvm-promtail.service @@ -0,0 +1,17 @@ +# systemd unit for promtail on the devvm (10.0.10.10). Install to +# /etc/systemd/system/promtail.service. See scripts/devvm-promtail.yaml for the full deploy. 
+[Unit] +Description=Promtail (ships devvm journal -> cluster Loki) +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/config.yml +Restart=on-failure +RestartSec=5 +User=root +Group=root + +[Install] +WantedBy=multi-user.target diff --git a/scripts/devvm-promtail.yaml b/scripts/devvm-promtail.yaml new file mode 100644 index 00000000..fac66fad --- /dev/null +++ b/scripts/devvm-promtail.yaml @@ -0,0 +1,59 @@ +# Promtail config for the devvm (10.0.10.10) — ships the systemd journal to cluster Loki. +# +# devvm is a standalone VM (NOT a k8s node), so its journal — including the t3 +# stack (t3-dispatch, t3-serve@<user>) — was never in Loki. Added 2026-06-11 for +# t3 drop forensics: t3-dispatch now logs each /ws connection's open/close with +# duration + which side hung up (downstream_closed = client/CF/Traefik went away; +# upstream_closed = t3-serve closed/stalled; graceful = clean close). Joined with +# Traefik's per-/ws duration (already in Loki) this attributes every drop to a layer. +# +# NOT Terraform-managed (devvm is outside k8s) — same hand-deployed pattern as +# scripts/pve-promtail.* and the rpi-sofia promtail. This file is source-of-truth. +# +# Deploy (on devvm, as root via sudo): +# sudo install -d -m 0755 /etc/promtail /var/lib/promtail +# sudo install -m 0644 scripts/devvm-promtail.yaml /etc/promtail/config.yml +# sudo install -m 0644 scripts/devvm-promtail.service /etc/systemd/system/promtail.service +# # Binary: grafana/loki v3.5.1 promtail-linux-amd64 -> /usr/local/bin/promtail (chmod 0755). +# sudo systemctl daemon-reload && sudo systemctl enable --now promtail +# # Loki reach: loki.viktorbarzin.lan (Technitium CNAME -> live Traefik LB; insecure cert). 
+# +# Streams produced: +# {job="devvm-journal"} — full devvm journal +# {job="devvm-journal", unit="t3-dispatch.service"} — dispatch (ws open/close lines) +# {job="devvm-journal", unit="t3-serve@wizard.service"} — per-user t3 serve +# {job="sshd-devvm"} — sshd auth lines (parity with sshd-pve) +server: + http_listen_port: 9080 + grpc_listen_port: 0 + log_level: warn + +positions: + filename: /var/lib/promtail/positions.yaml + +clients: + - url: https://loki.viktorbarzin.lan/loki/api/v1/push + tls_config: + insecure_skip_verify: true + +scrape_configs: + - job_name: journal + journal: + max_age: 12h + json: false + path: /var/log/journal + labels: + host: devvm + job: devvm-journal + relabel_configs: + - source_labels: ['__journal__systemd_unit'] + target_label: unit + - source_labels: ['__journal_priority_keyword'] + target_label: level + - source_labels: ['__journal_syslog_identifier'] + target_label: identifier + # sshd auth lines -> job=sshd-devvm (parity with the pve shipper's sshd-pve). + - source_labels: ['__journal_syslog_identifier'] + regex: 'sshd.*' + target_label: job + replacement: 'sshd-devvm' diff --git a/scripts/fail2ban-breakglass-sshd.local b/scripts/fail2ban-breakglass-sshd.local new file mode 100644 index 00000000..19066295 --- /dev/null +++ b/scripts/fail2ban-breakglass-sshd.local @@ -0,0 +1,18 @@ +# Break-glass SSH fail2ban jail (redesigned 2026-06-11). Source of truth. +# Deploy to the PVE host with: +# scp scripts/fail2ban-breakglass-sshd.local root@192.168.1.127:/etc/fail2ban/jail.d/breakglass-sshd.local +# ssh root@192.168.1.127 'systemctl restart fail2ban' +# +# GOTCHA (Debian 13 / OpenSSH 9.x): auth lines are logged under +# _COMM=sshd-session, NOT _COMM=sshd. The stock Debian jail keys journalmatch on +# `_SYSTEMD_UNIT=ssh.service + _COMM=sshd` and therefore silently NEVER bans. +# Match by unit only so both sshd and sshd-session lines are seen. Ban on both +# SSH ports (the WAN break-glass listener is :52222). 
+[sshd] +enabled = true +backend = systemd +journalmatch = _SYSTEMD_UNIT=ssh.service +port = ssh,52222 +maxretry = 4 +findtime = 10m +bantime = 1h diff --git a/scripts/fan-control.env.example b/scripts/fan-control.env.example index 3c2565c2..24394aec 100644 --- a/scripts/fan-control.env.example +++ b/scripts/fan-control.env.example @@ -1,21 +1,27 @@ -# /etc/fan-control.env — config for the fan-control daemon (chmod 600). +# /etc/fan-control.env — config for the fan-control ACTUATOR (chmod 600). # Deployed manually to the PVE host; the real file holds a secret token and is # NOT committed. Copy this template, fill HA_TOKEN, scp to /etc/fan-control.env. +# +# The control logic lives in Home Assistant (curve + bias + hysteresis + +# setpoint). This daemon only reads the HA-computed % and applies it over IPMI. # Long-lived ha-sofia access token (Home Assistant -> Profile -> Security -> -# Long-lived access tokens). Empty => presence disabled, daemon runs COOL-only. +# Long-lived access tokens). Used to read COMMAND_ENTITY. Empty/unreachable => +# the actuator hands the fans to Dell auto (it cannot compute a setpoint itself). HA_TOKEN= # --- optional overrides (defaults shown) --- # HA_URL=http://192.168.1.8:8123 -# GARAGE_ENTITY=sensor.garage_door_state_bg -# GARAGE_OPEN_STATE=Отворена -# HOLD_SECS=900 # quiet-mode hold after last garage activity (15 min) +# COMMAND_ENTITY=sensor.r730_fan_command_pct # HA-computed fan %; we only apply it +# STALE_SECS=1800 # command older than this => stale. Loose on purpose: +# # staleness only happens when CPU temp is flat (so the +# # held value is still valid); a rising temp re-renders it. 
+# HA_GRACE_SECS=300 # on a transient HA miss, HOLD the last applied % this +# # long before handing the fans to Dell auto (anti-flap) # LOOP_INTERVAL=15 -# PRESENCE_INTERVAL=30 -# DEADBAND=3 -# CEILING=83 # degC: hand back to Dell auto at/above this +# CEILING=83 # degC: hand back to Dell auto at/above this (hardware safety) # RESUME_BELOW=75 # RESUME_STABLE=120 # MAX_IPMI_FAILS=3 +# MIN_STEP=3 # smallest fan-% change worth an IPMI write (anti-jitter) PUSHGATEWAY_URL=http://10.0.20.100:30091 diff --git a/scripts/fan-control.service b/scripts/fan-control.service index f337ff4d..3b649751 100644 --- a/scripts/fan-control.service +++ b/scripts/fan-control.service @@ -1,5 +1,5 @@ [Unit] -Description=Presence-aware IPMI fan controller (Dell R730, garage) +Description=IPMI fan actuator (Dell R730) — applies the HA-computed setpoint Documentation=https://github.com/ViktorBarzin/infra/blob/master/scripts/fan-control.sh After=network-online.target Wants=network-online.target diff --git a/scripts/fan-control.sh b/scripts/fan-control.sh index 07d16fa5..6c5593f7 100644 --- a/scripts/fan-control.sh +++ b/scripts/fan-control.sh @@ -1,20 +1,23 @@ #!/usr/bin/env bash -# Presence-aware IPMI fan controller for the Dell R730 PVE host (192.168.1.127). +# IPMI fan ACTUATOR for the Dell R730 PVE host (192.168.1.127). # -# The server lives in the GARAGE (memory id=1723). Two curves, picked by -# whether someone is physically in the garage: -# - COOL : garage empty -> minimise CPU temp, noise is free. -# - QUIET : someone in the garage -> minimise noise, accept a warmer CPU. -# Presence comes from the ha-sofia garage-door sensor: door open now, OR it -# last changed within HOLD_SECS, => QUIET. Otherwise COOL. +# THIN ACTUATOR — the control logic lives entirely in Home Assistant. HA owns +# the curve thresholds, the duty %, the bias, and the final setpoint: it +# publishes ONE number, `sensor.r730_fan_command_pct` (= computed fan % incl. +# bias and any manual/lock override). 
This daemon does NOT compute anything — it +# just reads that command each loop and applies it over IPMI, and reads the raw +# sensors (temp/rpm) that feed HA/Prometheus. +# (Until 2026-06-07 the curve+hysteresis were computed HERE; moved to HA so +# all tuning + the setpoint determination happen on the dashboard.) # -# Safety (manual fan mode bypasses the iDRAC's own curve, so we backstop it): +# Safety (manual fan mode bypasses the iDRAC's own curve, so we backstop it). +# These are INDEPENDENT of HA — the actuator protects the hardware on its own: # - On ANY exit (crash/stop/TERM) the EXIT trap hands fans back to Dell -# automatic control (raw 0x30 0x30 0x01 0x01). systemd ExecStopPost -# repeats this belt-and-suspenders. +# automatic control (raw 0x30 0x30 0x01 0x01). systemd ExecStopPost repeats. # - CPU >= CEILING -> hand back to Dell auto until it recovers (RESUME_BELOW # held for RESUME_STABLE s). The firmware's own emergency cooling takes over. # - IPMI read failures (>= MAX_IPMI_FAILS) -> hand back to Dell auto. +# - HA unreachable / command missing / STALE -> hand back to Dell auto. # # Deploy: scp to /usr/local/bin/fan-control (strip .sh) + install # fan-control.service + /etc/fan-control.env. Same pattern as apply-mbps-caps. @@ -26,71 +29,38 @@ set -uo pipefail # ---- configuration (override via /etc/fan-control.env) ---- : "${IPMITOOL:=ipmitool}" -: "${LOOP_INTERVAL:=15}" # seconds between temperature decisions -: "${PRESENCE_INTERVAL:=30}" # seconds between ha-sofia garage-door polls -: "${DEADBAND:=3}" # degC hysteresis applied to downward fan steps +: "${LOOP_INTERVAL:=15}" # seconds between apply cycles : "${CEILING:=83}" # degC: hand back to Dell auto at/above this : "${RESUME_BELOW:=75}" # degC: eligible to resume manual below this... 
: "${RESUME_STABLE:=120}" # ...once held that long -: "${HOLD_SECS:=900}" # quiet-mode hold after last garage activity (15 min) : "${HA_URL:=http://192.168.1.8:8123}" -: "${HA_TOKEN:=}" # long-lived ha-sofia token; empty => presence disabled (COOL only) -: "${GARAGE_ENTITY:=sensor.garage_door_state_bg}" -: "${GARAGE_OPEN_STATE:=Отворена}" # ha state string meaning "open" -# HA control: a mode select + manual % the user drives from Home Assistant. -# auto => garage-presence curve (default); cool/quiet => force that curve; -# manual => hold MANUAL_ENTITY %. Empty HA_TOKEN or unreachable HA => auto. -: "${MODE_ENTITY:=input_select.r730_fan_mode}" -: "${MANUAL_ENTITY:=input_number.r730_fan_manual_pct}" +: "${HA_TOKEN:=}" # long-lived ha-sofia token; empty => Dell auto (no control) +: "${COMMAND_ENTITY:=sensor.r730_fan_command_pct}" # HA-computed fan %; we only apply it +: "${STALE_SECS:=1800}" # command older than this => stale. Loose on purpose: + # staleness only happens when CPU temp is flat (so the + # held value is still valid); a rising temp re-renders it. +: "${HA_GRACE_SECS:=300}" # on a transient HA miss, HOLD the last applied % for this + # long before handing the fans to Dell auto (anti-flap) : "${PUSHGATEWAY_URL:=}" # optional Prometheus Pushgateway base URL : "${MAX_IPMI_FAILS:=3}" +: "${MIN_STEP:=3}" # min fan-% change worth an IPMI write (anti-jitter) : "${DRY_RUN:=0}" # 1 => log IPMI actions instead of executing : "${RUN_ONCE:=0}" # 1 => one iteration then exit (testing) -# Continuous LINEAR fan curve (2026-06-05): fan% ramps proportionally with CPU -# temp between (T_LO,P_LO) and (T_HI,P_HI), clamped flat outside. Replaces the old -# discrete step-bands (which flapped at band edges — e.g. 45<->65%). Both modes -# reach 100% right at the 83°C ceiling. Anchors are env-tunable. -# COOL (garage empty): 30% @50°C .. 100% @83°C (~2.1%/°C; equilibrium ~60°C/~51%) -# QUIET (someone there): 20% @68°C .. 
100% @83°C (near-silent until ~70°C) -# Web-researched: a linear curve + 2-3°C hysteresis is the homelab standard; PID is -# overkill for this slow thermal loop. See docs/plans/2026-06-04-pve-fan-control-design.md. -: "${COOL_T_LO:=50}"; : "${COOL_P_LO:=30}"; : "${COOL_T_HI:=83}"; : "${COOL_P_HI:=100}" -: "${QUIET_T_LO:=68}"; : "${QUIET_P_LO:=20}"; : "${QUIET_T_HI:=83}"; : "${QUIET_P_HI:=100}" -: "${MIN_STEP:=3}" # min fan-% change worth an IPMI write (anti-jitter on the smooth curve) - log() { printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S%z')" "$*"; } # ---- pure functions (no side effects; unit-tested) ---- -# fc_curve <mode> <temp> -> fan percent (continuous linear interpolation between -# the per-mode (T_LO,P_LO)..(T_HI,P_HI) anchors; clamped flat outside the range). -fc_curve() { - local mode="$1" temp="$2" tlo plo thi phi - if [[ "$mode" == "quiet" ]]; then tlo=$QUIET_T_LO; plo=$QUIET_P_LO; thi=$QUIET_T_HI; phi=$QUIET_P_HI - else tlo=$COOL_T_LO; plo=$COOL_P_LO; thi=$COOL_T_HI; phi=$COOL_P_HI; fi - if (( temp <= tlo )); then echo "$plo"; return 0; fi - if (( temp >= thi )); then echo "$phi"; return 0; fi - echo $(( plo + ( (temp - tlo) * (phi - plo) + (thi - tlo) / 2 ) / (thi - tlo) )) # rounded +# fc_num <value> <fallback> <min> <max> -> validated integer (floats truncated; +# non-numeric => fallback; out-of-range clamped). Sanitises the HA command read. +fc_num() { + local v="${1%%.*}" fb="$2" lo="$3" hi="$4" + [[ "$v" =~ ^-?[0-9]+$ ]] || { echo "$fb"; return 0; } + (( v < lo )) && v="$lo"; (( v > hi )) && v="$hi"; echo "$v" } -# fc_decide <mode> <temp> <current_pct> <deadband> -> fan percent -# Ramps up immediately; only steps down once the curve still wants a lower -# percent even DEADBAND degrees hotter (prevents flapping at band edges). 
-fc_decide() { - local mode="$1" temp="$2" current="$3" deadband="$4" target - target="$(fc_curve "$mode" "$temp")" - if (( current < 0 || target >= current )); then echo "$target"; return 0; fi - if (( $(fc_curve "$mode" "$((temp + deadband))") < current )); then echo "$target"; else echo "$current"; fi -} - -# fc_presence_mode <state> <last_changed_epoch> <now_epoch> <hold_secs> <open_state> -> quiet|cool -fc_presence_mode() { - local state="$1" lc="$2" now="$3" hold="$4" open="$5" - if [[ "$state" == "$open" ]]; then echo "quiet"; return 0; fi - if (( now - lc < hold )); then echo "quiet"; return 0; fi - echo "cool" -} +# fc_fresh <age_secs> <max_secs> -> exit 0 if fresh (age <= max), else 1. +fc_fresh() { (( $1 <= $2 )); } # fc_parse_temp <ipmitool 'Temp' line> -> integer degC fc_parse_temp() { @@ -115,18 +85,6 @@ fc_clamp() { local p="$1"; (( p < 0 )) && p=0; (( p > 100 )) && p=100; echo "$p" # (~2W @4800rpm · ~17W @9360 · ~42W @12720 · ~99W @16920). Integer: 0.0205·(rpm/1e3)³. fc_fan_watts() { echo $(( $1 * $1 * $1 * 205 / 10000000000000 )); } -# fc_resolve <ha_mode> <temp> <manual_pct> <presence> <current> <deadband> -> pct -# HA mode resolution (the hard ceiling is handled by the caller): -# manual -> clamp(manual_pct), no hysteresis -# cool|quiet -> that curve (with hysteresis) -# auto (else) -> presence-driven curve (garage door) -fc_resolve() { - local ha_mode="$1" temp="$2" manual_pct="$3" presence="$4" current="$5" deadband="$6" - if [[ "$ha_mode" == "manual" ]]; then fc_clamp "$manual_pct"; return 0; fi - local eff; [[ "$ha_mode" == "auto" ]] && eff="$presence" || eff="$ha_mode" - fc_decide "$eff" "$temp" "$current" "$deadband" -} - # ---- side-effecting wrappers ---- ipmi_manual_on=0 @@ -151,39 +109,34 @@ read_cpu_temp() { fc_parse_temp "$("$IPMITOOL" sdr type temperature 2>/dev/null | grep -E '^Temp ' | head -1)" } -read_fan_rpm() { # Fan1 RPM — representative (all 6 fans are set together) - "$IPMITOOL" sdr type fan 2>/dev/null | awk -F'|' 
'/^Fan1/{gsub(/[^0-9]/,"",$5); print $5+0; exit}' +read_fan_rpm() { # mean RPM across all 6 chassis fans (Fan1..Fan6). All fans run + # one global duty, so the mean is representative AND a single + # stalled fan won't skew it. Telemetry only — not a control input. + "$IPMITOOL" sdr type fan 2>/dev/null | awk -F'|' ' + /^Fan[0-9]/ { gsub(/[^0-9]/, "", $5); if ($5 != "") { sum += $5; n++ } } + END { if (n > 0) printf "%d\n", (sum / n) + 0.5 }' } -presence_cache="cool"; presence_ts=0 -get_presence() { - local now; now="$(date +%s)" - if (( now - presence_ts < PRESENCE_INTERVAL )); then echo "$presence_cache"; return 0; fi - presence_ts="$now" - [[ -z "$HA_TOKEN" ]] && { echo "$presence_cache"; return 0; } - local resp state lc_iso lc_epoch - resp="$(curl -fsS --max-time 5 -H "Authorization: Bearer $HA_TOKEN" \ - "$HA_URL/api/states/$GARAGE_ENTITY" 2>/dev/null)" || { echo "$presence_cache"; return 0; } - state="$(fc_json_str_field "$resp" state)" - [[ -z "$state" ]] && { echo "$presence_cache"; return 0; } - lc_iso="$(fc_json_str_field "$resp" last_changed)" - lc_epoch="$(date -d "$lc_iso" +%s 2>/dev/null || echo "$now")" - presence_cache="$(fc_presence_mode "$state" "$lc_epoch" "$now" "$HOLD_SECS" "$GARAGE_OPEN_STATE")" - echo "$presence_cache" -} - -# ha_entity_state <entity> -> state string (empty if HA disabled/unreachable) -ha_entity_state() { +# ha_command_pct -> the HA-computed fan % (0..100 int), or EMPTY when HA is +# disabled/unreachable, the value is non-numeric, or the command is STALE +# (last_updated older than STALE_SECS). Empty => caller hands fans to Dell auto. 
+ha_command_pct() { [[ -z "$HA_TOKEN" ]] && return 0 - local resp + local resp state lu lu_epoch now resp="$(curl -fsS --max-time 5 -H "Authorization: Bearer $HA_TOKEN" \ - "$HA_URL/api/states/$1" 2>/dev/null)" || return 0 - fc_json_str_field "$resp" state + "$HA_URL/api/states/$COMMAND_ENTITY" 2>/dev/null)" || return 0 + state="$(fc_json_str_field "$resp" state)" + [[ "$state" =~ ^[0-9]+(\.[0-9]+)?$ ]] || return 0 + lu="$(fc_json_str_field "$resp" last_updated)" + lu_epoch="$(date -d "$lu" +%s 2>/dev/null || echo 0)"; now="$(date +%s)" + (( lu_epoch == 0 )) && return 0 + fc_fresh "$((now - lu_epoch))" "$STALE_SECS" || return 0 + fc_num "$state" 0 0 100 } push_metrics() { # <temp> <pct> <mode> <ha_ok> <fallback> [fan_rpm] [fan_watts_est] [[ -z "$PUSHGATEWAY_URL" ]] && return 0 - local mode_num; case "$3" in quiet) mode_num=1;; cool) mode_num=2;; manual) mode_num=3;; *) mode_num=0;; esac + local mode_num; case "$3" in applied) mode_num=2;; *) mode_num=0;; esac curl -fsS --max-time 5 --data-binary @- \ "$PUSHGATEWAY_URL/metrics/job/fan_control/instance/pve-r730" >/dev/null 2>&1 <<EOF || true # TYPE pve_fan_control_cpu_temp_celsius gauge @@ -204,10 +157,12 @@ EOF } main() { - log "fan-control start (loop=${LOOP_INTERVAL}s presence=${PRESENCE_INTERVAL}s hold=${HOLD_SECS}s ceiling=${CEILING}C dry_run=${DRY_RUN})" + log "fan-control start (actuator; loop=${LOOP_INTERVAL}s ceiling=${CEILING}C cmd=${COMMAND_ENTITY} stale=${STALE_SECS}s dry_run=${DRY_RUN})" trap 'log "exit — restoring Dell auto fan control"; restore_auto' EXIT - local current=-1 fails=0 in_fallback=0 cool_since=0 + local current=-1 fails=0 in_fallback=0 cool_since=0 ha_down=0 ha_misses=0 while true; do + local rpm fan_w; rpm="$(read_fan_rpm)"; rpm="${rpm:-0}"; fan_w="$(fc_fan_watts "$rpm")" + local temp; temp="$(read_cpu_temp)" if [[ -z "$temp" ]]; then fails=$((fails + 1)); log "WARN cannot read CPU temp ($fails/$MAX_IPMI_FAILS)" @@ -216,45 +171,53 @@ main() { fi fails=0 + # Hardware ceiling — independent 
of HA; firmware emergency cooling takes over. if (( temp >= CEILING )); then (( in_fallback == 0 )) && { log "CEILING temp=${temp}≥${CEILING} — Dell auto"; restore_auto; current=-1; in_fallback=1; } - push_metrics "$temp" 0 fallback 1 1 + push_metrics "$temp" 0 fallback 1 1 "$rpm" "$fan_w" (( RUN_ONCE == 1 )) && break || { sleep "$LOOP_INTERVAL"; continue; } fi if (( in_fallback == 1 )); then if (( temp < RESUME_BELOW )); then (( cool_since == 0 )) && cool_since="$(date +%s)" if (( $(date +%s) - cool_since >= RESUME_STABLE )); then - log "recovered (temp<${RESUME_BELOW}C ${RESUME_STABLE}s) — resuming manual"; in_fallback=0; cool_since=0 + log "recovered (temp<${RESUME_BELOW}C ${RESUME_STABLE}s) — resuming HA control"; in_fallback=0; cool_since=0 else - push_metrics "$temp" 0 fallback 1 1; (( RUN_ONCE == 1 )) && break || { sleep "$LOOP_INTERVAL"; continue; } + push_metrics "$temp" 0 fallback 1 1 "$rpm" "$fan_w"; (( RUN_ONCE == 1 )) && break || { sleep "$LOOP_INTERVAL"; continue; } fi else - cool_since=0; push_metrics "$temp" 0 fallback 1 1 + cool_since=0; push_metrics "$temp" 0 fallback 1 1 "$rpm" "$fan_w" (( RUN_ONCE == 1 )) && break || { sleep "$LOOP_INTERVAL"; continue; } fi fi - # HA-desired mode (auto/cool/quiet/manual); unreachable/unset => auto. - local ha_mode ha_ok=1; ha_mode="$(ha_entity_state "$MODE_ENTITY")"; [[ -z "$HA_TOKEN" ]] && ha_ok=0 - [[ -z "$ha_mode" ]] && ha_mode="auto" - case "$ha_mode" in auto|cool|quiet|manual) ;; *) ha_mode="auto" ;; esac - local manual_pct=0 - if [[ "$ha_mode" == "manual" ]]; then - manual_pct="$(ha_entity_state "$MANUAL_ENTITY")"; manual_pct="${manual_pct%%.*}" - [[ "$manual_pct" =~ ^[0-9]+$ ]] || manual_pct=0 + # The setpoint is whatever HA computed. No local math — just apply it. + local cmd; cmd="$(ha_command_pct)" + if [[ -z "$cmd" ]]; then + ha_misses=$((ha_misses + 1)) + if (( current >= 0 && ha_misses * LOOP_INTERVAL < HA_GRACE_SECS )); then + # Transient HA loss — HOLD the last applied %; do NOT touch the fans. 
A brief + # command blip (sensor unavailable / stale / fetch hiccup) must not dump the + # fans to Dell auto. The 83C CEILING above (our own IPMI read) is the real + # overheat safety, so holding the last good % is safe. + (( ha_misses == 1 )) && log "HA command miss — holding ${current}% (grace ${HA_GRACE_SECS}s)" + push_metrics "$temp" "$current" applied 0 0 "$rpm" "$fan_w" + else + # Sustained loss (or nothing applied yet) — hand the fans to Dell auto. + (( ha_down == 0 )) && { log "HA command lost (${ha_misses} misses) — Dell auto"; restore_auto; current=-1; ha_down=1; } + push_metrics "$temp" 0 fallback 0 1 "$rpm" "$fan_w" + fi + (( RUN_ONCE == 1 )) && break || { sleep "$LOOP_INTERVAL"; continue; } fi - local presence="cool"; [[ "$ha_mode" == "auto" ]] && presence="$(get_presence)" - local eff; if [[ "$ha_mode" == "manual" ]]; then eff="manual"; elif [[ "$ha_mode" == "auto" ]]; then eff="$presence"; else eff="$ha_mode"; fi - local pct; pct="$(fc_resolve "$ha_mode" "$temp" "$manual_pct" "$presence" "$current" "$DEADBAND")" - # Only write when first-run or the change clears MIN_STEP (kills 1-2% jitter - # on the continuous curve; fc_decide already gives asymmetric hysteresis). - if (( current < 0 || pct - current >= MIN_STEP || current - pct >= MIN_STEP )); then - if set_manual "$pct"; then log "temp=${temp}C ha_mode=${ha_mode} eff=${eff} fan=${pct}% (was ${current}%)"; current="$pct" - else log "WARN set_manual ${pct}% failed"; fi + ha_misses=0 + (( ha_down == 1 )) && { log "HA command back (${cmd}%) — resuming"; ha_down=0; } + + # Only write when first-run or the change clears MIN_STEP (kills 1-2% jitter). 
+ if (( current < 0 || cmd - current >= MIN_STEP || current - cmd >= MIN_STEP )); then + if set_manual "$cmd"; then log "temp=${temp}C cmd=${cmd}% rpm=${rpm} (was ${current}%)"; current="$cmd" + else log "WARN set_manual ${cmd}% failed"; fi fi - local rpm fan_w; rpm="$(read_fan_rpm)"; rpm="${rpm:-0}"; fan_w="$(fc_fan_watts "$rpm")" - push_metrics "$temp" "$current" "$eff" "$ha_ok" 0 "$rpm" "$fan_w" + push_metrics "$temp" "$current" applied 1 0 "$rpm" "$fan_w" (( RUN_ONCE == 1 )) && break || sleep "$LOOP_INTERVAL" done } diff --git a/scripts/nfs-mirror.sh b/scripts/nfs-mirror.sh index 2e322ede..3e293c03 100644 --- a/scripts/nfs-mirror.sh +++ b/scripts/nfs-mirror.sh @@ -54,11 +54,14 @@ PUSHGATEWAY="${NFS_MIRROR_PUSHGATEWAY:-http://10.0.20.100:30091}" PUSHGATEWAY_JOB=nfs-mirror EXCLUDES=( - # ---- /mnt/backup subtrees owned by daily-backup — leave alone ---- + # ---- /mnt/backup subtrees owned by OTHER backup jobs — leave alone ---- + # Without these, the top-level `rsync --delete /srv/nfs/ → /mnt/backup/` below + # reaps any /mnt/backup dir that has no /srv/nfs counterpart. --exclude='/pvc-data/' --exclude='/sqlite-backup/' --exclude='/pfsense/' --exclude='/pve-config/' + --exclude='/vzdump/' # VM images from vzdump-vms — NOT a /srv/nfs svc (else --delete reaps them nightly) --exclude='/lost+found/' # ---- state files used by other backup jobs ---- @@ -90,6 +93,16 @@ EXCLUDES=( --exclude='*@synoeastream' --exclude='/.DS_Store' --exclude='/Thumbs.db' + + # ---- transient SQLite sidecars (WAL mode) ---- + # Created/checkpointed/deleted constantly, so they vanish mid-rsync and trip + # exit code 24 (root cause of NfsMirrorFailing on calibre-web-automated's + # queue.db, 2026-05/06). They must NEVER be in a raw mirror anyway: a -wal/-shm + # without an atomic .db snapshot is useless to restore from. Consistent SQLite + # copies are made separately by daily-backup (SQLite backup API). 
+ --exclude='*-wal' + --exclude='*-shm' + --exclude='*-journal' ) log() { echo "[$(date -u '+%Y-%m-%dT%H:%M:%SZ')] $*" | tee -a "$LOG"; } @@ -152,7 +165,12 @@ rsync \ DST_BYTES=$(df -B1 --output=used /mnt/backup | tail -1) -if [ "$RSYNC_RC" -eq 0 ]; then +# rsync exit 24 = "some source files vanished before transfer" — benign for a +# backup mirror: everything else copied; the vanished files are transient (e.g. +# SQLite WAL/SHM, now mostly caught by the excludes above). Treat as success so +# the offsite manifest still updates and NfsMirrorFailing doesn't false-fire. +if [ "$RSYNC_RC" -eq 0 ] || [ "$RSYNC_RC" -eq 24 ]; then + [ "$RSYNC_RC" -eq 24 ] && warn "rsync exited 24 (source files vanished mid-transfer) — treating as success" # Capture files that rsync created/modified and feed them to the offsite-sync # manifest so daily Step 1 incremental picks them up tomorrow morning. # Use -cnewer (ctime), not -newer (mtime): rsync -t preserves SOURCE mtime diff --git a/scripts/offinfra-onboard b/scripts/offinfra-onboard new file mode 100755 index 00000000..ae811685 --- /dev/null +++ b/scripts/offinfra-onboard @@ -0,0 +1,255 @@ +#!/usr/bin/env bash +# offinfra-onboard — migrate a Canonical (Forgejo) repo's image build to +# GitHub Actions → ghcr.io (ADR-0002, PRD infra#10). Idempotent: re-running +# skips every already-done step. +# +# What it does: +# 1. Ensures the GitHub mirror repo exists (right visibility; unarchives). +# 2. Sets GHA secrets (WOODPECKER_TOKEN, FORGEJO_GIT_TOKEN, SLACK_WEBHOOK). +# 3. Ensures the Forgejo push-mirror (sync_on_commit) + fires an initial sync. +# 4. Registers the mirror in Woodpecker (github forge) → deploy repo id. +# 5. Renders .github/workflows/build.yml + .woodpecker/deploy.yml into the +# clone, removes the old in-cluster build pipeline. +# 6. Commits on the FORGEJO side and pushes master (this fires the chain). +# 7. Flips the GitHub default branch to master once the mirror has synced. 
+#
+# Usage:
+#   offinfra-onboard <name> --clone <path> --visibility private|public \
+#     --namespace <ns> --deploy "<deployment>=<container>[,<container>...]" \
+#     [--image <ghcr-image-name>] [--context <docker-context>] [--dockerfile <path>] \
+#     [--test-steps <yaml-file>] [--dry-run]
+#
+# --deploy is repeatable. --test-steps points at a YAML fragment of extra
+# steps for the lint-and-test job (indented for a `steps:` list); omitted =
+# a no-op test job.
+set -euo pipefail
+
+SCRIPT_DIR=$(cd "$(dirname "$0")" && pwd)
+TEMPLATES="$SCRIPT_DIR/offinfra-templates"
+GH_OWNER="ViktorBarzin"
+FORGEJO_HOST="forgejo.viktorbarzin.me"
+FORGEJO_LB="10.0.20.203"
+WP_API="https://ci.viktorbarzin.me/api"
+
+NAME=${1:?usage: offinfra-onboard <name> [flags]}; shift
+CLONE="" VISIBILITY="" NAMESPACE="" IMAGE="" CONTEXT="." DOCKERFILE="Dockerfile" TEST_STEPS_FILE="" DRY_RUN=0 NO_DEPLOY=0
+DEPLOYS=()
+while [ $# -gt 0 ]; do
+  case "$1" in
+    --clone) CLONE=$2; shift 2;;
+    --visibility) VISIBILITY=$2; shift 2;;
+    --namespace) NAMESPACE=$2; shift 2;;
+    --image) IMAGE=$2; shift 2;;
+    --context) CONTEXT=$2; shift 2;;
+    --dockerfile) DOCKERFILE=$2; shift 2;;
+    --deploy) DEPLOYS+=("$2"); shift 2;;
+    --test-steps) TEST_STEPS_FILE=$2; shift 2;;
+    --no-deploy) NO_DEPLOY=1; shift;;
+    --dry-run) DRY_RUN=1; shift;;
+    *) echo "unknown flag: $1" >&2; exit 1;;
+  esac
+done
+IMAGE=${IMAGE:-$NAME}
+[ -n "$CLONE" ] && [ -d "$CLONE/.git" ] || { echo "--clone must point at a git clone" >&2; exit 1; }
+[ "$VISIBILITY" = "private" ] || [ "$VISIBILITY" = "public" ] || { echo "--visibility private|public" >&2; exit 1; }
+[ -n "$NAMESPACE" ] || { echo "--namespace required" >&2; exit 1; }
+[ ${#DEPLOYS[@]} -gt 0 ] || [ "$NO_DEPLOY" = 1 ] || { echo "at least one --deploy required (or --no-deploy for CronJob-only repos)" >&2; exit 1; }
+
+log() { printf '\033[1m[%s]\033[0m %s\n' "$NAME" "$*"; }
+run() { if [ "$DRY_RUN" = 1 ]; then echo "DRY: $*"; else "$@"; fi; }
+
+# --- credentials ---
+export VAULT_ADDR=${VAULT_ADDR:-https://vault.viktorbarzin.me}
+WP_TOKEN=$(vault kv get -field=woodpecker_api_token secret/ci/global)
+SLACK_WEBHOOK=$(vault kv get -field=slack_webhook secret/ci/global)
+GH_PAT=$(vault kv get -field=github_pat secret/viktor)
+# Forgejo token: from the clone's forgejo remote URL (the documented contract)
+FORGEJO_REMOTE=$(git -C "$CLONE" remote -v | awk -v h="$FORGEJO_HOST" '$2 ~ h && $3 == "(push)" {print $1; exit}')
+[ -n "$FORGEJO_REMOTE" ] || { echo "no forgejo remote in $CLONE" >&2; exit 1; }
+FORGEJO_TOKEN=$(git -C "$CLONE" remote get-url "$FORGEJO_REMOTE" | sed -n 's#https://[^:]*:\([^@]*\)@.*#\1#p')
+# Fallback: clones using the credential-store helper carry no token in the URL
+[ -n "$FORGEJO_TOKEN" ] || FORGEJO_TOKEN=$(sed -n "s#https://[^:]*:\([^@]*\)@$FORGEJO_HOST.*#\1#p" ~/.git-credentials 2>/dev/null | head -1)
+[ -n "$FORGEJO_TOKEN" ] || { echo "could not extract forgejo token (remote URL or ~/.git-credentials)" >&2; exit 1; }
+
+FJ() { curl -sf --resolve "$FORGEJO_HOST:443:$FORGEJO_LB" -H "Authorization: token $FORGEJO_TOKEN" -H 'Content-Type: application/json' "$@"; }
+
+# Clone must be clean and CURRENT — the final push goes straight to master
+# (hit live on job-hunter: stale clone -> rejected push + rebase conflict).
+[ -z "$(git -C "$CLONE" status --porcelain)" ] || { echo "clone $CLONE is dirty — commit/stash first" >&2; exit 1; }
+git -C "$CLONE" fetch "$FORGEJO_REMOTE" >/dev/null 2>&1
+git -C "$CLONE" merge --ff-only "$FORGEJO_REMOTE/master" >/dev/null 2>&1 || { echo "clone $CLONE could not fast-forward to $FORGEJO_REMOTE/master" >&2; exit 1; }
+WP() { curl -sf -H "Authorization: Bearer $WP_TOKEN" -H 'Content-Type: application/json' "$@"; }
+
+# --- 1) GitHub mirror repo ---
+if state=$(gh api "repos/$GH_OWNER/$NAME" --jq '{archived,private}' 2>/dev/null); then
+  archived=$(jq -r .archived <<<"$state"); private=$(jq -r .private <<<"$state")
+  if [ "$archived" = "true" ]; then
+    log "GitHub repo exists but is ARCHIVED — unarchiving"
+    run gh api -X PATCH "repos/$GH_OWNER/$NAME" -F archived=false >/dev/null
+  fi
+  want_private=$([ "$VISIBILITY" = private ] && echo true || echo false)
+  if [ "$private" != "$want_private" ]; then
+    log "setting visibility -> $VISIBILITY"
+    run gh api -X PATCH "repos/$GH_OWNER/$NAME" -F private="$want_private" >/dev/null
+  else
+    log "GitHub repo visibility already $VISIBILITY — SKIP"
+  fi
+else
+  log "creating GitHub mirror repo ($VISIBILITY)"
+  run gh repo create "$GH_OWNER/$NAME" "--$VISIBILITY" \
+    --description "One-way mirror of forgejo viktor/$NAME — do NOT commit here (ADR-0002)" >/dev/null
+fi
+
+# --- 2) GHA secrets ---
+log "setting GHA secrets (WOODPECKER_TOKEN, FORGEJO_GIT_TOKEN, SLACK_WEBHOOK)"
+if [ "$DRY_RUN" = 0 ]; then
+  gh secret set WOODPECKER_TOKEN -R "$GH_OWNER/$NAME" --body "$WP_TOKEN"
+  gh secret set FORGEJO_GIT_TOKEN -R "$GH_OWNER/$NAME" --body "$FORGEJO_TOKEN"
+  gh secret set SLACK_WEBHOOK -R "$GH_OWNER/$NAME" --body "$SLACK_WEBHOOK"
+fi
+
+# --- 3) Forgejo push-mirror ---
+mirrors=$(FJ "https://$FORGEJO_HOST/api/v1/repos/viktor/$NAME/push_mirrors" || echo '[]')
+if printf '%s' "$mirrors" | jq -e --arg a "github.com/$GH_OWNER/$NAME" '.[] | select(.remote_address | contains($a))' >/dev/null 2>&1; then
+  log "push-mirror already configured — SKIP"
+else
+  log "creating push-mirror -> github.com/$GH_OWNER/$NAME (sync_on_commit)"
+  run FJ -X POST "https://$FORGEJO_HOST/api/v1/repos/viktor/$NAME/push_mirrors" \
+    -d "{\"remote_address\":\"https://github.com/$GH_OWNER/$NAME.git\",\"remote_username\":\"$GH_OWNER\",\"remote_password\":$(jq -Rn --arg p "$GH_PAT" '$p'),\"interval\":\"8h0m0s\",\"sync_on_commit\":true}" >/dev/null
+fi
+log "firing initial mirror sync"
+run FJ -X POST "https://$FORGEJO_HOST/api/v1/repos/viktor/$NAME/push_mirrors-sync" >/dev/null || true
+
+# --- 4) Woodpecker registration (github forge) ---
+if [ "$NO_DEPLOY" = 1 ]; then
+  log "--no-deploy: skipping Woodpecker registration (CronJob-only; :latest+Always picks up builds)"
+  WP_REPO_ID="0"
+else
+WP_ROW=$(WP "$WP_API/repos?perPage=100" | jq -c --arg n "$GH_OWNER/$NAME" '[.[] | select(.full_name == $n)] | first // empty')
+WP_REPO_ID=$(jq -r '.id // empty' <<<"$WP_ROW")
+if [ -n "$WP_REPO_ID" ] && [ "$(jq -r .active <<<"$WP_ROW")" = "true" ]; then
+  log "Woodpecker repo already registered + active (id=$WP_REPO_ID) — SKIP"
+elif [ -n "$WP_REPO_ID" ]; then
+  # Registered but INACTIVE (e.g. the old GHA-era registration was
+  # deactivated — hit live on f1-stream, repo 10): re-activate in place.
+ GH_REPO_ID=$(gh api "repos/$GH_OWNER/$NAME" --jq .id) + log "Woodpecker repo $WP_REPO_ID exists but is INACTIVE — re-activating" + run WP -X POST "$WP_API/repos?forge_remote_id=$GH_REPO_ID" >/dev/null +else + GH_REPO_ID=$(gh api "repos/$GH_OWNER/$NAME" --jq .id) + log "registering mirror in Woodpecker (forge_remote_id=$GH_REPO_ID)" + if [ "$DRY_RUN" = 0 ]; then + WP_REPO_ID=$(WP -X POST "$WP_API/repos?forge_remote_id=$GH_REPO_ID" | jq -r .id) + else + WP_REPO_ID="DRY" + fi + log "Woodpecker repo id = $WP_REPO_ID" +fi + +# Normalize repo settings: TRUSTED repos get netrc injected into EVERY step +# container; bitnami/kubectl (non-root, HOME=/) then dies with +# "//.netrc: Permission denied" (hit live on f1-stream repo 10, an old-era +# registration that carried trusted=true; tripit 167 is untrusted and works). +if [ "$DRY_RUN" = 0 ]; then + run WP -X PATCH "$WP_API/repos/$WP_REPO_ID" \ + -d '{"trusted":{"network":false,"volumes":false,"security":false}}' >/dev/null \ + && log "Woodpecker repo settings normalized (untrusted)" +fi +fi + +# --- 5) Render workflow + deploy files into the clone --- +DEPLOY_CMDS="" +for d in "${DEPLOYS[@]}"; do + dep=${d%%=*}; containers=${d#*=} + setargs="" + IFS=',' read -ra cs <<<"$containers" + for c in "${cs[@]}"; do setargs="$setargs $c=\${IMAGE_NAME}:\${IMAGE_TAG}"; done + DEPLOY_CMDS="$DEPLOY_CMDS - \"kubectl -n $NAMESPACE set image deployment/$dep$setargs\"\n" + DEPLOY_CMDS="$DEPLOY_CMDS - \"kubectl -n $NAMESPACE rollout status deployment/$dep --timeout=300s\"\n" +done +if [ -n "$TEST_STEPS_FILE" ]; then + TEST_STEPS=$(cat "$TEST_STEPS_FILE") +else + TEST_STEPS=' - run: echo "no test steps configured"' +fi + +export T_NAME=$NAME T_IMAGE=$IMAGE T_CONTEXT=$CONTEXT T_DOCKERFILE=$DOCKERFILE T_WPID=$WP_REPO_ID T_TEST="$TEST_STEPS" T_DEPLOY="$DEPLOY_CMDS" +render() { # $1=template $2=dest + python3 - "$1" "$2" <<'PYEOF' +import os, sys +src, dst = sys.argv[1], sys.argv[2] +s = open(src).read() +s = s.replace('{{NAME}}', 
os.environ['T_NAME']) +s = s.replace('{{IMAGE}}', os.environ['T_IMAGE']) +s = s.replace('{{CONTEXT}}', os.environ['T_CONTEXT']) +s = s.replace('{{DOCKERFILE}}', os.environ['T_DOCKERFILE']) +s = s.replace('{{WP_REPO_ID}}', os.environ['T_WPID']) +s = s.replace('{{TEST_STEPS}}', os.environ['T_TEST']) +s = s.replace('{{DEPLOY_CMDS}}', os.environ['T_DEPLOY'].replace('\\n', '\n').rstrip('\n')) +os.makedirs(os.path.dirname(dst), exist_ok=True) +open(dst, 'w').write(s) +PYEOF +} +log "rendering build.yml$([ "$NO_DEPLOY" = 1 ] && echo ' (no deploy job)' || echo ' + deploy.yml')" +if [ "$DRY_RUN" = 0 ]; then + render "$TEMPLATES/build.yml.tmpl" "$CLONE/.github/workflows/build.yml" + if [ "$NO_DEPLOY" = 1 ]; then + # CronJob-only: drop the deploy job (everything from " deploy:" to the + # notify job) — :latest+Always CronJobs pick up new builds on next run. + python3 - "$CLONE/.github/workflows/build.yml" <<'PYDEL' +import sys +p=sys.argv[1]; lines=open(p).read().split("\n") +out=[]; skip=False +for l in lines: + if l.rstrip() == " deploy:": skip=True + if l.rstrip() == " notify-failure:": skip=False + if not skip: out.append(l) +open(p,"w").write("\n".join(out).replace("needs: [lint-and-test, build, deploy]","needs: [lint-and-test, build]")) +PYDEL + rm -f "$CLONE/.woodpecker/deploy.yml" + else + render "$TEMPLATES/deploy.yml.tmpl" "$CLONE/.woodpecker/deploy.yml" + fi +fi + +# --- 6) Remove old in-cluster build pipeline + commit on Forgejo side --- +cd "$CLONE" +OLD_REMOVED="" +for f in .woodpecker.yml .woodpecker/build.yml .woodpecker/build-fallback.yml; do + [ -f "$f" ] && { run git rm -q "$f"; OLD_REMOVED="$OLD_REMOVED $f"; } +done +run git add .github/workflows/build.yml +[ -f .woodpecker/deploy.yml ] && run git add .woodpecker/deploy.yml +if git diff --cached --quiet 2>/dev/null; then + log "no changes to commit — SKIP (already migrated)" +else + log "committing + pushing to forgejo master (this fires the chain)" + run git commit -q -m "ci: move image build off-infra to 
GHA -> ghcr (ADR-0002) + +Generated by infra/scripts/offinfra-onboard: GHA builds+tests on the +GitHub mirror, pushes ghcr.io/viktorbarzin/$IMAGE, then triggers the +Woodpecker deploy (repo $WP_REPO_ID). Old in-cluster build pipeline +removed:$OLD_REMOVED + +Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>" + run git push "$FORGEJO_REMOTE" master +fi + +# --- 7) GitHub default branch -> master (after mirror sync) --- +if [ "$DRY_RUN" = 0 ]; then + for i in $(seq 1 24); do + if gh api "repos/$GH_OWNER/$NAME/branches/master" >/dev/null 2>&1; then + cur=$(gh api "repos/$GH_OWNER/$NAME" --jq .default_branch) + [ "$cur" != "master" ] && gh api -X PATCH "repos/$GH_OWNER/$NAME" -F default_branch=master >/dev/null && log "default branch -> master" + break + fi + sleep 5 + done +fi + +log "DONE. Verify the chain:" +echo " - GHA run: gh run list -R $GH_OWNER/$NAME --limit 3" +echo " - ghcr tags: (token exchange) https://ghcr.io/v2/viktorbarzin/$IMAGE/tags/list" +echo " - WP deploy: $WP_API/repos/$WP_REPO_ID/pipelines" +echo " - rollout: kubectl -n $NAMESPACE get deploy" +echo " - pull secret: ensure the Deployment carries ghcr-credentials (Kyverno allowlist + stack imagePullSecrets)" diff --git a/scripts/offinfra-templates/build.yml.tmpl b/scripts/offinfra-templates/build.yml.tmpl new file mode 100644 index 00000000..ff6712e7 --- /dev/null +++ b/scripts/offinfra-templates/build.yml.tmpl @@ -0,0 +1,116 @@ +name: Build and Push + +# Off-infra build (ADR-0002). Canonical repo is Forgejo viktor/{{NAME}}, which +# push-mirrors here; this workflow builds on GitHub-hosted runners, pushes the +# image to GHCR, then signals the Woodpecker deploy pipeline (repo {{WP_REPO_ID}}) +# to roll the cluster — the homelab never sees build IO or registry pushes. +# +# Committed on the FORGEJO side (the mirror is one-way; commits made on GitHub +# are overwritten by the next sync). Generated by infra/scripts/offinfra-onboard. 
+on: + push: + branches: [master] + workflow_dispatch: {} + +permissions: + contents: read + packages: write + +jobs: + lint-and-test: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v6 +{{TEST_STEPS}} + + build: + needs: lint-and-test + runs-on: ubuntu-latest + outputs: + image_tag: ${{ steps.meta.outputs.sha }} + steps: + - uses: actions/checkout@v6 + with: + fetch-depth: 0 # full history + tags so svu sees the last vX.Y.Z + fetch-tags: true + # Auto-semver (svu): tag-only, pushed to CANONICAL Forgejo (GitHub tags + # would be wiped by the next mirror sync). Best-effort: never blocks the build. + - name: Compute + tag semver (svu) + env: + FORGEJO_GIT_TOKEN: ${{ secrets.FORGEJO_GIT_TOKEN }} + run: | + set +e + git config user.email "ci@viktorbarzin.me" + git config user.name "{{NAME}}-ci" + git config --global --add safe.directory "$GITHUB_WORKSPACE" + curl -sSL https://github.com/caarlos0/svu/releases/download/v3.4.1/svu_3.4.1_linux_amd64.tar.gz | tar -xz svu + CUR=$(./svu current 2>/dev/null) + NEXT=$(./svu next 2>/dev/null) + echo "svu current=[$CUR] next=[$NEXT]" + if [ -n "$NEXT" ] && [ "$NEXT" != "$CUR" ]; then + git tag "$NEXT" 2>/dev/null + git push "https://viktor:${FORGEJO_GIT_TOKEN}@forgejo.viktorbarzin.me/viktor/{{NAME}}.git" "$NEXT" && echo "pushed tag $NEXT to forgejo" || echo "tag push failed (non-blocking)" + fi + exit 0 + - uses: docker/setup-buildx-action@v4 + - uses: docker/login-action@v4 + with: + registry: ghcr.io + username: ${{ github.actor }} + password: ${{ secrets.GITHUB_TOKEN }} + - id: meta + run: echo "sha=$(echo ${{ github.sha }} | cut -c1-8)" >> "$GITHUB_OUTPUT" + - uses: docker/build-push-action@v7 + with: + context: {{CONTEXT}} + file: {{DOCKERFILE}} + push: true + platforms: linux/amd64 + # Single-manifest images (no provenance/SBOM attestation children) so + # registry retention can never orphan index children (ADR-0002). 
+ provenance: false + tags: | + ghcr.io/viktorbarzin/{{IMAGE}}:${{ steps.meta.outputs.sha }} + ghcr.io/viktorbarzin/{{IMAGE}}:latest + cache-from: type=gha + cache-to: type=gha,mode=max + # Keep the newest ~10 versions on ghcr (latest rides the newest one). + - name: ghcr retention (keep 10) + uses: actions/delete-package-versions@v5 + continue-on-error: true + with: + package-name: {{IMAGE}} + package-type: container + min-versions-to-keep: 10 + + deploy: + needs: build + runs-on: ubuntu-latest + steps: + # Signal Woodpecker (repo {{WP_REPO_ID}} = ViktorBarzin/{{NAME}} mirror) to run + # .woodpecker/deploy.yml — kubectl set image in-cluster (agent SA is cluster-admin). + - name: Trigger Woodpecker deploy + run: | + for attempt in 1 2 3; do + STATUS=$(curl -s -o /dev/null -w "%{http_code}" -X POST \ + "https://ci.viktorbarzin.me/api/repos/{{WP_REPO_ID}}/pipelines" \ + -H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}" \ + -H "Content-Type: application/json" \ + -d "{\"branch\":\"master\",\"variables\":{\"IMAGE_TAG\":\"${{ needs.build.outputs.image_tag }}\",\"IMAGE_NAME\":\"ghcr.io/viktorbarzin/{{IMAGE}}\"}}") + if [ "$STATUS" -ge 200 ] && [ "$STATUS" -lt 300 ]; then + echo "Woodpecker deploy triggered (HTTP $STATUS)"; exit 0 + fi + echo "Attempt $attempt failed (HTTP $STATUS), retrying in 30s..."; sleep 30 + done + echo "Failed to trigger Woodpecker deploy after 3 attempts"; exit 1 + + notify-failure: + needs: [lint-and-test, build, deploy] + if: failure() + runs-on: ubuntu-latest + steps: + - name: Slack notify + run: | + curl -sf -X POST -H 'Content-Type: application/json' \ + -d "{\"text\":\":rotating_light: {{NAME}} off-infra build FAILED: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}\"}" \ + "${{ secrets.SLACK_WEBHOOK }}" || true diff --git a/scripts/offinfra-templates/deploy.yml.tmpl b/scripts/offinfra-templates/deploy.yml.tmpl new file mode 100644 index 00000000..15d5146e --- /dev/null +++ 
b/scripts/offinfra-templates/deploy.yml.tmpl @@ -0,0 +1,19 @@ +# Auto-deploy, triggered ONLY by the GitHub Actions build POSTing to the +# Woodpecker API (manual event, with IMAGE_TAG + IMAGE_NAME) after a successful +# off-infra build+push to GHCR (ADR-0002). event:[manual] (NOT push) so the +# Forgejo->GitHub mirror's raw pushes don't fire a spurious deploy. +# The woodpecker-agent SA is cluster-admin — no kubeconfig needed. +# Generated by infra/scripts/offinfra-onboard. +when: + - event: manual + +steps: + - name: check-vars + image: alpine + commands: + - "[ -n \"$IMAGE_TAG\" ] || (echo 'IMAGE_TAG not set — refusing to deploy'; exit 1)" + + - name: deploy + image: bitnami/kubectl:latest + commands: +{{DEPLOY_CMDS}} diff --git a/scripts/pfsense-haproxy-bootstrap.php b/scripts/pfsense-haproxy-bootstrap.php index 5452b198..5b9119b2 100644 --- a/scripts/pfsense-haproxy-bootstrap.php +++ b/scripts/pfsense-haproxy-bootstrap.php @@ -45,6 +45,8 @@ $h['maxconn'] = '1000'; // Our declared object names (anything starting with mailserver_ is ours) $POOL_NAMES = [ + 'webgui_traefik_443', // SNI-routed 443: hostname traffic -> Traefik + 'pfsense_webgui_8443', // SNI-routed 443: no-SNI / pfsense.* -> webgui 'mailserver_nodes', // legacy (Phase 2/3 test) 'mailserver_nodes_smtp', 'mailserver_nodes_smtps', @@ -52,6 +54,7 @@ $POOL_NAMES = [ 'mailserver_nodes_imaps', ]; $FRONTEND_NAMES = [ + 'internal_https_443', // SNI-routed internal 443 (2026-06-10) 'mailserver_proxy_test', // legacy (Phase 2/3 test, :2525) 'mailserver_proxy_25', 'mailserver_proxy_465', @@ -185,6 +188,58 @@ $h['ha_pools']['item'][] = build_pool('mailserver_nodes_smtps', '30126', $NODES, $h['ha_pools']['item'][] = build_pool('mailserver_nodes_sub', '30127', $NODES, 'TCP', '30147'); $h['ha_pools']['item'][] = build_pool('mailserver_nodes_imaps', '30128', $NODES); +// ── SNI-routed internal :443 pools (2026-06-10) ───────────────────────── +// Completes the internal port table of 10.0.20.1 so mail.viktorbarzin.me 
+// (internal A record -> 10.0.20.1) serves webmail too. Routing rule +// (Viktor's design): TLS with a hostname (SNI present) -> Traefik; bare-IP +// /no-SNI (admin hitting https://10.0.20.1) -> pfSense webgui, which moved +// to :8443 to free the socket. pfsense.viktorbarzin.{lan,me} SNI is +// excepted back to the webgui. Traefik leg mirrors the IPv6 bridge: +// send-proxy-v2 (Traefik trusts 10.0.20.1), NO health check (PROXY- +// expecting receivers reject bare probes — see runbook gotcha). +$h['ha_pools']['item'][] = [ + 'name' => 'webgui_traefik_443', + 'balance' => '', + 'check_type' => 'none', + 'monitor_domain' => '', + 'checkinter' => '', + 'retries' => '', + 'ha_servers' => ['item' => [[ + 'name' => 'traefik', + 'address' => '10.0.20.203', + 'port' => '443', + 'weight' => '10', + 'ssl' => '', + 'advanced' => 'send-proxy-v2', + 'status' => 'active', + ]]], + 'advanced_bind' => '', + 'persist_cookie_enabled' => '', + 'transparent_clientip' => '', + 'advanced' => '', +]; +$h['ha_pools']['item'][] = [ + 'name' => 'pfsense_webgui_8443', + 'balance' => '', + 'check_type' => 'none', + 'monitor_domain' => '', + 'checkinter' => '', + 'retries' => '', + 'ha_servers' => ['item' => [[ + 'name' => 'webgui', + 'address' => '127.0.0.1', + 'port' => '8443', + 'weight' => '10', + 'ssl' => '', + 'advanced' => '', + 'status' => 'active', + ]]], + 'advanced_bind' => '', + 'persist_cookie_enabled' => '', + 'transparent_clientip' => '', + 'advanced' => '', +]; + // ── Frontends ─────────────────────────────────────────────────────────── if (!is_array($h['ha_backends'])) $h['ha_backends'] = ['item' => []]; if (!is_array($h['ha_backends']['item'])) $h['ha_backends']['item'] = []; @@ -228,7 +283,36 @@ $h['ha_backends']['item'][] = build_frontend( 'mailserver_nodes_imaps' ); -write_config('code-yiu: mailserver HAProxy — 4 production frontends + legacy :2525 test'); +// ── SNI-routed internal :443 frontend (2026-06-10) ────────────────────── +// Binds both internal interface IPs 
so IP-based GUI access works from +// either VLAN. mode tcp + SNI inspection; TLS passthrough on both legs +// (Traefik serves the real certs; the webgui keeps its self-signed one). +$h['ha_backends']['item'][] = [ + 'name' => 'internal_https_443', + 'descr' => 'SNI-routed internal 443: hostname->Traefik (proxy-v2), no-SNI/pfsense.*->webgui:8443', + 'status' => 'active', + 'secondary' => '', + 'type' => 'tcp', + 'a_extaddr' => ['item' => [ + ['extaddr' => 'custom', 'extaddr_custom' => '10.0.20.1', 'extaddr_port' => '443', 'extaddr_ssl' => '', 'extaddr_advanced' => ''], + ['extaddr' => 'custom', 'extaddr_custom' => '10.0.10.1', 'extaddr_port' => '443', 'extaddr_ssl' => '', 'extaddr_advanced' => ''], + ]], + 'backend_serverpool' => 'pfsense_webgui_8443', + 'ha_acls' => ['item' => [ + ['name' => 'sni_pfsense', 'expression' => 'custom', 'value' => 'req.ssl_sni -i -m str pfsense.viktorbarzin.lan pfsense.viktorbarzin.me', 'casesensitive' => '', 'not' => ''], + ['name' => 'sni_any', 'expression' => 'custom', 'value' => 'req.ssl_sni -m found', 'casesensitive' => '', 'not' => ''], + ]], + 'a_actionitems' => ['item' => [ + ['action' => 'use_backend', 'use_backendbackend' => 'pfsense_webgui_8443', 'acl' => 'sni_pfsense'], + ['action' => 'use_backend', 'use_backendbackend' => 'webgui_traefik_443', 'acl' => 'sni_any'], + ]], + 'dontlognull'=> '', + 'httpclose' => '', + 'forwardfor' => '', + 'advanced' => base64_encode("tcp-request inspect-delay 5s\n\ttcp-request content accept if { req.ssl_hello_type 1 } || !{ req.ssl_hello_type 1 }"), +]; + +write_config('mailserver HAProxy + SNI-routed internal 443 (hostname->Traefik, no-SNI->webgui:8443)'); $messages = ''; $rc = haproxy_check_and_run($messages, true); diff --git a/scripts/publish-gate b/scripts/publish-gate new file mode 100755 index 00000000..0bbb6fb5 --- /dev/null +++ b/scripts/publish-gate @@ -0,0 +1,64 @@ +#!/usr/bin/env bash +# publish-gate — gate a Canonical repo's PUBLIC flip (ADR-0002). 
+# A repo may go public ONLY on a CLEAN verdict; a DIRTY verdict means it stays +# private — canonical history is never rewritten for publication. +# +# Checks (full git history, not just the worktree): +# 1. gitleaks — secret patterns across all commits +# 2. trufflehog (docker) — verified-credential detection across all commits +# 3. PII heuristics — emails/phones/keys in tracked files + fixture inventory +# +# Usage: publish-gate <clone-path> +# Exit: 0 = CLEAN, 1 = DIRTY, 2 = scanner error. Report: /tmp/publish-gate-<name>.txt +set -uo pipefail +CLONE=${1:?usage: publish-gate <clone-path>} +CLONE=$(cd "$CLONE" && pwd) +NAME=$(basename "$CLONE") +REPORT="/tmp/publish-gate-$NAME.txt" +DIRTY=0; ERR=0 + +say() { echo "$@" | tee -a "$REPORT"; } +: > "$REPORT" +say "== publish-gate: $NAME @ $(git -C "$CLONE" rev-parse --short HEAD) ($(date -u +%FT%TZ)) ==" + +# --- 1. gitleaks (full history) --- +say ""; say "-- gitleaks (full history) --" +if gitleaks git "$CLONE" --no-banner --redact --report-path /tmp/publish-gate-$NAME-gitleaks.json >>"$REPORT" 2>&1; then + say "gitleaks: CLEAN" +else + rc=$? + if [ "$rc" = 1 ]; then say "gitleaks: LEAKS FOUND (see $REPORT + json)"; DIRTY=1 + else say "gitleaks: scanner error rc=$rc"; ERR=1; fi +fi + +# --- 2. trufflehog (verified credentials, full history) --- +say ""; say "-- trufflehog (verified only, full history) --" +if docker run --rm -v "$CLONE":/repo:ro trufflesecurity/trufflehog:latest \ + git file:///repo --only-verified --fail --no-update >>"$REPORT" 2>&1; then + say "trufflehog: CLEAN (no verified credentials)" +else + rc=$? + if [ "$rc" = 183 ]; then say "trufflehog: VERIFIED CREDENTIALS FOUND"; DIRTY=1 + else say "trufflehog: scanner error rc=$rc"; ERR=1; fi +fi + +# --- 3. 
PII heuristics on tracked files --- +say ""; say "-- PII heuristics (tracked files) --" +cd "$CLONE" +EMAILS=$(git grep -hoiE '[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}' -- ':!*.lock' ':!package-lock.json' ':!pnpm-lock.yaml' ':!.beads' 2>/dev/null \ + | grep -viE '@(viktorbarzin\.me|meta\.com|example\.(com|org|test)|test\.(com|local)|localhost|users\.noreply\.github\.com|googlegroups\.com)' \ + | grep -viE '^(noreply|no-reply|ci|admin|info|support|hello|user|foo|bar|test.*|licensing|legal|security|sales)@' \ + | sort -u | head -20) +if [ -n "$EMAILS" ]; then say "real-looking emails found:"; say "$EMAILS"; say "(review: PII?)"; DIRTY=1; else say "emails: none beyond allowlist"; fi +KEYS=$(git grep -l 'BEGIN.*PRIVATE KEY' 2>/dev/null | head -5) +[ -n "$KEYS" ] && { say "PRIVATE KEY blocks in: $KEYS"; DIRTY=1; } || say "private keys: none" +ENVF=$(git ls-files | grep -E '(^|/)\.env($|\.)' | head -5) +[ -n "$ENVF" ] && { say "committed .env files: $ENVF (review)"; DIRTY=1; } || say ".env files: none" +FIXTURES=$(git ls-files | grep -iE '(fixtures?|testdata|tests?/data|^\.beads)/' | head -10) +if [ -n "$FIXTURES" ]; then say "fixture files present (eyeball for PII):"; say "$FIXTURES"; else say "fixtures: none"; fi + +say "" +if [ "$ERR" = 1 ]; then say "VERDICT: ERROR (scanner failed — fix and re-run)"; exit 2; fi +if [ "$DIRTY" = 1 ]; then say "VERDICT: DIRTY — repo stays PRIVATE (do not rewrite history)"; exit 1; fi +say "VERDICT: CLEAN — public flip approved" +exit 0 diff --git a/scripts/pve-promtail.service b/scripts/pve-promtail.service new file mode 100644 index 00000000..0b288bfc --- /dev/null +++ b/scripts/pve-promtail.service @@ -0,0 +1,17 @@ +# systemd unit for promtail on the PVE host (192.168.1.127). Install to +# /etc/systemd/system/promtail.service. See scripts/pve-promtail.yaml for the full deploy. 
+[Unit] +Description=Promtail (ships PVE host journal -> cluster Loki) +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/config.yml +Restart=on-failure +RestartSec=5 +User=root +Group=root + +[Install] +WantedBy=multi-user.target diff --git a/scripts/pve-promtail.yaml b/scripts/pve-promtail.yaml new file mode 100644 index 00000000..b923e98c --- /dev/null +++ b/scripts/pve-promtail.yaml @@ -0,0 +1,53 @@ +# Promtail config for the PVE host (192.168.1.127) — ships the systemd journal to cluster Loki. +# +# NOT Terraform-managed (the PVE host is the hypervisor, outside k8s). Deployed by hand, +# same pattern as scripts/fan-control.* and the rpi-sofia promtail. This file is source-of-truth. +# +# Deploy: +# scp scripts/pve-promtail.yaml root@192.168.1.127:/etc/promtail/config.yml +# scp scripts/pve-promtail.service root@192.168.1.127:/etc/systemd/system/promtail.service +# ssh root@192.168.1.127 'mkdir -p /var/lib/promtail && systemctl daemon-reload && systemctl enable --now promtail' +# # Binary: grafana/loki v3.5.1 promtail-linux-amd64 -> /usr/local/bin/promtail (chmod 0755). +# # Loki reach: loki.viktorbarzin.lan resolves via a Technitium CNAME -> ingress.viktorbarzin.lan +# # (registered 2026-06-10 via the Technitium API; auto-tracks the live Traefik LB IP, AXFR'd to all +# # 3 instances). NO /etc/hosts pin. insecure_skip_verify stays — the internal .lan cert isn't trusted. 
+# +# Streams produced: +# {job="pve-journal"} — full host journal (filter identifier="snoopy" for the command audit) +# {job="sshd-pve"} — sshd auth lines; feeds the Loki S1 security rule (docs/architecture/security.md) +# {job="pve-journal", identifier="snoopy"} — snoopy command audit (every execve on the host; see scripts/pve-snoopy.ini) +server: + http_listen_port: 9080 + grpc_listen_port: 0 + log_level: warn + +positions: + filename: /var/lib/promtail/positions.yaml + +clients: + - url: https://loki.viktorbarzin.lan/loki/api/v1/push + tls_config: + insecure_skip_verify: true + +scrape_configs: + - job_name: journal + journal: + max_age: 12h + json: false + path: /var/log/journal + labels: + host: pve + job: pve-journal + relabel_configs: + - source_labels: ['__journal__systemd_unit'] + target_label: unit + - source_labels: ['__journal_priority_keyword'] + target_label: level + - source_labels: ['__journal_syslog_identifier'] + target_label: identifier + # sshd auth lines (identifier sshd / sshd-session) -> job=sshd-pve so the Loki S1 + # security rule ({job="sshd-pve"}) matches. snoopy command lines stay job=pve-journal. + - source_labels: ['__journal_syslog_identifier'] + regex: 'sshd.*' + target_label: job + replacement: 'sshd-pve' diff --git a/scripts/pve-snoopy.ini b/scripts/pve-snoopy.ini new file mode 100644 index 00000000..931bc29d --- /dev/null +++ b/scripts/pve-snoopy.ini @@ -0,0 +1,21 @@ +; snoopy config for the PVE host (192.168.1.127) — logs every execve() to journald. +; +; Install to /etc/snoopy.ini. Enable globally by adding the lib to /etc/ld.so.preload: +; apt-get install -y snoopy +; echo /usr/lib/x86_64-linux-gnu/libsnoopy.so > /etc/ld.so.preload # enable (no snoopy-enable in the Debian pkg) +; # disable/rollback: truncate -s 0 /etc/ld.so.preload (or remove the line) +; +; output=devlog writes directly to /dev/log -> journald (identifier "snoopy"). 
+; DO NOT use output=syslog on a systemd host — snoopy's own docs warn it can hang the system on boot. +; +; Shipped to Loki by promtail as {job="pve-journal", identifier="snoopy"} (scripts/pve-promtail.yaml). +; Attribution note: all sessions run as root (shared root key), so uid/login are always root; +; correlate a command's sid/time with the matching {job="sshd-pve"} "Accepted publickey ... SHA256:<fp>" +; line to attribute it to a person (e.g. emo's agent key fp SHA256:Wd+m0EABlm4RDDykDh85PIYSqe0Al8Hr9AZ+7Ksy4HQ). +[snoopy] +output = devlog +message_format = "snoopy uid=%{uid} login=%{login} tty=%{tty} sid=%{sid} cwd=%{cwd} : %{cmdline}" +syslog_ident = snoopy +syslog_facility = LOG_AUTHPRIV +syslog_level = LOG_INFO +filter_chain = "" diff --git a/scripts/setup-forgejo-containerd-mirror.sh b/scripts/setup-forgejo-containerd-mirror.sh index 975c2aa2..4c543d15 100755 --- a/scripts/setup-forgejo-containerd-mirror.sh +++ b/scripts/setup-forgejo-containerd-mirror.sh @@ -1,11 +1,24 @@ #!/usr/bin/env bash -# One-shot deployment of the forgejo.viktorbarzin.me containerd hosts.toml -# entry across every k8s node. Cloud-init only fires on VM provision, so -# existing nodes need this manual rollout. +# One-shot deployment of the (vestigial) forgejo containerd hosts.toml entry +# across every k8s node, plus cleanup of legacy node-side DNS customization. +# Cloud-init only fires on VM provision, so existing nodes need this manual +# rollout. +# +# Node DNS is intentionally STOCK: internal split-horizon for +# *.viktorbarzin.me happens at pfSense Unbound (domain override -> +# Technitium), whose split-horizon zone serves the live Traefik LB IP for +# every ingress host — nodes need no resolved drop-ins or /etc/hosts pins. 
+# The hosts.toml mirror alone CANNOT keep pulls internal: Traefik 404s its +# bare-IP requests (no Host/SNI match) and the registry Bearer auth realm is +# the absolute public URL fetched outside the mirror (2026-06-10 tuya-bridge +# outage; see docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md). # # What it does, per node: # 1. drain (ignore-daemonsets, delete-emptydir-data) -# 2. ssh in: mkdir + write /etc/containerd/certs.d/forgejo.viktorbarzin.me/hosts.toml +# 2. ssh in: remove legacy DNS customization (forgejo-internal-pin +# /etc/hosts lines, viktorbarzin.conf / global-dns.conf resolved +# drop-ins), restart systemd-resolved, +# write /etc/containerd/certs.d/forgejo.viktorbarzin.me/hosts.toml # 3. systemctl restart containerd # 4. uncordon # @@ -38,6 +51,10 @@ for n in $NODES; do ssh -o StrictHostKeyChecking=accept-new "wizard@$n" sudo bash <<EOF set -euo pipefail +sed -i '/forgejo-internal-pin/d' /etc/hosts +rm -f /etc/systemd/resolved.conf.d/viktorbarzin.conf \ + /etc/systemd/resolved.conf.d/global-dns.conf +systemctl restart systemd-resolved mkdir -p "$CERTS_DIR" cat > "$CERTS_DIR/hosts.toml" <<'TOML' $HOSTS_TOML diff --git a/scripts/sshd-10-breakglass.conf b/scripts/sshd-10-breakglass.conf new file mode 100644 index 00000000..96663d2b --- /dev/null +++ b/scripts/sshd-10-breakglass.conf @@ -0,0 +1,31 @@ +# Break-glass SSH drop-in (redesigned 2026-06-11). Source of truth. +# Deploy to the PVE host with: +# scp scripts/sshd-10-breakglass.conf root@192.168.1.127:/etc/ssh/sshd_config.d/10-breakglass.conf +# ssh root@192.168.1.127 'sshd -t && systemctl reload ssh' +# +# :22 = LAN admin, all of root's keys (default AuthorizedKeysFile). +# :52222 = WAN-exposed break-glass. The edge router forwards WAN tcp/52222 -> +# 192.168.1.127:52222 (external port MUST equal internal port on the +# TP-Link AX6000 — it rejects remaps; port 22 itself is reserved). 
+# The Match LocalPort block trusts ONLY the dedicated break-glass key +# (authorized_keys.breakglass), so a leak of any other root key does +# NOT grant internet access. Rate-limited by the BREAKGLASS iptables +# chain + fail2ban. No port-knock. +# +# NOTE: the trailing `Match all` is REQUIRED. /etc/ssh/sshd_config has +# `Include sshd_config.d/*.conf` near the top but a global `PermitRootLogin` +# further down; without `Match all` resetting context, that later global +# directive would be swallowed into the `Match LocalPort 52222` condition. +Port 22 +Port 52222 +PasswordAuthentication no +KbdInteractiveAuthentication no +PubkeyAuthentication yes +PermitRootLogin prohibit-password +MaxAuthTries 3 +LoginGraceTime 20 + +Match LocalPort 52222 + AuthorizedKeysFile /root/.ssh/authorized_keys.breakglass + PermitRootLogin prohibit-password +Match all diff --git a/scripts/t3-autoupdate.service b/scripts/t3-autoupdate.service index d3306da7..7b043f13 100644 --- a/scripts/t3-autoupdate.service +++ b/scripts/t3-autoupdate.service @@ -1,5 +1,5 @@ [Unit] -Description=Track latest t3 nightly (health-checked, idle-only restart) +Description=Enforce pinned t3 version (health-checked, idle-only restart) After=network-online.target Wants=network-online.target diff --git a/scripts/t3-autoupdate.sh b/scripts/t3-autoupdate.sh index 962f3fc4..a3928211 100644 --- a/scripts/t3-autoupdate.sh +++ b/scripts/t3-autoupdate.sh @@ -1,49 +1,185 @@ #!/usr/bin/env bash -# Track the latest t3 nightly — with a health-check + auto-rollback (lesson from -# the Keel auto-update incidents: never blindly trust a new build) and idle-only -# restarts (never kill an in-flight coding session). Runs as root via the unit. +# t3 GATED NIGHTLY TRACKER (daily, via t3-autoupdate.timer). +# +# t3 is pre-1.0 and ships breaking schema-migration + pairing-API changes between +# builds. 
On 2026-06-09 a blind `npm i -g t3@nightly` migrated every ~/.t3 +# state.sqlite FORWARD and moved the bootstrap API, breaking pairing for ALL users +# with no alert (post-mortem 2026-06-09-t3-nightly-autoupdate-auth-outage.md). We +# pinned in response. +# +# 2026-06-16 (Viktor's call, risk explicitly accepted): re-enable nightly tracking, +# but GATED so a bad nightly self-heals instead of breaking everyone. This script +# now follows the `nightly` npm dist-tag (T3_TRACK) under these guards: +# - freeze switch (/etc/t3-autoupdate.freeze) + optional hard pin (T3_PIN) for +# instant manual revert; a canary failure also self-freezes; +# - downgrade-guard (the nightly tag is mutable — never move backward); +# - pre-bump per-user state.sqlite backup BEFORE install (rollback => restore, +# not sqlite surgery), via the same online VACUUM INTO as t3-backup-state; +# - a health-check that seeds a throwaway instance with a COPY of a real +# POPULATED state.sqlite, so it exercises the forward MIGRATION (the actual +# 2026-06-09 failure class) + the real pairing handshake before trusting a build; +# - canary rollout: restart idle instances ONE AT A TIME, verifying pairing +# through the real dispatch after each, and roll back (binary + that user's DB) +# + self-freeze on the first failure — active-agent instances are deferred, +# never killed (deferred instances are recorded for t3-migrate-idle to drain); +# - rollback target is the recorded LAST-GOOD build, not "whatever was installed". +# Detection backstop (real-user pairing failure/fallback) lives in the dispatch +# logs + Loki alerts (T3PairingBroken / T3PairFallbackHigh / T3AutoUpdate*). +# To stop tracking: `sudo touch /etc/t3-autoupdate.freeze` (or set T3_PIN=<ver>). +# Full procedure + manual rollback: docs/runbooks/t3-version-bump.md. 
set -uo pipefail -LOG() { logger -t t3-autoupdate "$*"; echo "t3-autoupdate: $*"; } -ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; } +# ---- autoupdate-specific config (shared config + helpers come from the lib) ----- +T3_TRACK="${T3_TRACK:-nightly}" # npm dist-tag to follow (nightly | latest) +T3_PIN="${T3_PIN:-}" # optional HARD pin to an exact version (disables tracking) +SMOKE_PORT="${T3_SMOKE_PORT:-3799}" +DRY_RUN="${T3_DRY_RUN:-0}" +TMPROOT="${T3_TMPDIR:-/var/tmp}" # health-check scratch on DISK — /tmp is a 2G tmpfs and a populated state.sqlite (~hundreds of MB) overflows it -before=$(ver); LOG "current: ${before:-unknown}" -npm i -g t3@nightly >/dev/null 2>&1 || { LOG "npm install failed; staying on ${before:-current}"; exit 0; } -after=$(ver) +LOG_TAG=t3-autoupdate +# shellcheck source=scripts/t3-safe-restart.sh +. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}" -if [[ -z "$after" || "$after" == "$before" ]]; then - LOG "already latest (${before:-?}); nothing to do"; exit 0 +# is $1 a strictly-newer version than $2 (version-sort)? +newer() { [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -1)" = "$1" ]; } + +mkdir -p "$STATE_DIR" 2>/dev/null || true + +# ---- 0. freeze gate ------------------------------------------------------------- +if [ -e "$FREEZE_FILE" ]; then + LOG "FROZEN: $FREEZE_FILE present — holding at $(ver), not tracking $T3_TRACK"; exit 0 fi -LOG "installed $after (was $before); health-checking…" -# Health-check the NEW binary on a throwaway port/base-dir before trusting it. 
-SMOKE_PORT=3799; SMOKE_DIR=$(mktemp -d) -t3 serve --host 127.0.0.1 --port "$SMOKE_PORT" --base-dir "$SMOKE_DIR" >/dev/null 2>&1 & -smoke=$!; ok=0 -for _ in $(seq 1 15); do - [[ "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://127.0.0.1:$SMOKE_PORT/" 2>/dev/null)" == "200" ]] && { ok=1; break; } - sleep 2 -done -kill "$smoke" 2>/dev/null; wait "$smoke" 2>/dev/null; rm -rf "$SMOKE_DIR" +current="$(ver)" +[ -n "$current" ] || { LOG "cannot read current t3 version — aborting (is t3 installed?)"; exit 0; } +[ -s "$LAST_GOOD_FILE" ] || echo "$current" >"$LAST_GOOD_FILE" # seed last-good on first run +last_good="$(tr -d '[:space:]' <"$LAST_GOOD_FILE" 2>/dev/null)" +[ -n "$last_good" ] || last_good="$current" -if [[ "$ok" != "1" ]]; then - LOG "HEALTH-CHECK FAILED for $after — rolling back to $before" - if [[ -n "$before" ]] && npm i -g "t3@$before" >/dev/null 2>&1; then - LOG "rolled back to $before" - else - LOG "ROLLBACK FAILED — manual fix needed (t3 may be broken)" +# ---- 1. resolve target ---------------------------------------------------------- +if [ -n "$T3_PIN" ]; then + target="$T3_PIN" + LOG "T3_PIN=$T3_PIN set — enforcing pin (tracking disabled)" +else + target="$(npm view "t3@$T3_TRACK" version 2>/dev/null | tail -1 | tr -d '[:space:]')" + [ -n "$target" ] || { LOG "could not resolve t3@$T3_TRACK from npm — staying on $current"; exit 0; } +fi + +[ "$target" = "$current" ] && { LOG "already on $T3_TRACK=$current; nothing to do"; exit 0; } + +# ---- 2. 
downgrade + channel guard (mutable nightly tag can point backward) ------ +if [ -z "$T3_PIN" ]; then + newer "$target" "$current" || { LOG "resolved $T3_TRACK=$target is NOT newer than installed $current — refusing downgrade"; exit 0; } + if [ "$T3_TRACK" = "nightly" ]; then + case "$target" in *-nightly.*) : ;; *) LOG "resolved nightly target '$target' is not a nightly build — refusing"; exit 0;; esac fi - exit 1 fi -LOG "health OK; restarting idle instances" +LOG "candidate: $current -> $target (track=$T3_TRACK, last_good=$last_good, dry_run=$DRY_RUN)" -# Restart only IDLE per-user instances; defer any with an active agent child. -for unit in $(systemctl list-units --type=service --state=running --no-legend 't3-serve@*' | awk '{print $1}'); do - pid=$(systemctl show -p MainPID --value "$unit") - if [[ -n "$pid" && "$pid" != 0 ]] && pgrep -aP "$pid" 2>/dev/null | grep -qiE 'claude|codex|opencode'; then - LOG "deferring $unit (active agent) — updates next cycle when idle" +# ---- helpers: backup, health-check, rollback, restart-verify -------------------- +# Online consistent per-user snapshot (run AS the owner so WAL stays owned; never +# stops the serve). Sets $ADMIN_SEED to wizard's backup for the migration health +# check. Mirrors t3-backup-state.sh. (backup_user lives in the shared lib.) 
+ADMIN_SEED="" +backup_all() { + local u dst + for u in $(osusers); do + if dst="$(backup_user "$u")"; then + LOG "pre-bump backup: $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)" + [ "$u" = "wizard" ] && ADMIN_SEED="$dst" + else + LOG "WARN: pre-bump backup FAILED for $u (/home/$u/.t3/userdata/state.sqlite)" + fi + done + [ -n "$ADMIN_SEED" ] || ADMIN_SEED="$(ls -1t "$BACKUP_DIR"/*/"state-prebump-$target-"*.sqlite 2>/dev/null | head -1)" +} + +# health_check <t3bin> [seed_db]: start a throwaway serve (seeded with a copy of a +# real populated DB if given, so the forward migration runs on real data), then do +# the real mint -> credential-exchange -> t3_session pairing handshake with the +# dispatch's endpoint fallback, and sniff the serve log for a migration failure. +health_check() { + local t3bin="$1" seed="${2:-}" dir logf pid live=0 pair=0 migerr=0 cred ep hdr code seeded=fresh + dir="$(mktemp -d -p "$TMPROOT")"; mkdir -p "$dir/userdata"; logf="$dir/serve.log" + if [ -n "$seed" ] && [ -f "$seed" ]; then cp "$seed" "$dir/userdata/state.sqlite"; seeded=populated; fi + "$t3bin" serve --host 127.0.0.1 --port "$SMOKE_PORT" --base-dir "$dir" >"$logf" 2>&1 & + pid=$! 
+ for _ in $(seq 1 15); do + [ "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://127.0.0.1:$SMOKE_PORT/" 2>/dev/null)" = "200" ] && { live=1; break; } + sleep 2 + done + if [ "$live" = "1" ]; then + cred="$("$t3bin" auth pairing create --base-dir "$dir" --ttl 5m --json 2>/dev/null | tr -d '\n ' | sed -n 's/.*"credential":"\([^"]*\)".*/\1/p')" + if [ -n "$cred" ]; then + for ep in /api/auth/browser-session /api/auth/bootstrap; do + hdr="$(curl -s -i --max-time 5 -X POST -H 'Content-Type: application/json' -d "{\"credential\":\"$cred\"}" "http://127.0.0.1:$SMOKE_PORT$ep" 2>/dev/null)" + code="$(printf '%s' "$hdr" | sed -n '1s#.* \([0-9][0-9][0-9]\).*#\1#p')" + [ "$code" = "404" ] && continue + printf '%s' "$hdr" | grep -qi '^set-cookie:[[:space:]]*t3_session=' && pair=1 + break + done + fi + fi + grep -qiE 'migration failed|failed to migrate|no column named|NOT NULL constraint failed|PersistenceSqlError' "$logf" 2>/dev/null && migerr=1 + kill "$pid" 2>/dev/null; wait "$pid" 2>/dev/null + if [ "$live" = "1" ] && [ "$pair" = "1" ] && [ "$migerr" = "0" ]; then + LOG "health OK ($seeded: live + pairing handshake + clean migration)" + rm -rf "$dir"; return 0 + fi + LOG "HEALTH-CHECK FAILED ($seeded: live=$live pair=$pair migerr=$migerr); serve log: $(tail -3 "$logf" 2>/dev/null | tr '\n' '|')" + rm -rf "$dir"; return 1 +} + +# is this t3-serve@<unit> running an active agent (claude/codex/opencode)? never restart those. +unit_busy() { + local unit="$1" pid; pid="$(systemctl show -p MainPID --value "$unit" 2>/dev/null)" + [ -n "$pid" ] && [ "$pid" != "0" ] && pgrep -aP "$pid" 2>/dev/null | grep -qiE 'claude|codex|opencode' +} + +# ---- 3. 
DRY RUN: preview only (install candidate to temp prefix, gate it) ------- +if [ "$DRY_RUN" = "1" ]; then + LOG "DRY_RUN: would back up [$(osusers | tr '\n' ' ')]; testing candidate $target in a temp prefix (no global change, no restarts)" + tmp="$(mktemp -d -p "$TMPROOT")" + if npm i --prefix "$tmp" "t3@$target" >/dev/null 2>&1; then + seed="$(ls -1t "$BACKUP_DIR/wizard/state-"*.sqlite 2>/dev/null | head -1)" # reuse any existing backup as seed + if health_check "$tmp/node_modules/.bin/t3" "$seed"; then LOG "DRY_RUN: candidate $target PASSED the gate"; else LOG "DRY_RUN: candidate $target FAILED the gate"; fi else - systemctl restart "$unit" && LOG "restarted $unit -> $after" + LOG "DRY_RUN: npm could not fetch t3@$target" + fi + rm -rf "$tmp"; exit 0 +fi + +# ---- 4. pre-bump backup, then install ------------------------------------------- +backup_all +if ! npm i -g "t3@$target" >/dev/null 2>&1; then + LOG "npm install of t3@$target FAILED — staying on $current"; exit 0 +fi +installed="$(ver)" +[ "$installed" = "$target" ] || { LOG "post-install version is $installed, expected $target — rolling back"; rollback_binary; exit 1; } + +# ---- 5. gate the new binary on a POPULATED-DB migration + pairing --------------- +if ! health_check "$(command -v t3)" "$ADMIN_SEED"; then + rollback_binary; exit 1 # nothing restarted yet -> binary rollback is clean +fi +LOG "health gate passed for $target; canary-restarting idle instances one at a time" + +# ---- 6. 
canary rollout: idle instances one-by-one, verify pairing after each ---- +restarted=0; deferred=0 +for unit in $(systemctl list-units --type=service --state=running --no-legend 't3-serve@*' 2>/dev/null | awk '{print $1}'); do + u="$(printf '%s' "$unit" | sed -n 's/^t3-serve@\(.*\)\.service$/\1/p')"; [ -n "$u" ] || continue + if unit_busy "$unit"; then + LOG "deferring $unit (active agent) — migrates on its next idle restart" + mkdir -p "$DEFER_DIR" 2>/dev/null && printf '%s\n' "$target" >"$DEFER_DIR/$u" # record for t3-migrate-idle + deferred=$((deferred+1)); continue + fi + if safe_restart_unit "$unit" "$u"; then + restarted=$((restarted+1)) + rm -f "$DEFER_DIR/$u" 2>/dev/null # now current — clear any stale marker + else + exit 1 # frozen by safe_restart_unit — preserve today's behavior fi done -LOG "update complete: $after" + +# ---- 7. success: advance last-good ---------------------------------------------- +echo "$target" >"$LAST_GOOD_FILE" +LOG "update complete: $target (restarted=$restarted deferred=$deferred); last_good now $target" diff --git a/scripts/t3-autoupdate.timer b/scripts/t3-autoupdate.timer index a59135f7..65f1635a 100644 --- a/scripts/t3-autoupdate.timer +++ b/scripts/t3-autoupdate.timer @@ -1,10 +1,13 @@ [Unit] -Description=Daily t3 nightly auto-update +Description=Daily gated t3 nightly tracker (health-checked + canary + auto-rollback) [Timer] OnCalendar=*-*-* 04:00:00 RandomizedDelaySec=1h -Persistent=true +# Persistent deliberately OMITTED: this now installs a NEW build + migrates DBs + +# restarts serves, so a missed 04:00 run must NOT fire on boot mid-day with users +# active (a 2026-06-09 contributing factor). Skipping a day is fine — the next +# 04:00 picks up the latest nightly. 
[Install] WantedBy=timers.target diff --git a/scripts/t3-backup-state.service b/scripts/t3-backup-state.service new file mode 100644 index 00000000..5f590942 --- /dev/null +++ b/scripts/t3-backup-state.service @@ -0,0 +1,6 @@ +[Unit] +Description=Consistent backup of per-user t3 ~/.t3 state.sqlite (history + auth) + +[Service] +Type=oneshot +ExecStart=/usr/local/bin/t3-backup-state diff --git a/scripts/t3-backup-state.sh b/scripts/t3-backup-state.sh new file mode 100644 index 00000000..7d9f4cd1 --- /dev/null +++ b/scripts/t3-backup-state.sh @@ -0,0 +1,47 @@ +#!/usr/bin/env bash +# Consistent online backup of each t3 user's ~/.t3 state.sqlite (chat/session +# history AND auth tables). ~/.t3 lives on the devvm local disk — NOT a K8s PVC and +# NOT in the 3-2-1 pipeline — so without this it is the only copy and a rebuild +# loses it. It also makes a t3 version bump REVERSIBLE: 0.0.25+ migrate the schema +# FORWARD (a one-way door), so a clean pre-bump backup turns rollback into a restore +# instead of per-user sqlite surgery (see runbooks/t3-version-bump.md). Runs as root +# via t3-backup-state.timer; the per-user VACUUM INTO snapshot runs AS the owning user so the live +# WAL/-shm files keep their owner and the running t3-serve is never perturbed. +set -uo pipefail +DEST="${T3_BACKUP_DEST:-/var/backups/t3-state}" +# 6 (was 14): wizard's state.sqlite grew to ~1.1GB, and the gated nightly tracker +# adds a pre-bump snapshot per bump on top of this daily one — 14 x ~1.1GB would +# fill the devvm root fs. 6 is ample (rollback only ever needs the most recent +# pre-bump backup). Bump per user via T3_BACKUP_KEEP if a DB is small. +KEEP="${T3_BACKUP_KEEP:-6}" +MAP=/etc/ttyd-user-map +LOG() { logger -t t3-backup-state "$*"; echo "t3-backup-state: $*"; } + +ts=$(date +%Y%m%d-%H%M%S) +# RHS of each non-comment "authentik=os_user" line = an OS user owning a ~/.t3. 
+mapfile -t users < <(awk -F= '!/^[[:space:]]*#/ && NF==2 { gsub(/[[:space:]]/,"",$2); print $2 }' "$MAP" 2>/dev/null | sort -u) +[[ ${#users[@]} -gt 0 ]] || { LOG "no users in $MAP; nothing to back up"; exit 0; } + +rc=0 +for u in "${users[@]}"; do + src="/home/$u/.t3/userdata/state.sqlite" + if [[ ! -f "$src" ]]; then LOG "skip $u (no state.sqlite)"; continue; fi + out="$DEST/$u"; dst="$out/state-$ts.sqlite" + install -d -o "$u" -g "$u" -m 0700 "$out" + # VACUUM INTO takes a consistent read-snapshot copy — unlike .backup it does NOT + # restart when the source is written mid-copy, so it finishes in a single pass even + # for the actively-used instance (the admin's own live session, which .backup would + # loop on forever). Run as the owning user so WAL access keeps the live serve happy. + # timeout caps a pathologically-slow copy (huge DB + concurrent writes on a contended + # disk) so the daily run can never wedge — it just logs + retries next cycle. The + # daily 03:30 slot normally finds instances idle, where even a large DB copies fast. 
+ if runuser -u "$u" -- timeout "${T3_BACKUP_TIMEOUT:-900}" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [[ -s "$dst" ]]; then + LOG "backed up $u -> $dst ($(stat -c%s "$dst" 2>/dev/null) bytes)" + else + LOG "WARN: backup FAILED for $u ($src)"; rc=1; rm -f "$dst" + fi + # retention: keep newest $KEEP per user + ls -1t "$out"/state-*.sqlite 2>/dev/null | tail -n +$((KEEP+1)) | xargs -r rm -f +done +LOG "done (rc=$rc)" +exit $rc diff --git a/scripts/t3-backup-state.timer b/scripts/t3-backup-state.timer new file mode 100644 index 00000000..72ac48e5 --- /dev/null +++ b/scripts/t3-backup-state.timer @@ -0,0 +1,10 @@ +[Unit] +Description=Daily t3 state.sqlite backup (the only copy of ~/.t3; enables version-bump rollback) + +[Timer] +OnCalendar=*-*-* 03:30:00 +RandomizedDelaySec=20m +Persistent=true + +[Install] +WantedBy=timers.target diff --git a/scripts/t3-dispatch/go.mod b/scripts/t3-dispatch/go.mod index 26e9388b..3a0fef0f 100644 --- a/scripts/t3-dispatch/go.mod +++ b/scripts/t3-dispatch/go.mod @@ -1,3 +1,5 @@ module t3-dispatch go 1.22 + +require github.com/gorilla/websocket v1.5.3 diff --git a/scripts/t3-dispatch/go.sum b/scripts/t3-dispatch/go.sum new file mode 100644 index 00000000..25a9fc4b --- /dev/null +++ b/scripts/t3-dispatch/go.sum @@ -0,0 +1,2 @@ +github.com/gorilla/websocket v1.5.3 h1:saDtZ6Pbx/0u+bgYQ3q96pZgCzfhKXGPqt7kZ72aNNg= +github.com/gorilla/websocket v1.5.3/go.mod h1:YR8l580nyteQvAITg2hZ9XVh4b55+EU/adAjf1fMHhE= diff --git a/scripts/t3-dispatch/main.go b/scripts/t3-dispatch/main.go index cebcc119..05e59304 100644 --- a/scripts/t3-dispatch/main.go +++ b/scripts/t3-dispatch/main.go @@ -59,14 +59,103 @@ func lookup(ak string) (entry, bool) { return e, ok } +// mintToken mints a one-time pairing token for osUser via the scoped sudoers +// entry (the dispatch service can invoke nothing else). Indirected through a var +// so tests can stub the privileged exec. 
+var mintToken = func(osUser string) ([]byte, error) { + return exec.Command("sudo", "-n", "/usr/local/bin/t3-mint", osUser).Output() +} + +var sessionClient = &http.Client{Timeout: 5 * time.Second} + +// sessionValid asks the user's instance whether the presented t3_session cookie +// is still valid. Server-side sessions can be wiped/expired independently of the +// 30-day cookie (e.g. an auth-schema rollback drops every session row), leaving +// the browser with a live-looking but dead cookie. Fails OPEN: any error/non-200/ +// parse failure returns true so the request still proxies — a re-pair is forced +// only on a definitive authenticated:false. +func sessionValid(e entry, c *http.Cookie) bool { + req, err := http.NewRequest(http.MethodGet, + fmt.Sprintf("http://127.0.0.1:%d/api/auth/session", e.Port), nil) + if err != nil { + return true + } + req.AddCookie(c) + resp, err := sessionClient.Do(req) + if err != nil { + return true + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusOK { + return true + } + var s struct { + Authenticated bool `json:"authenticated"` + } + if json.NewDecoder(resp.Body).Decode(&s) != nil { + return true + } + return s.Authenticated +} + +// isDocumentNav reports whether r is a top-level browser document navigation, as +// opposed to an XHR/fetch/asset/WebSocket sub-request. Only such requests are +// safe to answer with a re-pair 302 — redirecting a sub-resource would corrupt +// the SPA's fetch/WebSocket contract. Trust Sec-Fetch-Dest when present (all +// modern browsers send it); fall back to the Accept header otherwise. +func isDocumentNav(r *http.Request) bool { + if r.Method != http.MethodGet { + return false + } + if dest := r.Header.Get("Sec-Fetch-Dest"); dest != "" { + return dest == "document" + } + return strings.Contains(r.Header.Get("Accept"), "text/html") +} + +// pairEndpoints are the instance's session-bootstrap paths in preference order. 
+// t3 renamed /api/auth/bootstrap -> /api/auth/browser-session in 0.0.25; trying the +// new name first and falling back to the old lets ONE dispatch binary pair against +// either version — so the t3 pin can move forward (and survive a rolling-restart +// skew where some instances are already on the new version) without a 502 storm. +var pairEndpoints = []string{"/api/auth/browser-session", "/api/auth/bootstrap"} + +// exchangeCredential POSTs the pairing credential to the user's instance, trying +// each pairEndpoint in turn. A 404 means "absent in this t3 version" -> try the +// next; any other status is that endpoint's verdict, returned as-is. It also +// returns WHICH endpoint answered, so the caller can log the browser-session -> +// bootstrap fallback rate (a non-zero rate flags that the running t3 build moved +// the pairing API — the 2026-06-09 contract-drift class). Caller owns resp.Body. +func exchangeCredential(port int, credential string) (*http.Response, string, error) { + body, _ := json.Marshal(map[string]string{"credential": credential}) + var lastErr error + for _, ep := range pairEndpoints { + resp, err := http.Post(fmt.Sprintf("http://127.0.0.1:%d%s", port, ep), + "application/json", bytes.NewReader(body)) + if err != nil { + lastErr = err + continue + } + if resp.StatusCode == http.StatusNotFound { + resp.Body.Close() // endpoint absent in this t3 version — try the next + continue + } + return resp, ep, nil + } + if lastErr != nil { + return nil, "", lastErr + } + return nil, "", fmt.Errorf("no pairing endpoint accepted the request (all returned 404)") +} + // autoPair mints a one-time pairing token for the user's instance (as that OS -// user, via the scoped sudoers entry) and exchanges it at the instance's -// /api/auth/bootstrap, relaying the returned t3_session Set-Cookie to the browser. 
+// user, via the scoped sudoers entry) and exchanges it at the instance's pairing +// endpoint, relaying the returned t3_session Set-Cookie to the browser. func autoPair(e entry, w http.ResponseWriter, r *http.Request) { // t3-mint (root, via scoped sudoers) validates the OS user is in // /etc/ttyd-user-map, then mints as that user. The dispatch service itself // runs unprivileged and can invoke nothing else. - out, err := exec.Command("sudo", "-n", "/usr/local/bin/t3-mint", e.OsUser).Output() + out, err := mintToken(e.OsUser) if err != nil { log.Printf("mint for %s failed: %v", e.OsUser, err) http.Error(w, "pairing mint failed", http.StatusInternalServerError) @@ -79,22 +168,25 @@ func autoPair(e entry, w http.ResponseWriter, r *http.Request) { http.Error(w, "unparseable pairing output", http.StatusInternalServerError) return } - body, _ := json.Marshal(map[string]string{"credential": pc.Credential}) - resp, err := http.Post(fmt.Sprintf("http://127.0.0.1:%d/api/auth/bootstrap", e.Port), - "application/json", bytes.NewReader(body)) + resp, ep, err := exchangeCredential(e.Port, pc.Credential) if err != nil { + log.Printf("pairing exchange for %s failed: %v", e.OsUser, err) http.Error(w, "bootstrap request failed", http.StatusBadGateway) return } defer resp.Body.Close() if resp.StatusCode != http.StatusOK { - log.Printf("bootstrap for %s returned %d", e.OsUser, resp.StatusCode) + log.Printf("pairing for %s returned %d (endpoint=%s)", e.OsUser, resp.StatusCode, ep) http.Error(w, "bootstrap rejected", http.StatusBadGateway) return } for _, c := range resp.Cookies() { http.SetCookie(w, c) // relays t3_session (HttpOnly; Path=/; SameSite=Lax) } + // Success line is the steady-state signal: endpoint= which pairing path won, + // fallback=true iff we fell back off the first-preference endpoint (running + // t3 build moved the pairing API). t3-probe / Loki alert on the fallback rate. 
+ log.Printf("paired user=%s endpoint=%s fallback=%t", e.OsUser, ep, ep != pairEndpoints[0]) http.Redirect(w, r, "/", http.StatusFound) } @@ -111,13 +203,79 @@ func handler(w http.ResponseWriter, r *http.Request) { http.Error(w, "no t3 instance provisioned for this user", http.StatusForbidden) return } - if _, err := r.Cookie(cookieName); err != nil { + c, err := r.Cookie(cookieName) + if err != nil { + autoPair(e, w, r) + return + } + // A present cookie can still be server-side-invalid (sessions wiped/expired + // while the 30-day cookie lingers). On a top-level navigation, verify it and + // re-pair if dead — otherwise the instance just renders its pair page. Gated + // to document navs so we never 302 an XHR/asset/WebSocket sub-request. + if isDocumentNav(r) && !sessionValid(e, c) { autoPair(e, w, r) return } // Steady state: reverse-proxy (incl. WebSocket upgrade) to the user's instance. target, _ := url.Parse(fmt.Sprintf("http://127.0.0.1:%d", e.Port)) - httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r) + proxy := httputil.NewSingleHostReverseProxy(target) + + // WebSocket connection logging: t3 drops manifest as the client's 20s + // heartbeat watchdog reconnecting, so a flood of short-lived /ws connections + // IS the symptom. Log each WS open + close (duration + which side hung up) so + // a drop is attributable from logs alone — graceful closes otherwise leave no + // trace (the default ReverseProxy only logs on error). cause stays "graceful" + // unless ErrorHandler fires; ErrorHandler runs within ServeHTTP, so reading + // cause after ServeHTTP returns needs no synchronisation. 
+ if isWebSocket(r) { + start := time.Now() + ip := clientIP(r) + cause := "graceful" + proxy.ErrorHandler = func(rw http.ResponseWriter, _ *http.Request, err error) { + cause = classifyClose(err) + } + log.Printf("ws open user=%s ip=%s", e.OsUser, ip) + proxy.ServeHTTP(w, r) + log.Printf("ws close user=%s ip=%s dur_ms=%d cause=%s", + e.OsUser, ip, time.Since(start).Milliseconds(), cause) + return + } + proxy.ServeHTTP(w, r) +} + +// isWebSocket reports whether r is a WebSocket upgrade request. +func isWebSocket(r *http.Request) bool { + return strings.EqualFold(r.Header.Get("Upgrade"), "websocket") && + strings.Contains(strings.ToLower(r.Header.Get("Connection")), "upgrade") +} + +// clientIP returns the forwarded client chain (X-Forwarded-For, set by +// Traefik/CF) when present, else the immediate peer — for correlating a drop +// to a specific client/edge. +func clientIP(r *http.Request) string { + if xff := r.Header.Get("X-Forwarded-For"); xff != "" { + return xff + } + return r.RemoteAddr +} + +// classifyClose maps a reverse-proxy copy error to which side ended the socket: +// downstream (client/CF/Traefik went away) vs upstream (the user's t3 serve +// closed/reset). Distinguishes a last-mile/client drop from a t3-serve stall. 
+func classifyClose(err error) string { + if err == nil { + return "graceful" + } + s := err.Error() + switch { + case strings.Contains(s, "context canceled"): + return "downstream_closed" // client / CF / Traefik tore down + case strings.Contains(s, "reset by peer"), strings.Contains(s, "broken pipe"), + strings.Contains(s, "EOF"), strings.Contains(s, "connection refused"): + return "upstream_closed" // t3 serve closed / unreachable + default: + return s + } } func main() { @@ -133,6 +291,7 @@ func main() { }() mux := http.NewServeMux() mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) { _, _ = w.Write([]byte("ok\n")) }) + registerProbe(mux) mux.HandleFunc("/", handler) log.Printf("t3-dispatch listening on %s", listenAddr) log.Fatal(http.ListenAndServe(listenAddr, mux)) diff --git a/scripts/t3-dispatch/main_test.go b/scripts/t3-dispatch/main_test.go new file mode 100644 index 00000000..8d021a24 --- /dev/null +++ b/scripts/t3-dispatch/main_test.go @@ -0,0 +1,399 @@ +package main + +import ( + "net/http" + "net/http/httptest" + "net/url" + "strconv" + "strings" + "testing" + + "github.com/gorilla/websocket" +) + +func portOf(t *testing.T, ts *httptest.Server) int { + t.Helper() + u, err := url.Parse(ts.URL) + if err != nil { + t.Fatalf("parse %s: %v", ts.URL, err) + } + p, err := strconv.Atoi(u.Port()) + if err != nil { + t.Fatalf("port %s: %v", u.Port(), err) + } + return p +} + +func TestIsDocumentNav(t *testing.T) { + cases := []struct { + name string + method string + headers map[string]string + want bool + }{ + {"GET sec-fetch-dest document", "GET", map[string]string{"Sec-Fetch-Dest": "document"}, true}, + {"GET accept html (no sec-fetch)", "GET", map[string]string{"Accept": "text/html,application/xhtml+xml"}, true}, + {"GET xhr empty dest beats accept", "GET", map[string]string{"Sec-Fetch-Dest": "empty", "Accept": "text/html"}, false}, + {"GET json", "GET", map[string]string{"Accept": "application/json"}, false}, + {"POST html", "POST", 
map[string]string{"Accept": "text/html"}, false}, + {"GET no headers", "GET", map[string]string{}, false}, + } + for _, c := range cases { + t.Run(c.name, func(t *testing.T) { + r, _ := http.NewRequest(c.method, "/", nil) + for k, v := range c.headers { + r.Header.Set(k, v) + } + if got := isDocumentNav(r); got != c.want { + t.Errorf("isDocumentNav = %v, want %v", got, c.want) + } + }) + } +} + +func sessionServer(status int, body string) *httptest.Server { + return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + if r.URL.Path != "/api/auth/session" { + http.NotFound(w, r) + return + } + w.WriteHeader(status) + _, _ = w.Write([]byte(body)) + })) +} + +func TestSessionValid(t *testing.T) { + ck := &http.Cookie{Name: cookieName, Value: "x"} + + t.Run("authenticated true -> valid", func(t *testing.T) { + ts := sessionServer(200, `{"authenticated":true}`) + defer ts.Close() + if !sessionValid(entry{Port: portOf(t, ts)}, ck) { + t.Fatal("want valid (true) for authenticated:true") + } + }) + t.Run("authenticated false -> invalid", func(t *testing.T) { + ts := sessionServer(200, `{"authenticated":false}`) + defer ts.Close() + if sessionValid(entry{Port: portOf(t, ts)}, ck) { + t.Fatal("want invalid (false) for authenticated:false") + } + }) + t.Run("500 -> fail-open valid", func(t *testing.T) { + ts := sessionServer(500, `boom`) + defer ts.Close() + if !sessionValid(entry{Port: portOf(t, ts)}, ck) { + t.Fatal("want fail-open true on 500") + } + }) + t.Run("malformed json -> fail-open valid", func(t *testing.T) { + ts := sessionServer(200, `not json`) + defer ts.Close() + if !sessionValid(entry{Port: portOf(t, ts)}, ck) { + t.Fatal("want fail-open true on unparseable body") + } + }) + t.Run("unreachable -> fail-open valid", func(t *testing.T) { + ts := sessionServer(200, `{"authenticated":false}`) + p := portOf(t, ts) + ts.Close() // nothing listening now + if !sessionValid(entry{Port: p}, ck) { + t.Fatal("want fail-open true on 
connection refused") + } + }) +} + +// fakeInstance serves the three endpoints the dispatcher touches: the session +// check, the bootstrap exchange, and a catch-all standing in for the proxied app. +func fakeInstance(authenticated bool, bootstrapCalled *bool) *httptest.Server { + return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + switch r.URL.Path { + case "/api/auth/session": + if authenticated { + _, _ = w.Write([]byte(`{"authenticated":true}`)) + } else { + _, _ = w.Write([]byte(`{"authenticated":false}`)) + } + case "/api/auth/bootstrap": + if bootstrapCalled != nil { + *bootstrapCalled = true + } + http.SetCookie(w, &http.Cookie{Name: cookieName, Value: "fresh", Path: "/"}) + _, _ = w.Write([]byte(`{"authenticated":true}`)) + case "/api/auth/browser-session": + http.NotFound(w, r) // models a 0.0.24 instance: the 0.0.25 endpoint is absent + default: + _, _ = w.Write([]byte("APP")) + } + })) +} + +func setTable(port int) { + mu.Lock() + table = map[string]entry{"vbarzin": {OsUser: "wizard", Port: port}} + mu.Unlock() +} + +func TestHandlerRepairsOnInvalidCookieDocNav(t *testing.T) { + called := false + ts := fakeInstance(false, &called) + defer ts.Close() + setTable(portOf(t, ts)) + + orig := mintToken + mintToken = func(string) ([]byte, error) { return []byte(`{"credential":"tok"}`), nil } + defer func() { mintToken = orig }() + + r := httptest.NewRequest("GET", "/", nil) + r.Header.Set("X-authentik-username", "vbarzin@gmail.com") + r.Header.Set("Sec-Fetch-Dest", "document") + r.AddCookie(&http.Cookie{Name: cookieName, Value: "stale"}) + w := httptest.NewRecorder() + + handler(w, r) + + if w.Code != http.StatusFound { + t.Fatalf("stale cookie on doc-nav should re-pair (302), got %d body=%q", w.Code, w.Body.String()) + } + if !called { + t.Fatal("expected bootstrap to be called during re-pair") + } + cookies := w.Result().Cookies() + if len(cookies) == 0 || cookies[0].Value != "fresh" { + t.Fatalf("expected fresh 
t3_session relayed, got %+v", cookies) + } +} + +func TestHandlerProxiesOnValidCookie(t *testing.T) { + ts := fakeInstance(true, nil) + defer ts.Close() + setTable(portOf(t, ts)) + + r := httptest.NewRequest("GET", "/", nil) + r.Header.Set("X-authentik-username", "vbarzin@gmail.com") + r.Header.Set("Sec-Fetch-Dest", "document") + r.AddCookie(&http.Cookie{Name: cookieName, Value: "good"}) + w := httptest.NewRecorder() + + handler(w, r) + + if w.Code != http.StatusOK || w.Body.String() != "APP" { + t.Fatalf("valid cookie should proxy (200 APP), got %d %q", w.Code, w.Body.String()) + } +} + +func TestHandlerProxiesXHREvenIfCookieInvalid(t *testing.T) { + called := false + ts := fakeInstance(false, &called) // session would say invalid, but XHR must NOT be re-paired + defer ts.Close() + setTable(portOf(t, ts)) + + r := httptest.NewRequest("GET", "/api/threads", nil) + r.Header.Set("X-authentik-username", "vbarzin@gmail.com") + r.Header.Set("Sec-Fetch-Dest", "empty") // XHR/fetch, not a document nav + r.AddCookie(&http.Cookie{Name: cookieName, Value: "stale"}) + w := httptest.NewRecorder() + + handler(w, r) + + if called { + t.Fatal("must NOT re-pair (302) a non-document sub-request — would corrupt the SPA fetch contract") + } + if w.Code != http.StatusOK || w.Body.String() != "APP" { + t.Fatalf("XHR should proxy through, got %d %q", w.Code, w.Body.String()) + } +} + +// pairInstance simulates a t3 instance that exposes pairing at exactly one path +// (200 + t3_session) and 404s the other known path — modeling the 0.0.25 rename of +// /api/auth/bootstrap -> /api/auth/browser-session. records which path was hit. 
+func pairInstance(pairPath string, hit *string) *httptest.Server { + return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + switch r.URL.Path { + case "/api/auth/browser-session", "/api/auth/bootstrap": + if r.URL.Path != pairPath { + http.NotFound(w, r) // endpoint absent in this t3 version + return + } + if hit != nil { + *hit = r.URL.Path + } + http.SetCookie(w, &http.Cookie{Name: cookieName, Value: "fresh", Path: "/"}) + _, _ = w.Write([]byte(`{"authenticated":true}`)) + default: + http.NotFound(w, r) + } + })) +} + +// TestAutoPairAcrossVersions: one dispatch binary must pair against BOTH the +// 0.0.24 endpoint (/api/auth/bootstrap) and the 0.0.25 one (/api/auth/browser-session), +// so the pin can move forward (and survive rolling-restart skew) without a 502 storm. +func TestAutoPairAcrossVersions(t *testing.T) { + orig := mintToken + mintToken = func(string) ([]byte, error) { return []byte(`{"credential":"tok"}`), nil } + defer func() { mintToken = orig }() + + for _, tc := range []struct{ name, pairPath string }{ + {"0.0.25 browser-session", "/api/auth/browser-session"}, + {"0.0.24 bootstrap", "/api/auth/bootstrap"}, + } { + t.Run(tc.name, func(t *testing.T) { + var hit string + ts := pairInstance(tc.pairPath, &hit) + defer ts.Close() + setTable(portOf(t, ts)) + + r := httptest.NewRequest("GET", "/", nil) + r.Header.Set("X-authentik-username", "vbarzin@gmail.com") // no cookie -> autoPair + w := httptest.NewRecorder() + handler(w, r) + + if w.Code != http.StatusFound { + t.Fatalf("want 302 re-pair, got %d body=%q", w.Code, w.Body.String()) + } + if hit != tc.pairPath { + t.Fatalf("want pairing via %s, hit=%q", tc.pairPath, hit) + } + if cs := w.Result().Cookies(); len(cs) == 0 || cs[0].Value != "fresh" { + t.Fatalf("want fresh t3_session relayed, got %+v", cs) + } + }) + } +} + +// TestExchangeCredentialReportsEndpoint: exchangeCredential must report WHICH +// pairing endpoint accepted the credential, so the dispatch 
can log it and we +// can alert on the browser-session -> bootstrap fallback rate (a non-zero rate +// means the running t3 build moved/renamed the pairing API — contract drift, the +// 2026-06-09 failure class). fallback = endpoint is not the first-preference one. +func TestExchangeCredentialReportsEndpoint(t *testing.T) { + for _, tc := range []struct { + name, pairPath, wantEP string + wantFallback bool + }{ + {"0.0.25 browser-session (primary)", "/api/auth/browser-session", "/api/auth/browser-session", false}, + {"0.0.24 bootstrap (fallback)", "/api/auth/bootstrap", "/api/auth/bootstrap", true}, + } { + t.Run(tc.name, func(t *testing.T) { + var hit string + ts := pairInstance(tc.pairPath, &hit) + defer ts.Close() + + resp, ep, err := exchangeCredential(portOf(t, ts), "tok") + if err != nil { + t.Fatalf("exchangeCredential: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusOK { + t.Fatalf("status = %d, want 200", resp.StatusCode) + } + if ep != tc.wantEP { + t.Fatalf("endpoint = %q, want %q", ep, tc.wantEP) + } + if gotFallback := ep != pairEndpoints[0]; gotFallback != tc.wantFallback { + t.Fatalf("fallback = %v, want %v", gotFallback, tc.wantFallback) + } + }) + } +} + +func TestProbeHealthz(t *testing.T) { + mux := http.NewServeMux() + registerProbe(mux) + ts := httptest.NewServer(mux) + defer ts.Close() + resp, err := http.Get(ts.URL + "/probe/healthz") + if err != nil { + t.Fatalf("GET /probe/healthz: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusOK { + t.Errorf("status = %d, want 200", resp.StatusCode) + } +} + +func TestProbeWSEcho(t *testing.T) { + mux := http.NewServeMux() + registerProbe(mux) + ts := httptest.NewServer(mux) + defer ts.Close() + wsURL := "ws" + strings.TrimPrefix(ts.URL, "http") + "/probe/ws" + c, _, err := websocket.DefaultDialer.Dial(wsURL, nil) + if err != nil { + t.Fatalf("dial %s: %v", wsURL, err) + } + defer c.Close() + for _, msg := range []string{"ping 1718000000", "ping 
1718000010"} { + if err := c.WriteMessage(websocket.TextMessage, []byte(msg)); err != nil { + t.Fatalf("write: %v", err) + } + _, got, err := c.ReadMessage() + if err != nil { + t.Fatalf("read: %v", err) + } + if string(got) != msg { + t.Errorf("echo = %q, want %q", got, msg) + } + } +} + +func TestIsWebSocket(t *testing.T) { + cases := []struct { + up, conn string + want bool + }{ + {"websocket", "Upgrade", true}, + {"websocket", "keep-alive, Upgrade", true}, + {"WebSocket", "upgrade", true}, + {"", "keep-alive", false}, + {"h2c", "Upgrade", false}, + {"websocket", "keep-alive", false}, + } + for _, c := range cases { + r, _ := http.NewRequest("GET", "/ws", nil) + if c.up != "" { + r.Header.Set("Upgrade", c.up) + } + r.Header.Set("Connection", c.conn) + if got := isWebSocket(r); got != c.want { + t.Errorf("isWebSocket(up=%q conn=%q)=%v want %v", c.up, c.conn, got, c.want) + } + } +} + +func TestClassifyClose(t *testing.T) { + cases := []struct { + in error + want string + }{ + {nil, "graceful"}, + {errTest("context canceled"), "downstream_closed"}, + {errTest("read tcp 127.0.0.1:60664->127.0.0.1:3773: read: connection reset by peer"), "upstream_closed"}, + {errTest("write: broken pipe"), "upstream_closed"}, + {errTest("unexpected EOF"), "upstream_closed"}, + {errTest("dial tcp 127.0.0.1:3773: connect: connection refused"), "upstream_closed"}, + {errTest("some novel error"), "some novel error"}, + } + for _, c := range cases { + if got := classifyClose(c.in); got != c.want { + t.Errorf("classifyClose(%v)=%q want %q", c.in, got, c.want) + } + } +} + +type errTest string + +func (e errTest) Error() string { return string(e) } + +func TestClientIP(t *testing.T) { + r, _ := http.NewRequest("GET", "/ws", nil) + r.RemoteAddr = "10.0.0.5:1234" + if got := clientIP(r); got != "10.0.0.5:1234" { + t.Errorf("clientIP no-xff = %q", got) + } + r.Header.Set("X-Forwarded-For", "1.2.3.4, 10.10.1.1") + if got := clientIP(r); got != "1.2.3.4, 10.10.1.1" { + t.Errorf("clientIP xff = 
%q", got) + } +} diff --git a/scripts/t3-dispatch/probe.go b/scripts/t3-dispatch/probe.go new file mode 100644 index 00000000..df689bce --- /dev/null +++ b/scripts/t3-dispatch/probe.go @@ -0,0 +1,49 @@ +// probe.go: unauthenticated path-health surface for the in-cluster t3-probe. +// /probe/* is carved out of Authentik (stacks/t3code `module "ingress_probe"`) +// so a synthetic client can hold a long-lived WebSocket here via two routes +// (Cloudflare edge vs internal Traefik) and attribute connection drops to a +// path segment. It echoes tiny frames and reaches no t3 instance — nothing +// user-grade is exposed. +package main + +import ( + "net/http" + "time" + + "github.com/gorilla/websocket" +) + +// Reap connections whose client went silent; the probe pings every 10s, so 90s +// of silence means the peer is gone even if TCP never noticed. +const probeIdleLimit = 90 * time.Second + +var probeUpgrader = websocket.Upgrader{ + // No cookies or credentials are at stake on an echo endpoint, and the + // probe connects without a browser Origin — checking it would only break it. 
+ CheckOrigin: func(*http.Request) bool { return true }, +} + +func registerProbe(mux *http.ServeMux) { + mux.HandleFunc("/probe/healthz", func(w http.ResponseWriter, _ *http.Request) { + _, _ = w.Write([]byte("ok\n")) + }) + mux.HandleFunc("/probe/ws", func(w http.ResponseWriter, r *http.Request) { + c, err := probeUpgrader.Upgrade(w, r, nil) + if err != nil { + return // Upgrade has already written the HTTP error + } + defer c.Close() + for { + if err := c.SetReadDeadline(time.Now().Add(probeIdleLimit)); err != nil { + return + } + mt, msg, err := c.ReadMessage() + if err != nil { + return + } + if err := c.WriteMessage(mt, msg); err != nil { + return + } + } + }) +} diff --git a/scripts/t3-migrate-idle.service b/scripts/t3-migrate-idle.service new file mode 100644 index 00000000..97c28faa --- /dev/null +++ b/scripts/t3-migrate-idle.service @@ -0,0 +1,8 @@ +[Unit] +Description=t3 idle migrator — restart deferred t3-serve instances onto the current binary when idle +Documentation=https://forgejo.viktorbarzin.me/viktor/infra/src/branch/master/docs/plans/2026-06-21-t3-idle-migrate-design.md +After=network.target t3-dispatch.service + +[Service] +Type=oneshot +ExecStart=/usr/local/bin/t3-migrate-idle diff --git a/scripts/t3-migrate-idle.sh b/scripts/t3-migrate-idle.sh new file mode 100644 index 00000000..85835374 --- /dev/null +++ b/scripts/t3-migrate-idle.sh @@ -0,0 +1,86 @@ +#!/usr/bin/env bash +# t3-migrate-idle.sh — drains t3-autoupdate's deferral markers (via the overnight +# t3-migrate-idle.timer). For each deferred t3-serve@<user>, if nothing is actively +# working in that instance (no in-flight turn + a quiet buffer), restart it onto the +# current binary using the shared safe_restart_unit, then clear the marker. +# Why this exists: t3-autoupdate defers a user with an active agent at its single +# daily window; a user busy every night never migrates and their client shows +# "Client and server versions differ". See docs/plans/2026-06-21-t3-idle-migrate-*. 
+set -uo pipefail + +LOG_TAG=t3-migrate-idle +# shellcheck source=scripts/t3-safe-restart.sh +. "${T3_SAFE_RESTART_LIB:-/usr/local/lib/t3-safe-restart.sh}" + +QUIET_SECONDS="${T3_MIGRATE_QUIET_SECONDS:-900}" # required idle before a restart (15 min) +DRY_RUN="${T3_DRY_RUN:-0}" + +# pure logic: is it safe given <active_turns> and <idle_seconds>? fail closed. +gate_is_safe() { + local active="$1" idle="$2" + case "$active" in ''|*[!0-9]*) return 1;; esac # unparseable/empty active -> unsafe + [ "$active" -eq 0 ] || return 1 # a turn is running -> unsafe + [ -z "$idle" ] && return 0 # no threads at all -> safe + case "$idle" in ''|*[!0-9-]*) return 1;; esac # non-numeric -> unsafe + [ "$idle" -ge "$QUIET_SECONDS" ] # negative or < quiet -> unsafe +} + +# query a state.sqlite (path or file: URI). Echoes "<active_turns>|<idle_seconds>". +# idle_seconds is empty when there are no rows. Normalizes ISO 'T'/'Z' for julianday. +gate_query() { + local db="$1" + sqlite3 -batch -noheader -separator '|' "$db" \ + "SELECT + (SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL), + CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT) + FROM projection_thread_sessions;" +} + +# safe_to_restart <user>: run gate_query's SQL inline as <user> (runuser, read-only file: URI) and pass the result to gate_is_safe.
+safe_to_restart() { + local u="$1" db row + db="/home/$u/.t3/userdata/state.sqlite"; [ -f "$db" ] || return 1 + row="$(runuser -u "$u" -- sqlite3 -batch -noheader -separator '|' "file:$db?mode=ro" \ + "SELECT + (SELECT count(*) FROM projection_thread_sessions WHERE active_turn_id IS NOT NULL), + CAST((julianday('now') - julianday(replace(replace(max(updated_at),'T',' '),'Z',''))) * 86400 AS INT) + FROM projection_thread_sessions;" 2>/dev/null)" || return 1 + gate_is_safe "${row%%|*}" "${row##*|}" +} + +main() { + # a frozen build must not be auto-migrated (shared switch with t3-autoupdate) + if [ -e "$FREEZE_FILE" ]; then LOG "FROZEN: $FREEZE_FILE present — not draining deferrals"; exit 0; fi + [ -d "$DEFER_DIR" ] || exit 0 # nothing deferred + last_good="$(tr -d '[:space:]' <"$LAST_GOOD_FILE" 2>/dev/null)" # rollback target for the helper + + local marker u unit started mwritten migrated=0 skipped=0 + for marker in "$DEFER_DIR"/*; do + [ -e "$marker" ] || continue # empty-dir glob + u="$(basename "$marker")"; unit="t3-serve@$u.service" + if ! systemctl is-active --quiet "$unit"; then + LOG "clearing marker for $u: $unit not active"; rm -f "$marker"; continue + fi + started="$(date -d "$(systemctl show -p ActiveEnterTimestamp --value "$unit" 2>/dev/null)" +%s 2>/dev/null || echo 0)" + mwritten="$(stat -c %Y "$marker" 2>/dev/null || echo 0)" + if [ "$started" -gt "$mwritten" ]; then + LOG "clearing marker for $u: $unit already restarted $((started-mwritten))s after the deferral"; rm -f "$marker"; continue + fi + if ! safe_to_restart "$u"; then skipped=$((skipped+1)); continue; fi + + target="$(tr -d '[:space:]' <"$marker" 2>/dev/null)"; [ -n "$target" ] || target="$(ver)" + if [ "$DRY_RUN" = "1" ]; then LOG "DRY_RUN: would migrate $unit -> $target (idle gate satisfied)"; continue; fi + if ! 
backup_user "$u" >/dev/null; then + LOG "WARN: pre-restart backup failed for $u — skipping (fail closed)"; skipped=$((skipped+1)); continue + fi + if safe_restart_unit "$unit" "$u"; then + LOG "migrated $unit -> $target (idle restart)"; rm -f "$marker"; migrated=$((migrated+1)) + else + LOG "migrate FAILED for $unit — recovery+freeze handled by safe_restart_unit; stopping drain"; exit 1 + fi + done + LOG "idle-migrate pass complete (migrated=$migrated skipped=$skipped)" +} + +# main-guard: run only when executed, not when sourced (tests source this file). +if [ "${BASH_SOURCE[0]}" = "${0}" ]; then main "$@"; fi diff --git a/scripts/t3-migrate-idle.timer b/scripts/t3-migrate-idle.timer new file mode 100644 index 00000000..0c847fa6 --- /dev/null +++ b/scripts/t3-migrate-idle.timer @@ -0,0 +1,10 @@ +[Unit] +Description=Overnight drain of t3-autoupdate deferrals (idle-gated t3-serve migration) + +[Timer] +OnCalendar=*-*-* 01..05:00/20 +RandomizedDelaySec=120 +Persistent=false + +[Install] +WantedBy=timers.target diff --git a/scripts/t3-provision-users.sh b/scripts/t3-provision-users.sh index 33edf3fd..9cbc6c1e 100644 --- a/scripts/t3-provision-users.sh +++ b/scripts/t3-provision-users.sh @@ -20,29 +20,174 @@ MAP=/etc/ttyd-user-map DRY_RUN="${DRY_RUN:-0}" # Public infra repo for the locked clone (no auth; the monorepo has no remote). INFRA_REMOTE="${INFRA_REMOTE:-https://github.com/ViktorBarzin/infra.git}" +# Canonical push target for non-admin infra clones (AGENTS.md "Non-admin +# workstation users"), and the base URL for workspace-layout `repos` entries — +# those clone AS the user so their ~/.git-credentials PAT authenticates +# against private Forgejo repos. +FORGEJO_INFRA_REMOTE="${FORGEJO_INFRA_REMOTE:-https://forgejo.viktorbarzin.me/viktor/infra.git}" +REPO_REMOTE_BASE="${REPO_REMOTE_BASE:-https://forgejo.viktorbarzin.me/viktor}" # Per-user OIDC kubeconfig (kubelogin/PKCE; cluster server+CA copied from the admin kubeconfig). 
OIDC_ISSUER="${OIDC_ISSUER:-https://authentik.viktorbarzin.me/application/o/kubernetes/}" ADMIN_KUBECONFIG="${ADMIN_KUBECONFIG:-/home/wizard/.kube/config}" +# OS users (space-separated) that receive the vendored agent skills (scripts/workstation/claude-skills). +# Allowlist: install_skills no-ops for anyone not listed. Extend here to roll out to more users. +SKILL_USERS="${SKILL_USERS:-emo}" log() { echo "[t3-provision] $*"; } run() { if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] $*"; else "$@"; fi; } -# Per-non-admin writable, git-crypt-LOCKED infra clone at ~/code. Keyless + +# Per-non-admin writable, git-crypt-LOCKED infra clone at ~/<subpath>. Keyless + # filter=cat ⇒ code/docs are plaintext, git-crypt'd secret files stay ciphertext. # Writable + ungated (push != apply; applies are admin-only). NEVER touches an -# existing ~/code (so emo's symlink survives until the gated cutover). +# existing target (so emo's symlink survives until the gated cutover). subpath +# is "code" (single layout) or "code/infra" (workspace layout). 
install_locked_clone() { - local user="$1" home + local user="$1" sub="$2" home dst home="$(getent passwd "$user" | cut -d: -f6)" [[ -z "$home" ]] && return 0 - [[ -e "$home/code" || -L "$home/code" ]] && return 0 - if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] locked infra clone -> $user:$home/code"; return 0; fi - log "clone locked infra -> $user:~/code" - runuser -u "$user" -- git clone --quiet --no-checkout "$INFRA_REMOTE" "$home/code" - runuser -u "$user" -- git -C "$home/code" config filter.git-crypt.smudge cat - runuser -u "$user" -- git -C "$home/code" config filter.git-crypt.clean cat - runuser -u "$user" -- git -C "$home/code" config filter.git-crypt.required false - runuser -u "$user" -- git -C "$home/code" checkout --quiet master + dst="$home/$sub" + [[ -e "$dst" || -L "$dst" ]] && return 0 + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] locked infra clone -> $user:$dst"; return 0; fi + log "clone locked infra -> $user:~/$sub" + runuser -u "$user" -- git clone --quiet --no-checkout "$INFRA_REMOTE" "$dst" + runuser -u "$user" -- git -C "$dst" config filter.git-crypt.smudge cat + runuser -u "$user" -- git -C "$dst" config filter.git-crypt.clean cat + runuser -u "$user" -- git -C "$dst" config filter.git-crypt.required false + runuser -u "$user" -- git -C "$dst" checkout --quiet master +} + +# Keep an EXISTING non-admin clone fresh (the admin's tree is never touched): fetch +# all remotes, then fast-forward master only when that is provably safe — on master, +# clean tree, upstream configured. Never rebases/merges; a non-ff master (local +# commits) is the user's to reconcile and is only WARNed about. Fetch failures +# (offline, missing credentials) are non-fatal: freshness is best-effort. 
+refresh_user_clone() { + local user="$1" sub="$2" home dir + home="$(getent passwd "$user" | cut -d: -f6)" + dir="$home/$sub" + [[ -n "$home" && -d "$dir/.git" ]] || return 0 + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] refresh clone -> $user:$dir"; return 0; fi + runuser -u "$user" -- env GIT_TERMINAL_PROMPT=0 git -C "$dir" fetch --all --prune --quiet 2>/dev/null \ + || { log "WARN: fetch failed for $user:$sub (offline/credentials?) — skipped"; return 0; } + [[ "$(runuser -u "$user" -- git -C "$dir" symbolic-ref --short -q HEAD)" == master ]] || return 0 + [[ -z "$(runuser -u "$user" -- git -C "$dir" status --porcelain)" ]] || return 0 + runuser -u "$user" -- git -C "$dir" rev-parse --verify -q 'master@{upstream}' >/dev/null || return 0 + runuser -u "$user" -- git -C "$dir" merge --ff-only 'master@{upstream}' >/dev/null 2>&1 \ + || log "WARN: $user:$sub master not fast-forwardable (local commits?) — left as-is" +} + +# Non-admin infra clones are documented to carry a `forgejo` remote (the +# canonical push target) with master tracking forgejo/master — see AGENTS.md +# "Non-admin workstation users". Clones made before that contract only have +# the GitHub origin; wire the remote + upstream idempotently. Best-effort: an +# offline fetch leaves the upstream as-is. +wire_forgejo_remote() { + local user="$1" sub="$2" home dir + home="$(getent passwd "$user" | cut -d: -f6)" + dir="$home/$sub" + [[ -n "$home" && -d "$dir/.git" ]] || return 0 + if ! 
runuser -u "$user" -- git -C "$dir" remote get-url forgejo >/dev/null 2>&1; then + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] add forgejo remote -> $user:$sub"; return 0; fi + log "add forgejo remote -> $user:~/$sub" + runuser -u "$user" -- git -C "$dir" remote add forgejo "$FORGEJO_INFRA_REMOTE" + fi + [[ "$DRY_RUN" == 1 ]] && return 0 + [[ "$(runuser -u "$user" -- git -C "$dir" rev-parse --abbrev-ref -q 'master@{upstream}' 2>/dev/null)" == forgejo/master ]] && return 0 + runuser -u "$user" -- env GIT_TERMINAL_PROMPT=0 git -C "$dir" fetch --quiet forgejo 2>/dev/null \ + || { log "WARN: forgejo fetch failed for $user — upstream left as-is"; return 0; } + runuser -u "$user" -- git -C "$dir" branch --set-upstream-to=forgejo/master master >/dev/null 2>&1 \ + && log "set $user:~/$sub master upstream -> forgejo/master" \ + || log "WARN: could not set $user:~/$sub master upstream to forgejo/master" +} + +# Workspace layout: ~/code is a plain directory of per-project clones. A user +# still on the single layout (~/code IS the infra clone) is migrated by moving +# the whole clone — local branches, dirty files, untracked state all survive — +# to ~/code/infra. Running processes follow the moved inode, so live sessions +# keep working (their cwd lands inside ~/code/infra). +ensure_workspace_layout() { + local user="$1" home tmp + home="$(getent passwd "$user" | cut -d: -f6)" + [[ -z "$home" ]] && return 0 + if [[ -d "$home/code/.git" ]]; then + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] migrate $user:~/code (single clone) -> ~/code/infra"; return 0; fi + log "migrate $user: ~/code (single infra clone) -> ~/code/infra" + tmp="$home/.code-workspace-migrate.$$" + mv "$home/code" "$tmp" + install -d -o "$user" -g "$user" -m 0755 "$home/code" + mv "$tmp" "$home/code/infra" + elif [[ ! 
-e "$home/code" ]]; then + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] create workspace dir $user:~/code"; return 0; fi + install -d -o "$user" -g "$user" -m 0755 "$home/code" + fi +} + +# Single-layout clones often accumulated nested project clones (the old layout +# gave users nowhere else to put them — e.g. ancamilea's tripit inside ~/code). +# After migration such a clone would sit buried at ~/code/infra/<repo>; hoist a +# roster repo to its workspace home instead of stranding it + cloning fresh. +# Only untracked git dirs move — content the infra repo tracks is never touched. +hoist_nested_repo() { + local user="$1" repo="$2" home src dst + home="$(getent passwd "$user" | cut -d: -f6)" + [[ -z "$home" ]] && return 0 + src="$home/code/infra/$repo"; dst="$home/code/$repo" + [[ -d "$src/.git" && ! -e "$dst" ]] || return 0 + runuser -u "$user" -- git -C "$home/code/infra" ls-files --error-unmatch "$repo" >/dev/null 2>&1 && return 0 + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] hoist nested $repo -> $user:$dst"; return 0; fi + log "hoist nested $repo clone -> $user:~/code/$repo" + mv "$src" "$dst" +} + +# Extra per-project repos for workspace-layout users, cloned from Forgejo AS +# the user (their ~/.git-credentials PAT authenticates against private repos). +# A failed clone (no access yet, offline) is a WARN — the reconcile must never +# abort over a single repo; the next hourly run retries. +install_user_repo() { + local user="$1" repo="$2" home dst + home="$(getent passwd "$user" | cut -d: -f6)" + [[ -z "$home" ]] && return 0 + dst="$home/code/$repo" + [[ -e "$dst" || -L "$dst" ]] && return 0 + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] clone $REPO_REMOTE_BASE/$repo.git -> $user:$dst"; return 0; fi + log "clone $repo -> $user:~/code/$repo" + runuser -u "$user" -- env GIT_TERMINAL_PROMPT=0 git clone --quiet "$REPO_REMOTE_BASE/$repo.git" "$dst" 2>/dev/null \ + || log "WARN: clone of $repo failed for $user (access/offline?) 
— skipped" +} + +# Machine-wide Claude managed config: the repo file (in the admin tree, like the +# roster) is the authoring surface; deploying it here means a plain infra commit +# propagates claudeMd/model edits to /etc — and thus every user's NEXT session — +# within one reconcile cycle. No manual install step. +sync_managed_config() { + local src="$WORKSTATION_DIR/managed-settings.json" dst=/etc/claude-code/managed-settings.json + [[ -r "$src" ]] || return 0 + python3 -c "import json,sys; json.load(open(sys.argv[1]))" "$src" 2>/dev/null \ + || { log "WARN: $src is invalid JSON — managed-config sync skipped"; return 0; } + cmp -s "$src" "$dst" 2>/dev/null && return 0 + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] managed-settings.json -> $dst"; return 0; fi + install -D -m 0644 "$src" "$dst" + log "deployed managed-settings.json -> /etc/claude-code (repo copy changed)" +} + +# ~/.codex/AGENTS.md is a STATIC mirror of the managed claudeMd (codex has no +# machine-wide managed layer). Regenerate stale mirrors so codex sessions inherit +# claudeMd edits the same way Claude sessions do. Never clobbers a user-customized +# file: only touches files carrying the mirror header (or creates absent ones). +refresh_codex_mirror() { + local user="$1" home dst tmp + home="$(getent passwd "$user" | cut -d: -f6)" + dst="$home/.codex/AGENTS.md" + [[ -n "$home" && -d "$home/.codex" ]] || return 0 + if [[ -f "$dst" ]] && ! 
head -1 "$dst" | grep -q '^# Codex global instructions (devvm)'; then return 0; fi + tmp="$(mktemp)" + { printf '# Codex global instructions (devvm)\n\n_Mirrors the machine-wide Claude managed policy._\n\n---\n\n' + python3 -c 'import json; print(json.load(open("/etc/claude-code/managed-settings.json"))["claudeMd"])' + } > "$tmp" 2>/dev/null || { rm -f "$tmp"; return 0; } + if cmp -s "$tmp" "$dst" 2>/dev/null; then rm -f "$tmp"; return 0; fi + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] codex AGENTS.md mirror -> $user"; rm -f "$tmp"; return 0; fi + install -o "$user" -g "$user" -m 0644 "$tmp" "$dst"; rm -f "$tmp" + log "refreshed codex AGENTS.md mirror -> $user" } # Per-user OIDC kubeconfig (kubelogin/PKCE — the `kubernetes` Authentik client is @@ -95,18 +240,297 @@ EOF log "wrote OIDC kubeconfig -> $user:~/.kube/config" } +# Idempotently set KEY=VALUE in a t3-serve env file, PRESERVING other lines — so writing +# T3_PORT never clobbers an injected CLAUDE_CODE_OAUTH_TOKEN, and vice-versa. Mode 0600. +env_set() { + local file="$1" key="$2" val="$3" + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] set $key -> $file"; return 0; fi + install -d -m 0755 "$(dirname "$file")" + if [[ -f "$file" ]] && grep -q "^${key}=" "$file"; then + grep -qx "${key}=${val}" "$file" || sed -i "s|^${key}=.*|${key}=${val}|" "$file" + else + printf '%s=%s\n' "$key" "$val" >> "$file" + fi + chmod 600 "$file" +} + +env_unset() { + local file="$1" key="$2" + [[ -f "$file" ]] || return 0 + grep -q "^${key}=" "$file" || return 0 + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] unset $key -> $file"; return 0; fi + sed -i "/^${key}=.*/d" "$file" + chmod 600 "$file" + log "removed legacy shared $key -> $(basename "$file")" +} + +# Install one user's isolated Claude credential renewal flow. The scoped periodic +# Vault token is minted only when this reconcile has admin Vault access (normal +# onboarding/deployment); routine token renewal is performed by the user service. 
+install_claude_auth_sync() { + local user="$1" home cfg token_file token policy + home="$(getent passwd "$user" | cut -d: -f6)" + [[ -z "$home" ]] && return 0 + cfg="$home/.config/claude-auth-sync" + token_file="$cfg/vault-token" + policy="workstation-claude-$user" + + # The service sandbox makes the rest of $HOME read-only. Pre-create every + # writable path before systemd enters that sandbox; ReadWritePaths cannot + # create a missing child beneath a read-only parent. + if [[ "$DRY_RUN" == 1 ]]; then + echo "[dry-run] ensure Claude-auth state dirs -> $user" + else + install -d -o "$user" -g "$user" -m 0700 "$cfg" "$home/.local/state/claude-auth-sync" + fi + + if [[ ! -s "$token_file" ]]; then + if [[ "$DRY_RUN" == 1 ]]; then + echo "[dry-run] mint scoped Claude-auth Vault token -> $user" + elif vault token lookup >/dev/null 2>&1 && \ + token="$(vault token create -orphan -period=768h -policy="$policy" \ + -display-name="devvm-claude-auth-$user" -field=token 2>/dev/null)"; then + install -d -o "$user" -g "$user" -m 0700 "$cfg" + install -o "$user" -g "$user" -m 0600 /dev/stdin "$token_file" <<<"$token" + log "minted isolated Claude-auth Vault token -> $user" + else + log "WARN: scoped Claude-auth Vault token missing for $user (run provisioner with admin VAULT_TOKEN after vault stack apply)" + fi + fi + run systemctl enable --now "claude-auth-sync@$user.timer" >/dev/null 2>&1 || true +} + +# Re-deploy the managed per-user Claude launcher to ~/start-claude.sh. /etc/skel only +# seeds it at account creation (setup-devvm.sh), so without this a launcher edit never +# reaches EXISTING users — they keep running a stale copy. Copy-if-changed from the repo's +# skel/, owned by the user, 0755. (We deliberately do NOT re-copy .tmux.conf: terminal-lobby +# appends a managed persistence section to each user's ~/.tmux.conf that a re-copy would clobber.) 
+deploy_user_launcher() { + local user="$1" home src dst + src="$WORKSTATION_DIR/skel/start-claude.sh" + home="$(getent passwd "$user" | cut -d: -f6)" + [[ -n "$home" && -d "$home" && -f "$src" ]] || return 0 + dst="$home/start-claude.sh" + cmp -s "$src" "$dst" 2>/dev/null && return 0 # already current -> no churn + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] deploy launcher -> $dst"; return 0; fi + install -m 0755 "$src" "$dst" + chown "$user:$user" "$dst" + log "deployed start-claude.sh -> $user" +} + +# Ensure the per-user NATIVE claude install (the recommended runtime: ~user/.local/bin/claude, +# self-updating) — used by BOTH the terminal launcher AND the user's t3-serve instance. We do +# NOT npm-install claude system-wide (npm/npx isn't the recommended runtime); each user gets +# their own native install. Idempotent: skip if already present. Runs the official native +# installer AS the user (into their ~/.local). Best-effort: a failure WARNs and retries next +# reconcile (start-claude.sh also self-bootstraps the terminal path). +install_user_claude_native() { + local user="$1" home + home="$(getent passwd "$user" | cut -d: -f6)" + [[ -n "$home" && -d "$home" ]] || return 0 + [[ -x "$home/.local/bin/claude" ]] && return 0 # already native -> done + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] native claude install -> $user"; return 0; fi + if runuser -u "$user" -- bash -lc 'curl -fsSL https://claude.ai/install.sh | bash' >/dev/null 2>&1; then + log "installed native claude -> $user" + else + log "WARN: native claude install failed for $user (retries next reconcile)" + fi +} + +# Per-user playwright-mcp browser MCP — ALL tiers incl. admin (every user's Claude +# sessions connect to their OWN isolated server; a user's concurrent sessions are +# kept apart by the unit's --isolated). 
Idempotent + if-absent, so a routine +# reconcile never disturbs a live user: (1) seed the chrome-service snapshot token +# if the user has none; (2) wire the user-scope `playwright` MCP entry by running +# `claude mcp add` AS the user (writes THEIR ~/.claude.json, never reads another's; +# the CLI merges one key and REFUSES to clobber an existing one, so it's safe on a +# populated config), guarded by `claude mcp get`; (3) `enable --now` the system +# template instances (idempotent — does NOT restart an already-running server). +# Needs PLAYWRIGHT_PORT already in the per-user playwright env (written by the +# section-5c loop) + the token staged by setup-devvm.sh (section 8c). +install_playwright() { + local user="$1" home port token_staged=/etc/t3-serve/chrome-service-token + home="$(getent passwd "$user" | cut -d: -f6)" + [[ -n "$home" && -d "$home" ]] || return 0 + port="$(grep -oE 'PLAYWRIGHT_PORT=[0-9]+' "$ENVDIR/playwright-$user.env" 2>/dev/null | cut -d= -f2 || true)" + [[ -n "$port" ]] || { log "WARN: no PLAYWRIGHT_PORT for $user -> skip playwright"; return 0; } + + # (1) chrome-service snapshot token, if-absent (0600, owned by the user) + if [[ ! -f "$home/.config/playwright/token" && -r "$token_staged" ]]; then + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] seed playwright token -> $user"; else + install -d -o "$user" -g "$user" -m 0700 "$home/.config/playwright" + install -o "$user" -g "$user" -m 0600 "$token_staged" "$home/.config/playwright/token" + log "seeded playwright snapshot token -> $user" + fi + fi + + # (2) wire user-scope ~/.claude.json (AS the user, login shell so the native + # ~/.local/bin/claude is on PATH; clobber-proof + if-absent via `mcp get`) + if [[ "$DRY_RUN" == 1 ]]; then + echo "[dry-run] wire playwright MCP (:$port) if-absent -> $user" + elif runuser -u "$user" -- bash -lc 'command -v claude >/dev/null 2>&1'; then + if ! 
runuser -u "$user" -- bash -lc 'claude mcp get playwright >/dev/null 2>&1'; then + runuser -u "$user" -- bash -lc "claude mcp add --scope user --transport http playwright 'http://localhost:$port/mcp' >/dev/null 2>&1" \ + && log "wired playwright MCP (user scope, :$port) -> $user" \ + || log "WARN: claude mcp add playwright failed for $user (retries next run)" + fi + else + log "WARN: claude not found for $user -> playwright MCP not wired (retries next run)" + fi + + # (3) enable the system template instances. `enable --now` is idempotent and + # does NOT restart a running unit, so a live user is undisturbed. + run systemctl enable --now "playwright-mcp@$user.service" >/dev/null 2>&1 || true + run systemctl enable --now "playwright-snapshot-refresh@$user.timer" >/dev/null 2>&1 || true +} + +# Per-user homelab-memory setup — migrate off the claude-memory MCP/plugin to the +# homelab CLI hooks (auto-recall + auto-learn + compaction backup/recovery). +# Idempotent, if-absent, ADDITIVE: never clobbers `env` (the per-user +# MEMORY_API_KEY) or other MCP servers; removes ONLY the `claude_memory` MCP. +# Reuses the user's existing key — does NOT mint one (per-user isolation stays +# deferred, design 2026-06-08). The homelab CLI (/usr/local/bin/homelab) hits the +# same remote HTTP API the MCP used. Hook scripts: $WORKSTATION_DIR/claude-hooks. 
+install_memory() { + local user="$1" home + home="$(getent passwd "$user" | cut -d: -f6)" + [[ -n "$home" && -d "$home" ]] || return 0 + local src="$WORKSTATION_DIR/claude-hooks" hooks_dst="$home/.claude/hooks" settings="$home/.claude/settings.json" + [[ -d "$src" ]] || { log "WARN: $src missing -> skip memory setup for $user"; return 0; } + + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] memory: hooks + settings wire + claude_memory MCP removal -> $user"; return 0; fi + + # (1) (re)install the 4 hook scripts, owned by the user (refreshed each reconcile so fixes land) + install -d -o "$user" -g "$user" -m 0755 "$hooks_dst" + local h + for h in homelab-memory-recall.py auto-learn.py pre-compact-backup.sh post-compact-recovery.sh; do + install -o "$user" -g "$user" -m 0755 "$src/$h" "$hooks_dst/$h" + done + + # (2) wire the hooks in settings.json, if-absent + additive. Run the helper as ROOT: + # it must read $src under the admin's hardened home (mode 700), which a + # runuser-as-$user CANNOT traverse — so chown the result back to the user and + # enforce 0600 (it holds the per-user MEMORY_API_KEY). + if python3 "$src/wire-memory-hooks.py" "$home" >/dev/null 2>&1; then + [[ -f "$settings" ]] && chown "$user:$user" "$settings" 2>/dev/null || true + log "memory hooks wired -> $user" + else + log "WARN: memory hook wiring failed for $user (retries next reconcile)" + fi + [[ -f "$settings" ]] && chmod 600 "$settings" || true + + # (2b) reuse the user's existing key; warn (do NOT mint — needs an admin vault write) if absent. + if [[ -f "$settings" ]] && ! grep -q 'MEMORY_API_KEY' "$settings"; then + log "WARN: $user has no MEMORY_API_KEY in settings.json — homelab memory no-ops until an admin mints one" + fi + + # (3) remove the now-superseded claude_memory MCP (AS the user, if-present) + the plugin dir. 
+ if runuser -u "$user" -- bash -lc 'command -v claude >/dev/null 2>&1 && claude mcp get claude_memory >/dev/null 2>&1'; then + runuser -u "$user" -- bash -lc 'claude mcp remove claude_memory >/dev/null 2>&1' && log "removed claude_memory MCP -> $user" || true + fi + if [[ -d "$home/.claude/plugins/claude-memory" ]]; then + rm -rf "$home/.claude/plugins/claude-memory" && log "removed claude-memory plugin dir -> $user" + fi + return 0 # best-effort tail must never return non-zero, else set -euo pipefail aborts the whole reconcile +} + +# Per-user agent skills, vendored from the in-repo snapshot ($WORKSTATION_DIR/claude-skills) — the +# `npx skills` upstream drifted off this exact set, so we reproduce it offline + deterministically. +# if-absent + ADDITIVE: copies a skill dir into ~/.agents/skills/<name> (owned by the user) and +# symlinks ~/.claude/skills/<name> -> ../../.agents/skills/<name> (the layout `skills add -g` +# produces; Claude Code reads ~/.claude/skills/). Scoped to SKILL_USERS. if-absent keys on the +# user's OWN copy, so it heals a stale/cross-user ~/.claude/skills symlink but never clobbers a real +# skill dir. Best-effort tail: must return 0 or set -euo pipefail aborts the whole reconcile. 
+install_skills() { + local user="$1" home + home="$(getent passwd "$user" | cut -d: -f6)" + [[ -n "$home" && -d "$home" ]] || return 0 + case " $SKILL_USERS " in *" $user "*) ;; *) return 0 ;; esac + local src_root="$WORKSTATION_DIR/claude-skills" + [[ -d "$src_root" ]] || { log "WARN: $src_root missing -> skip skills for $user"; return 0; } + + if [[ "$DRY_RUN" == 1 ]]; then + local d names="" + for d in "$src_root"/*/; do [[ -d "$d" ]] && names+="$(basename "$d") "; done + echo "[dry-run] vendor skills if-absent -> $user: ${names}" + return 0 + fi + + local agents_dir="$home/.agents/skills" claude_dir="$home/.claude/skills" + # own the parent ~/.agents too (install -d leaves created intermediates root-owned) + install -d -o "$user" -g "$user" -m 0755 "$home/.agents" "$agents_dir" "$claude_dir" + chown "$user:$user" "$home/.agents" || true + + local skill name dst link n=0 + for skill in "$src_root"/*/; do + [[ -d "$skill" ]] || continue + name="$(basename "$skill")" + dst="$agents_dir/$name" + link="$claude_dir/$name" + # if-absent keys on the user's OWN copy (a real dir under ~/.agents/skills), NOT on any + # pre-existing ~/.claude/skills entry — so a stale or cross-user symlink gets healed. + if [[ ! -d "$dst" ]]; then + cp -a "$src_root/$name" "$dst" || { log "WARN: copy skill $name -> $user failed"; continue; } + chown -R "$user:$user" "$dst" || true + n=$((n+1)) + fi + # point ~/.claude/skills/<name> at the user's own copy (replacing a stale/cross-user symlink); + # never clobber a real dir/file squatting that name. + if [[ -d "$link" && ! 
-L "$link" ]]; then + log "WARN: $claude_dir/$name is a real dir (left as-is) for $user" + elif [[ "$(readlink "$link" 2>/dev/null)" != "../../.agents/skills/$name" ]]; then + ln -sfn "../../.agents/skills/$name" "$link" && chown -h "$user:$user" "$link" || log "WARN: link skill $name -> $user failed" + fi + done + if [[ "$n" -gt 0 ]]; then log "vendored/healed $n skill(s) -> $user"; fi + return 0 # best-effort tail must never return non-zero, else set -euo pipefail aborts the reconcile +} + [[ $EUID -eq 0 ]] || { echo "t3-provision-users: must run as root" >&2; exit 1; } for bin in python3 jq; do command -v "$bin" >/dev/null || { echo "missing $bin" >&2; exit 1; }; done [[ -f "$ROSTER" && -f "$ENGINE" ]] || { echo "roster/engine not under $WORKSTATION_DIR" >&2; exit 1; } + +# 0) self-deploy: the repo is the authoring surface (like sync_managed_config / +# deploy_user_launcher below). Nothing else redeploys /usr/local/bin (only the +# manual setup-devvm.sh did) — so a committed edit silently never reached the +# hourly run until now (the homelab-memory rollout sat undeployed for a day). +# If the repo copy differs, install it and re-exec the fresh binary. Guarded: +# re-exec flag (no loop), bash -n (never deploy a broken script), DRY_RUN (no +# mutation), cmp (no churn when unchanged). +SELF_SRC="$WORKSTATION_DIR/../t3-provision-users.sh" +SELF_DST=/usr/local/bin/t3-provision-users +if [[ -z "${T3_PROVISION_SELF_DEPLOYED:-}" && -r "$SELF_SRC" ]] && ! 
cmp -s "$SELF_SRC" "$SELF_DST"; then + if [[ "$DRY_RUN" == 1 ]]; then + echo "[dry-run] self-deploy $SELF_DST from repo (changed)" + elif bash -n "$SELF_SRC" 2>/dev/null; then + install -m 0755 "$SELF_SRC" "$SELF_DST" + log "self-deployed $SELF_DST from repo (changed) — re-exec" + exec env T3_PROVISION_SELF_DEPLOYED=1 "$SELF_DST" "$@" + else + log "WARN: repo t3-provision-users.sh fails 'bash -n' — keeping deployed copy" + fi +fi + install -d -m 0755 "$ENVDIR" # 1) current sticky ports from existing .env files -> {os_user: port} -ports_file="$(mktemp)"; trap 'rm -f "$ports_file" "${desired_file:-}"' EXIT +ports_file="$(mktemp)"; pw_ports_file="$(mktemp)" +trap 'rm -f "$ports_file" "$pw_ports_file" "${desired_file:-}"' EXIT { echo "{}"; for f in "$ENVDIR"/*.env; do [[ -e "$f" ]] || continue - u="$(basename "$f" .env)"; p="$(grep -oE 'T3_PORT=[0-9]+' "$f" | cut -d= -f2)" + case "$(basename "$f")" in playwright-*) continue;; esac # not a t3-serve env (handled below) + # `|| true`: grep returns non-zero on no-match, which would abort under `set -e -o pipefail`. + u="$(basename "$f" .env)"; p="$(grep -oE 'T3_PORT=[0-9]+' "$f" | cut -d= -f2 || true)" [[ -n "$p" ]] && jq -n --arg u "$u" --argjson p "$p" '{($u): $p}' done; } | jq -s 'add' > "$ports_file" +# sticky PLAYWRIGHT ports from playwright-<os_user>.env (skipped by the loop above). +# Seeds roster_engine so the live per-user assignments stick across reconciles. 
+{ echo "{}"; for f in "$ENVDIR"/playwright-*.env; do + [[ -e "$f" ]] || continue + u="$(basename "$f" .env)"; u="${u#playwright-}" + p="$(grep -oE 'PLAYWRIGHT_PORT=[0-9]+' "$f" | cut -d= -f2 || true)" + [[ -n "$p" ]] && jq -n --arg u "$u" --argjson p "$p" '{($u): $p}' + done; } | jq -s 'add' > "$pw_ports_file" # 2) tier validation vs live k8s_users (best-effort; aborts only on a real conflict) if command -v vault >/dev/null; then @@ -124,11 +548,18 @@ fi # 3) derive desired state desired_file="$(mktemp)" -python3 "$ENGINE" derive --roster "$ROSTER" --ports-json "$ports_file" > "$desired_file" +python3 "$ENGINE" derive --roster "$ROSTER" --ports-json "$ports_file" --playwright-ports-json "$pw_ports_file" > "$desired_file" jq -e . "$desired_file" >/dev/null || { echo "[t3-provision] derive produced invalid JSON" >&2; exit 1; } +# 3b) machine-wide Claude managed config (repo -> /etc; per-user codex mirrors in the loop below) +sync_managed_config + # 4) per-account: create-if-absent + ADDITIVE tier groups (never strip) + locked clone -while IFS=$'\t' read -r os_user tier shell groups_csv; do +# NB: empty @tsv fields collapse under tab-IFS read (tab is IFS whitespace), so +# the jq below emits "-" for empty groups/repos and we map it back here. +while IFS=$'\t' read -r os_user tier shell groups_csv code_layout repos_csv; do + [[ "$groups_csv" == "-" ]] && groups_csv="" + [[ "$repos_csv" == "-" ]] && repos_csv="" if ! 
id "$os_user" >/dev/null 2>&1; then log "create account: $os_user (shell $shell)" run useradd -m -s "$shell" "$os_user" @@ -144,21 +575,80 @@ while IFS=$'\t' read -r os_user tier shell groups_csv; do log "add $os_user -> group $g"; run gpasswd -a "$os_user" "$g" >/dev/null done fi - if [[ "$tier" != admin ]]; then # non-admins: locked ~/code clone + OIDC kubeconfig - install_locked_clone "$os_user" + if [[ "$tier" != admin ]]; then # non-admins: locked clone(s) (kept fresh) + kubeconfig + if [[ "$code_layout" == workspace ]]; then + ensure_workspace_layout "$os_user" + install_locked_clone "$os_user" code/infra + wire_forgejo_remote "$os_user" code/infra # before refresh: ff targets the canonical upstream same-pass + refresh_user_clone "$os_user" code/infra + IFS=',' read -ra extra_repos <<< "$repos_csv" + for repo in "${extra_repos[@]}"; do + [[ -n "$repo" ]] || continue + hoist_nested_repo "$os_user" "$repo" + install_user_repo "$os_user" "$repo" + refresh_user_clone "$os_user" "code/$repo" + done + else + install_locked_clone "$os_user" code + wire_forgejo_remote "$os_user" code # before refresh: ff targets the canonical upstream same-pass + refresh_user_clone "$os_user" code + fi install_user_kubeconfig "$os_user" + deploy_user_launcher "$os_user" # keep ~/start-claude.sh current (skel only seeds new accounts) fi -done < <(jq -r '.accounts[] | [.os_user, .tier, .shell, (.groups|join(","))] | @tsv' "$desired_file") + refresh_codex_mirror "$os_user" # all tiers — mirror of the managed claudeMd + install_user_claude_native "$os_user" # all tiers — per-user native claude (terminal + t3); no npm/npx + install_claude_auth_sync "$os_user" # all tiers — own Claude identity + isolated Vault recovery +done < <(jq -r '.accounts[] | [.os_user, .tier, .shell, (if (.groups|length)==0 then "-" else (.groups|join(",")) end), .code_layout, (if (.repos|length)==0 then "-" else (.repos|join(",")) end)] | @tsv' "$desired_file") # 5) per-user .env (sticky port) + enable t3-serve@ 
while IFS=$'\t' read -r os_user port; do envf="$ENVDIR/$os_user.env" - if [[ ! -f "$envf" ]] || ! grep -qx "T3_PORT=$port" "$envf"; then - run bash -c "printf 'T3_PORT=%s\n' '$port' > '$envf'" - fi + env_set "$envf" T3_PORT "$port" + # Per-user Enterprise login is authoritative. A legacy shared setup-token has + # higher credential precedence and would silently defeat user isolation. + env_unset "$envf" CLAUDE_CODE_OAUTH_TOKEN id "$os_user" >/dev/null 2>&1 && run systemctl enable --now "t3-serve@$os_user.service" >/dev/null 2>&1 || true done < <(jq -r '.ports | to_entries[] | [.key, .value] | @tsv' "$desired_file") +# 5c) per-user playwright-mcp (ALL tiers incl. admin): write the sticky +# PLAYWRIGHT_PORT to the per-user playwright env, then seed token + wire +# ~/.claude.json + enable the system template instances. if-absent / +# idempotent — never disturbs a live user's running server or existing config. +while IFS=$'\t' read -r os_user pw_port; do + id "$os_user" >/dev/null 2>&1 || continue + env_set "$ENVDIR/playwright-$os_user.env" PLAYWRIGHT_PORT "$pw_port" + install_playwright "$os_user" +done < <(jq -r '.playwright_ports | to_entries[] | [.key, .value] | @tsv' "$desired_file") + +# 5d) per-user homelab-memory (ALL users): replace the claude-memory MCP/plugin with the +# homelab CLI memory hooks. Idempotent + additive + if-absent; never touches the +# per-user MEMORY_API_KEY or other MCP servers (removes ONLY claude_memory). +while IFS=$'\t' read -r os_user; do + id "$os_user" >/dev/null 2>&1 || continue + install_memory "$os_user" +done < <(jq -r '.accounts[].os_user' "$desired_file") + +# 5e) per-user agent skills (SKILL_USERS allowlist only): vendored snapshot -> ~/.agents/skills +# + ~/.claude/skills symlinks. if-absent + additive; best-effort (never aborts the reconcile). 
+while IFS=$'\t' read -r os_user; do
+  id "$os_user" >/dev/null 2>&1 || continue
+  install_skills "$os_user"
+done < <(jq -r '.accounts[].os_user' "$desired_file")
+
+# 5b) machine-wide (once, not per-user): keep the t3 gated nightly TRACKER timer enabled (it
+# follows t3@nightly daily, gated; see t3-autoupdate.sh / docs/runbooks/t3-version-bump.md).
+# NEVER --now: the tracker installs a NEW build + migrates DBs + restarts serves, so firing
+# a missed run mid-day with users active is exactly the 2026-06-09 shape. `enable` (no --now)
+# just arms the 04:00 schedule (the timer also dropped Persistent=true so a boot can't fire a
+# missed bump). Fresh boxes get t3 from setup-devvm.sh's nightly install, not here.
+run systemctl enable t3-autoupdate.timer >/dev/null 2>&1 || true
+# tmux session persistence: periodic snapshot + boot-time restore (reboot
+# survival for users' named claude sessions). Safe to --now: save is a
+# read-only snapshot; restore is per-session idempotent.
+run systemctl enable --now tmux-persist-save.timer >/dev/null 2>&1 || true
+run systemctl enable tmux-persist-restore.service >/dev/null 2>&1 || true
+
 # 6) regenerate /etc/ttyd-user-map + dispatch.json from the desired state (SSoT:
 # a roster entry removed here DISAPPEARS, which is what the offboarding cut relies on)
 if [[ "$DRY_RUN" == 1 ]]; then
diff --git a/scripts/t3-safe-restart.sh b/scripts/t3-safe-restart.sh
new file mode 100644
index 00000000..63a6c455
--- /dev/null
+++ b/scripts/t3-safe-restart.sh
@@ -0,0 +1,96 @@
+#!/usr/bin/env bash
+# t3-safe-restart.sh — SOURCED library (not executed). Shared by t3-autoupdate.sh
+# (daily gated tracker) and t3-migrate-idle.sh (overnight deferral drainer).
+#
+# Holds the per-unit "dangerous" routine — backup -> restart -> verify pairing ->
+# recover (restore DB + roll global binary back to last-good + freeze) — extracted
+# verbatim from t3-autoupdate.sh step 6, plus the small helpers it depends on.
+# The only change from the inline original: safe_restart_unit RETURNS non-zero on
+# failure (after performing recovery+freeze) instead of `exit 1`, so the CALLER
+# decides what to do (the daily job exits; the idle job stops draining).
+#
+# Callers must set, before calling safe_restart_unit: $target (version being moved
+# TO, for log lines + the prebump filename) and $last_good (rollback target).
+# Set $LOG_TAG before sourcing to tag syslog ("t3-autoupdate" / "t3-migrate-idle").
+
+# ---- shared config defaults (honour the original T3_* override names) -----------
+: "${LOG_TAG:=t3-safe-restart}"
+: "${FREEZE_FILE:=${T3_FREEZE_FILE:-/etc/t3-autoupdate.freeze}}"
+: "${STATE_DIR:=${T3_STATE_DIR:-/var/lib/t3-autoupdate}}"
+: "${LAST_GOOD_FILE:=$STATE_DIR/last-good}"
+: "${DEFER_DIR:=$STATE_DIR/deferred}"
+: "${BACKUP_DIR:=${T3_BACKUP_DEST:-/var/backups/t3-state}}"
+: "${DISPATCH:=${T3_DISPATCH:-127.0.0.1:3780}}"
+: "${USER_MAP:=${T3_USER_MAP:-/etc/ttyd-user-map}}"
+: "${T3_BACKUP_TIMEOUT:=900}"
+
+LOG() { logger -t "$LOG_TAG" "$*"; echo "$LOG_TAG: $*"; }
+ver() { t3 --version 2>/dev/null | awk '{print $NF}' | sed 's/^v//'; }
+# OS users owning a ~/.t3 (RHS of each non-comment "authentik=os_user" map line).
+osusers() { awk -F= '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$2);print $2}' "$USER_MAP" 2>/dev/null | sort -u; }
+# authentik username for an OS user (reverse map; first match) — for dispatch verify.
+ak_for() { awk -F= -v u="$1" '!/^[[:space:]]*#/&&NF==2{gsub(/[[:space:]]/,"",$1);gsub(/[[:space:]]/,"",$2);if($2==u){print $1;exit}}' "$USER_MAP" 2>/dev/null; }
+
+# Online consistent snapshot of ONE user's state.sqlite (run AS the owner so the
+# WAL stays owned; never stops the serve). Uses global $target for the filename.
+# Echoes the backup path on success; non-zero on failure.
+backup_user() {
+  local u="$1" src out dst ts
+  src="/home/$u/.t3/userdata/state.sqlite"; [ -f "$src" ] || return 1
+  ts="$(date +%Y%m%d-%H%M%S)"
+  out="$BACKUP_DIR/$u"; dst="$out/state-prebump-$target-$ts.sqlite"
+  install -d -o "$u" -g "$u" -m700 "$out" 2>/dev/null || mkdir -p "$out"
+  if runuser -u "$u" -- timeout "$T3_BACKUP_TIMEOUT" sqlite3 "$src" "VACUUM INTO '$dst'" 2>/dev/null && [ -s "$dst" ]; then
+    printf '%s\n' "$dst"; return 0
+  fi
+  rm -f "$dst"; return 1
+}
+
+# newest pre-bump backup for a user taken for the current $target (restore source).
+prebump_of() { ls -1t "$BACKUP_DIR/$1/state-prebump-$target-"*.sqlite 2>/dev/null | head -1; }
+
+# roll the GLOBAL binary back to last-good. In the idle path last_good==installed,
+# so this is a harmless no-op reinstall (does NOT downgrade other users).
+rollback_binary() {
+  LOG "rolling back binary $target -> $last_good"
+  if npm i -g "t3@$last_good" >/dev/null 2>&1; then LOG "rolled back to $last_good"; return 0; fi
+  LOG "ROLLBACK FAILED — could not reinstall t3@$last_good (t3 may be broken; manual fix per runbook)"; return 1
+}
+
+# verify a user's pairing through the REAL dispatch (mint -> exchange -> cookie).
+verify_pairing() {
+  local u="$1" ak out; ak="$(ak_for "$u")"; [ -n "$ak" ] || { LOG "no authentik mapping for $u — skipping dispatch verify"; return 0; }
+  out="$(curl -s -i --max-time 10 -H "X-authentik-username: $ak" -H 'Sec-Fetch-Dest: document' "http://$DISPATCH/" 2>/dev/null)"
+  printf '%s' "$out" | grep -qi '^set-cookie:[[:space:]]*t3_session='
+}
+
+# safe_restart_unit <unit> <user>: restart the unit, verify pairing; on failure
+# restore the user's DB from its pre-restart backup, roll the binary back, freeze.
+# Assumes a pre-restart backup already exists for <user> at the current $target
+# (the daily job's backup_all, or the idle job's backup_user, takes it first).
+# Returns 0 on verified success, non-zero after recovery+freeze on failure.
+safe_restart_unit() {
+  local unit="$1" u="$2" ok=0 _ bak
+  systemctl restart "$unit" || LOG "WARN: systemctl restart $unit returned non-zero"
+  for _ in $(seq 1 15); do
+    if verify_pairing "$u"; then ok=1; break; fi
+    sleep 2
+  done
+  if [ "$ok" = "1" ]; then
+    LOG "restarted $unit -> $target (pairing verified via dispatch)"; return 0
+  fi
+  LOG "HEALTH-CHECK FAILED: $u pairing broken AFTER restart onto $target — rolling back + restoring its DB"
+  rollback_binary
+  bak="$(prebump_of "$u")"
+  if [ -n "$bak" ]; then
+    systemctl stop "$unit" 2>/dev/null
+    if install -o "$u" -g "$u" -m600 "$bak" "/home/$u/.t3/userdata/state.sqlite" 2>/dev/null; then
+      rm -f "/home/$u/.t3/userdata/state.sqlite-wal" "/home/$u/.t3/userdata/state.sqlite-shm"
+      LOG "restored $u state.sqlite from $bak"
+    fi
+    systemctl start "$unit" 2>/dev/null
+  fi
+  touch "$FREEZE_FILE" 2>/dev/null
+  LOG "FROZEN ($FREEZE_FILE) after $u failed on $target; last_good stays $last_good — investigate, then remove the freeze file to resume"
+  return 1
+}
diff --git a/scripts/t3-serve@.service b/scripts/t3-serve@.service
index 9cdb4ad9..4109b36b 100644
--- a/scripts/t3-serve@.service
+++ b/scripts/t3-serve@.service
@@ -15,6 +15,17 @@ WorkingDirectory=/home/%i
 ExecStart=/usr/bin/t3 serve --host 0.0.0.0 --port ${T3_PORT} --base-dir /home/%i/.t3
 Restart=on-failure
 RestartSec=5
+# Memory containment (2026-06-10): agent children live in this cgroup; a
+# runaway agent (10.8G anon on a 23G host) swap-thrashed the whole devvm —
+# every >20s stall fires the t3 client watchdog (visible "disconnects") —
+# then global-OOMed. Cap the cgroup so a runaway OOMs early and locally,
+# and forbid swap so stalls can't smear into minutes-long freezes.
+MemoryHigh=12G
+MemoryMax=16G
+MemorySwapMax=0
+# Default OOMPolicy=stop kills the WHOLE unit (8.5min outage 2026-06-10
+# 19:56) when ANY child is OOM-killed; continue = runaway dies, server stays.
+OOMPolicy=continue [Install] WantedBy=multi-user.target diff --git a/scripts/test-claude-auth-sync.sh b/scripts/test-claude-auth-sync.sh new file mode 100755 index 00000000..10f07746 --- /dev/null +++ b/scripts/test-claude-auth-sync.sh @@ -0,0 +1,32 @@ +#!/usr/bin/env bash +set -uo pipefail +DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +# shellcheck source=workstation/claude-auth-sync.sh +source "$DIR/workstation/claude-auth-sync.sh" + +pass=0 fail=0 +ok() { if "${@:2}"; then pass=$((pass+1)); else fail=$((fail+1)); echo "FAIL: $1"; fi; } +no() { if "${@:2}"; then fail=$((fail+1)); echo "FAIL: $1"; else pass=$((pass+1)); fi; } +eq() { if [[ "$2" == "$3" ]]; then pass=$((pass+1)); else fail=$((fail+1)); echo "FAIL: $1"; fi; } + +tmp="$(mktemp -d)"; trap 'rm -rf "$tmp"' EXIT +valid='{"mcpOAuth":{"server":{"accessToken":"mcp-secret"}},"claudeAiOauth":{"accessToken":"access","refreshToken":"refresh","expiresAt":123,"scopes":["user:inference"]}}' +printf '%s\n' "$valid" > "$tmp/credentials.json" + +oauth="$(cas_oauth_from_credentials "$tmp/credentials.json")" +eq "extract OAuth object" 'access' "$(jq -r .accessToken <<<"$oauth")" +printf '{"claudeAiOauth":{"accessToken":"access","expiresAt":123}}\n' > "$tmp/bad.json" +no "reject missing refresh token" cas_oauth_from_credentials "$tmp/bad.json" + +replacement='{"accessToken":"new-access","refreshToken":"new-refresh","expiresAt":456}' +merged="$(cas_merge_oauth "$tmp/credentials.json" "$replacement")" +eq "replace Claude access token" new-access "$(jq -r .claudeAiOauth.accessToken <<<"$merged")" +eq "preserve MCP OAuth" mcp-secret "$(jq -r '.mcpOAuth.server.accessToken' <<<"$merged")" + +export CAS_USER=emo +ok "accept own scoped Vault token" cas_vault_identity_ok token-devvm-claude-auth-emo default,workstation-claude-emo +no "reject another user's token" cas_vault_identity_ok token-devvm-claude-auth-anca default,workstation-claude-anca +no "reject wrong policy" cas_vault_identity_ok token-devvm-claude-auth-emo 
default,workstation-claude-anca + +printf '\n%d passed, %d failed\n' "$pass" "$fail" +(( fail == 0 )) diff --git a/scripts/test-fan-control.sh b/scripts/test-fan-control.sh index a42e24a9..65fae7c6 100644 --- a/scripts/test-fan-control.sh +++ b/scripts/test-fan-control.sh @@ -1,6 +1,7 @@ #!/usr/bin/env bash -# Unit tests for the pure functions in fan-control.sh. -# Sources the script (main is guarded), exercises curve/decide/resolve/presence/parse. +# Unit tests for the pure functions in fan-control.sh (the thin actuator). +# The control math lives in Home Assistant now; the daemon only validates and +# applies the HA-computed command, so these cover the I/O-adjacent pure helpers. # Run: bash infra/scripts/test-fan-control.sh set -uo pipefail @@ -14,43 +15,31 @@ eq() { # <description> <expected> <actual> fail=$((fail + 1)); printf 'FAIL: %s — expected [%s] got [%s]\n' "$1" "$2" "$3" fi } +ok() { # <description> <cmd...> (passes if cmd exits 0) + if "${@:2}"; then pass=$((pass + 1)); else fail=$((fail + 1)); printf 'FAIL: %s — expected exit 0\n' "$1"; fi +} +no() { # <description> <cmd...> (passes if cmd exits non-zero) + if "${@:2}"; then fail=$((fail + 1)); printf 'FAIL: %s — expected non-zero exit\n' "$1"; else pass=$((pass + 1)); fi +} -# --- COOL curve (continuous linear: 30% @50C .. 
100% @83C) --- -eq "cool <=T_LO clamps" 30 "$(fc_curve cool 40)" -eq "cool 50 -> 30" 30 "$(fc_curve cool 50)" -eq "cool 55 -> 41" 41 "$(fc_curve cool 55)" -eq "cool 60 -> 51" 51 "$(fc_curve cool 60)" -eq "cool 64 -> 60" 60 "$(fc_curve cool 64)" -eq "cool 70 -> 72" 72 "$(fc_curve cool 70)" -eq "cool 75 -> 83" 83 "$(fc_curve cool 75)" -eq "cool 83 -> 100" 100 "$(fc_curve cool 83)" -eq "cool >=T_HI clamps" 100 "$(fc_curve cool 90)" +# --- fc_num: sanitise the HA command read (truncate floats, fallback, clamp) --- +eq "num valid" 55 "$(fc_num 55 0 0 100)" +eq "num float trunc" 55 "$(fc_num 55.7 0 0 100)" +eq "num empty->fb" 0 "$(fc_num '' 0 0 100)" +eq "num garbage->fb" 0 "$(fc_num abc 0 0 100)" +eq "num clamp low" 0 "$(fc_num -5 0 0 100)" +eq "num clamp high" 100 "$(fc_num 150 0 0 100)" -# --- QUIET curve (continuous linear: 20% @68C .. 100% @83C) --- -eq "quiet <=T_LO clamps" 20 "$(fc_curve quiet 60)" -eq "quiet 68 -> 20" 20 "$(fc_curve quiet 68)" -eq "quiet 70 -> 31" 31 "$(fc_curve quiet 70)" -eq "quiet 75 -> 57" 57 "$(fc_curve quiet 75)" -eq "quiet 80 -> 84" 84 "$(fc_curve quiet 80)" -eq "quiet 83 -> 100" 100 "$(fc_curve quiet 83)" +# --- fc_fresh: staleness gate on the command's last_updated age --- +ok "fresh well within" fc_fresh 30 120 +ok "fresh at boundary" fc_fresh 120 120 +no "stale just past" fc_fresh 121 120 +no "stale way past" fc_fresh 600 120 -# --- decide: asymmetric hysteresis (ramp up now, ease down only past the deadband) --- -eq "decide uninit -> target" 68 "$(fc_decide cool 68 -1 3)" -eq "decide ramp up now" 68 "$(fc_decide cool 68 25 3)" -eq "decide equal holds" 62 "$(fc_decide cool 65 62 3)" -eq "decide down held" 72 "$(fc_decide cool 68 72 3)" # curve(68)=68<72 but curve(71)=75 !<72 -> hold -eq "decide down past" 60 "$(fc_decide cool 64 72 3)" # curve(64)=60, curve(67)=66<72 -> drop - -# --- fc_clamp / fc_resolve: HA mode resolution --- +# --- fc_clamp --- eq "clamp over 100" 100 "$(fc_clamp 150)" eq "clamp under 0" 0 "$(fc_clamp -5)" eq 
"clamp passthrough" 45 "$(fc_clamp 45)" -eq "resolve manual=slider" 42 "$(fc_resolve manual 64 42 cool -1 3)" -eq "resolve manual clamped" 100 "$(fc_resolve manual 64 150 cool -1 3)" -eq "resolve cool=cool curve" 51 "$(fc_resolve cool 60 0 cool -1 3)" -eq "resolve quiet=quiet curve" 73 "$(fc_resolve quiet 78 0 cool -1 3)" -eq "resolve auto+empty=cool" 51 "$(fc_resolve auto 60 0 cool -1 3)" -eq "resolve auto+present=quiet" 31 "$(fc_resolve auto 70 0 quiet -1 3)" # --- fc_fan_watts: estimated fan power from RPM (cube-law, calibrated to the sweep) --- eq "fan_watts 0" 0 "$(fc_fan_watts 0)" @@ -59,21 +48,14 @@ eq "fan_watts 9360" 16 "$(fc_fan_watts 9360)" eq "fan_watts 12720" 42 "$(fc_fan_watts 12720)" eq "fan_watts 16920" 99 "$(fc_fan_watts 16920)" -# --- presence --- -now=1000000 -eq "presence open -> quiet" quiet "$(fc_presence_mode Отворена 0 $now 900 Отворена)" -eq "presence closed recent -> quiet" quiet "$(fc_presence_mode Затворена $((now - 100)) $now 900 Отворена)" -eq "presence closed stale -> cool" cool "$(fc_presence_mode Затворена $((now - 1000)) $now 900 Отворена)" -eq "presence closed edge -> cool" cool "$(fc_presence_mode Затворена $((now - 900)) $now 900 Отворена)" - # --- temp parsing --- eq "parse temp line" 74 "$(fc_parse_temp 'Temp | 0Eh | ok | 3.1 | 74 degrees C')" eq "parse temp 7C" 72 "$(fc_parse_temp 'Temp | 0Eh | ok | 3.1 | 72 degrees C')" -# --- json field (jq-free) --- -J='{"entity_id":"sensor.garage_door_state_bg","state":"Отворена","attributes":{"friendly_name":"Garage Door State BG"},"last_changed":"2026-06-04T16:55:20.517745+00:00","last_updated":"2026-06-04T16:55:20.517745+00:00"}' -eq "json state" "Отворена" "$(fc_json_str_field "$J" state)" -eq "json last_changed" "2026-06-04T16:55:20.517745+00:00" "$(fc_json_str_field "$J" last_changed)" +# --- json field (jq-free): state + last_updated parsing for the command read --- 
+J='{"entity_id":"sensor.r730_fan_command_pct","state":"57","attributes":{"unit_of_measurement":"%"},"last_changed":"2026-06-08T16:55:20.517745+00:00","last_updated":"2026-06-08T16:55:25.000000+00:00"}' +eq "json state" "57" "$(fc_json_str_field "$J" state)" +eq "json last_updated" "2026-06-08T16:55:25.000000+00:00" "$(fc_json_str_field "$J" last_updated)" # --- hex conversion --- eq "hex 20" 0x14 "$(fc_pct_to_hex 20)" diff --git a/scripts/test_tg_lock_timeout.py b/scripts/test_tg_lock_timeout.py new file mode 100644 index 00000000..263e5a74 --- /dev/null +++ b/scripts/test_tg_lock_timeout.py @@ -0,0 +1,102 @@ +#!/usr/bin/env python3 +"""Tests for scripts/tg lock-timeout injection. + +scripts/tg wraps terragrunt. Tier-1 stacks rely on terraform's pg-backend +state lock; without -lock-timeout an apply fails instantly ("Error acquiring +the state lock") whenever anything else holds the lock — a Woodpecker-killed +run whose PG advisory lock has not been reaped yet, a concurrent local apply, +or the daily drift `plan`. This was the single largest cause of infra CI +failures. These tests pin that tg injects -lock-timeout for state-locking +verbs (and still preserves -auto-approve for non-interactive applies), so a +contended lock waits rather than fails. + +Hermetic: a stub `terragrunt` on PATH records the args tg forwards; PG_CONN_STR +is pre-set so the Tier-1 Vault credential fetch is skipped (no network/Vault). 
+""" +import os +import shutil +import subprocess +from pathlib import Path + +import pytest + +SCRIPTS_DIR = Path(__file__).resolve().parent +TG = SCRIPTS_DIR / "tg" +AUTH_CHECK = SCRIPTS_DIR / "check-ingress-auth-comments.py" + + +def _run(tmp_path, *tg_args, env_extra=None): + """Run a copy of scripts/tg in an isolated fake repo; return forwarded args.""" + repo = tmp_path / "repo" + (repo / "scripts").mkdir(parents=True) + shutil.copy(TG, repo / "scripts" / "tg") + shutil.copy(AUTH_CHECK, repo / "scripts" / "check-ingress-auth-comments.py") + os.chmod(repo / "scripts" / "tg", 0o755) + os.chmod(repo / "scripts" / "check-ingress-auth-comments.py", 0o755) + + # Fake Tier-1 stack ("faketest" is NOT in TIER0_STACKS), no ingress auth lines. + stack = repo / "stacks" / "faketest" + stack.mkdir(parents=True) + (stack / "terragrunt.hcl").write_text("# fake\n") + (stack / "main.tf").write_text("# no ingress_factory auth lines here\n") + + # Stub terragrunt: append every forwarded arg (one per line) to a capture file. 
+ bindir = tmp_path / "bin" + bindir.mkdir() + capture = tmp_path / "tg_args.txt" + stub = bindir / "terragrunt" + stub.write_text( + "#!/usr/bin/env bash\n" + f'for a in "$@"; do echo "$a" >> "{capture}"; done\n' + "exit 0\n" + ) + os.chmod(stub, 0o755) + + env = dict(os.environ) + env["PATH"] = f"{bindir}:{env['PATH']}" + env["PG_CONN_STR"] = "postgres://stub" # skip the Tier-1 Vault cred fetch + env["TF_PLUGIN_CACHE_DIR"] = str(tmp_path / "plugin-cache") + if env_extra: + env.update(env_extra) + + proc = subprocess.run( + ["bash", str(repo / "scripts" / "tg"), *tg_args], + cwd=str(stack), + env=env, + capture_output=True, + text=True, + ) + assert proc.returncode == 0, f"tg exited {proc.returncode}\nSTDERR:\n{proc.stderr}\nSTDOUT:\n{proc.stdout}" + return capture.read_text().splitlines() if capture.exists() else [] + + +def test_apply_non_interactive_has_lock_timeout_and_auto_approve(tmp_path): + args = _run(tmp_path, "apply", "--non-interactive") + assert "apply" in args + assert "-auto-approve" in args, "non-interactive apply must keep -auto-approve" + assert "-lock-timeout=5m" in args, "apply must wait for a contended state lock" + + +def test_plan_has_lock_timeout_but_not_auto_approve(tmp_path): + args = _run(tmp_path, "plan") + assert "plan" in args + assert "-lock-timeout=5m" in args + assert "-auto-approve" not in args, "plan must never get -auto-approve" + + +@pytest.mark.parametrize("verb", ["destroy", "refresh"]) +def test_locking_verb_gets_lock_timeout(tmp_path, verb): + args = _run(tmp_path, verb) + assert "-lock-timeout=5m" in args, f"{verb} should carry -lock-timeout" + + +def test_non_locking_verb_has_no_lock_timeout(tmp_path): + # validate does not take a state lock — must not carry -lock-timeout. 
+ args = _run(tmp_path, "validate") + assert "validate" in args + assert not any(a.startswith("-lock-timeout") for a in args) + + +def test_lock_timeout_is_env_overridable(tmp_path): + args = _run(tmp_path, "plan", env_extra={"TG_LOCK_TIMEOUT": "2m"}) + assert "-lock-timeout=2m" in args diff --git a/scripts/tg b/scripts/tg index b9e9f0da..b0574f89 100755 --- a/scripts/tg +++ b/scripts/tg @@ -13,6 +13,15 @@ export TF_PLUGIN_CACHE_DIR="${TF_PLUGIN_CACHE_DIR:-$HOME/.terraform.d/plugin-cac export TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE=1 mkdir -p "$TF_PLUGIN_CACHE_DIR" +# State-lock wait window. Tier-1 stacks lock their state via terraform's pg +# backend (pg_advisory_lock); with no timeout an apply fails instantly +# ("Error acquiring the state lock") the moment anything else holds the lock — +# a Woodpecker-killed run whose lock PG hasn't reaped yet, a concurrent local +# apply, or the daily drift `plan`. Waiting a few minutes absorbs all of those +# (the holder finishes, or PG reaps the dead backend). This was the #1 cause of +# infra CI failures. Override with TG_LOCK_TIMEOUT (e.g. 0 to fail fast). +LOCK_TIMEOUT="${TG_LOCK_TIMEOUT:-5m}" + # Determine stack name from cwd (relative to stacks/) STACK_NAME="" cwd="$(pwd)" @@ -134,29 +143,30 @@ if $is_mutating && [ -n "$STACK_NAME" ] && is_tier0 "$STACK_NAME"; then fi fi -# If running apply with --non-interactive, add -auto-approve for Terraform +# Build the terragrunt invocation: +# - add -auto-approve right after `apply` for --non-interactive runs (CI) +# - add -lock-timeout for state-locking verbs (plan/apply/destroy/refresh) so +# a contended state lock WAITS instead of failing instantly (see +# LOCK_TIMEOUT above). Non-locking verbs (init/validate/output/fmt) skip it. 
args=("$@") -has_apply=false has_non_interactive=false for arg in "${args[@]}"; do case "$arg" in - apply) has_apply=true ;; --non-interactive) has_non_interactive=true ;; esac done -if $has_apply && $has_non_interactive; then - new_args=() - for arg in "${args[@]}"; do - new_args+=("$arg") - if [ "$arg" = "apply" ]; then - new_args+=("-auto-approve") - fi - done - terragrunt "${new_args[@]}" -else - terragrunt "$@" +tg_args=() +for arg in "${args[@]}"; do + tg_args+=("$arg") + if [ "$arg" = "apply" ] && $has_non_interactive; then + tg_args+=("-auto-approve") + fi +done +if $is_tf_op; then + tg_args+=("-lock-timeout=$LOCK_TIMEOUT") fi +terragrunt "${tg_args[@]}" # After mutating operations: encrypt+commit (Tier 0) or no-op (Tier 1 — PG is authoritative) if $is_mutating && [ -n "$STACK_NAME" ] && is_tier0 "$STACK_NAME"; then diff --git a/scripts/tmux-persist-restore.service b/scripts/tmux-persist-restore.service new file mode 100644 index 00000000..62c61d20 --- /dev/null +++ b/scripts/tmux-persist-restore.service @@ -0,0 +1,12 @@ +[Unit] +Description=Restore workstation tmux sessions (resume saved claude conversations) after boot +After=network.target local-fs.target +# Before the save timer's first run (OnBootSec=10min) so an empty post-boot +# state can never be snapshotted over the manifest being restored from. 
+ +[Service] +Type=oneshot +ExecStart=/usr/local/bin/tmux-persist restore + +[Install] +WantedBy=multi-user.target diff --git a/scripts/tmux-persist-save.service b/scripts/tmux-persist-save.service new file mode 100644 index 00000000..deecf541 --- /dev/null +++ b/scripts/tmux-persist-save.service @@ -0,0 +1,6 @@ +[Unit] +Description=Snapshot workstation tmux sessions (name -> claude conversation) for reboot survival + +[Service] +Type=oneshot +ExecStart=/usr/local/bin/tmux-persist save diff --git a/scripts/tmux-persist-save.timer b/scripts/tmux-persist-save.timer new file mode 100644 index 00000000..b230aee2 --- /dev/null +++ b/scripts/tmux-persist-save.timer @@ -0,0 +1,10 @@ +[Unit] +Description=Periodic workstation tmux session snapshot + +[Timer] +OnBootSec=10min +OnCalendar=*:0/5 +Persistent=false + +[Install] +WantedBy=timers.target diff --git a/scripts/tmux-persist.sh b/scripts/tmux-persist.sh new file mode 100644 index 00000000..5afac3a4 --- /dev/null +++ b/scripts/tmux-persist.sh @@ -0,0 +1,135 @@ +#!/usr/bin/env bash +# Persist WEB-TERMINAL (ttyd/tmux) sessions across devvm reboots. +# +# Scope: the tmux-based web terminal only. The t3 chat surface persists its +# own threads (~/.t3 state.sqlite, backed up daily by t3-backup-state) — this +# script is about the tmux sessions, which are otherwise memory-only. Users +# come from /etc/ttyd-user-map (the terminal surface's roster-derived map). +# +# save — snapshot every roster user's live tmux sessions to +# /var/lib/tmux-persist/<user>.tsv (name, cwd, claude session +# uuid). The uuid is read from the claude process's `--resume <uuid>` +# argv, else from the newest ~/.claude/projects/<slug>/<uuid>.jsonl +# written since the process started (see uuid_of_claude), so it is found for both a fresh +# start-claude.sh launch and an explicit --resume. Runs every 5 min via +# tmux-persist-save.timer.
A snapshot that captures no live sessions +# (no server, OR a stale socket left behind by an OOM-killed server) +# keeps the user's last manifest, so it can't be wiped before restore. +# restore — recreate manifest sessions that don't currently exist, resuming +# each saved conversation (claude --resume <uuid>). Per-session +# idempotent: existing names are left alone, so it is safe both at +# boot (tmux-persist-restore.service) and after a partial loss. +# +# v1 limitation: one window/pane per session is captured (the workstation +# usage pattern — one named claude conversation per tmux session). +set -euo pipefail + +STATE_DIR=/var/lib/tmux-persist +MAP=/etc/ttyd-user-map +MODE="${1:-}" + +log() { echo "[tmux-persist] $*"; } + +users() { [[ -r "$MAP" ]] && cut -d= -f2 "$MAP" | sort -u; } + +tmux_as() { local u="$1"; shift; runuser -u "$u" -- tmux "$@"; } + +# First descendant of $1 whose comm is `claude` (BFS, bounded by process tree). +claude_pid_under() { + local q=("$1") pid kids + while ((${#q[@]})); do + pid="${q[0]}"; q=("${q[@]:1}") + [[ "$(ps -o comm= -p "$pid" 2>/dev/null)" == claude ]] && { echo "$pid"; return 0; } + read -ra kids <<<"$(pgrep -P "$pid" 2>/dev/null | tr '\n' ' ')" || true + ((${#kids[@]})) && q+=("${kids[@]}") + done + return 1 +} + +# Conversation uuid of a claude process ($1 pid, $2 user, $3 cwd). Two sources +# (claude does NOT hold its transcript fd open, so fd-sniffing doesn't work): +# 1. argv `--resume <uuid>` — covers every session this script's restore (or a +# manual recovery) created, making the save/restore loop self-sustaining; +# 2. newest <uuid>.jsonl in the user's cwd-slug project dir created at/after +# the process start — covers fresh launcher-started sessions. +# Always returns 0; empty output means "no conversation" (restored as a shell). 
+uuid_of_claude() { + local uuid slug dir start f + uuid="$(tr '\0' '\n' < "/proc/$1/cmdline" 2>/dev/null \ + | grep -A1 -x -- '--resume' | tail -1 \ + | grep -oE '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$' || true)" + [[ -n "$uuid" ]] && { echo "$uuid"; return 0; } + slug="${3//\//-}"; slug="${slug//./-}" + dir="$(getent passwd "$2" | cut -d: -f6)/.claude/projects/$slug" + [[ -d "$dir" ]] || return 0 + start=$(( $(date +%s) - $(ps -o etimes= -p "$1" 2>/dev/null | tr -d ' ' || echo 0) - 5 )) + f="$(find "$dir" -maxdepth 1 -name '*.jsonl' -newermt "@$start" -printf '%T@ %f\n' 2>/dev/null \ + | sort -rn | head -1 | awk '{print $2}' || true)" + [[ -n "$f" ]] && echo "${f%.jsonl}" + return 0 +} + +save() { + install -d -m 0755 "$STATE_DIR" + local u uid sess pane_pid pane_cwd cpid uuid tmp n + for u in $(users); do + uid="$(id -u "$u" 2>/dev/null)" || continue + [[ -S "/tmp/tmux-$uid/default" ]] || continue # no socket at all -> keep last manifest + tmp="$(mktemp)" + while IFS=$'\t' read -r sess pane_pid pane_cwd; do + [[ -n "$sess" ]] || continue + uuid="" + if cpid="$(claude_pid_under "$pane_pid")"; then uuid="$(uuid_of_claude "$cpid" "$u" "$pane_cwd")"; fi + printf '%s\t%s\t%s\n' "$sess" "$pane_cwd" "$uuid" >> "$tmp" + done < <(tmux_as "$u" list-panes -a -F $'#{session_name}\t#{pane_pid}\t#{pane_current_path}' 2>/dev/null \ + | sort -u -t$'\t' -k1,1) + # Only overwrite the manifest when we captured >=1 live session. A socket + # file can outlive its server (an OOM-killed tmux server leaves + # /tmp/tmux-<uid>/default behind); list-panes then yields nothing, and + # installing that empty result would clobber a good manifest right before + # restore needs it. Empty capture -> keep the last good manifest. 
+ n=$(wc -l < "$tmp") + if (( n > 0 )); then + install -m 0600 "$tmp" "$STATE_DIR/$u.tsv" + log "saved $n session(s) for $u" + else + log "no live sessions for $u (stale socket or dead server) — keeping last manifest" + fi + rm -f "$tmp" + done +} + +restore() { + local only="${1:-}" u f sess cwd uuid cmd + # Optional single-user restore: `tmux-persist restore <user>` limits the + # action to one terminal user (the web-UI restore button calls this via the + # tmux-restore-user wrapper). No arg => restore every user (the boot service). + if [[ -n "$only" ]] && ! users | grep -qxF "$only"; then + echo "[tmux-persist] restore: '$only' is not a known terminal user" >&2 + return 2 + fi + for u in $(users); do + [[ -z "$only" || "$u" == "$only" ]] || continue + f="$STATE_DIR/$u.tsv" + [[ -s "$f" ]] || continue + while IFS=$'\t' read -r sess cwd uuid; do + [[ -n "$sess" ]] || continue + tmux_as "$u" has-session -t "=$sess" 2>/dev/null && continue # already live + [[ -d "$cwd" ]] || cwd="$(getent passwd "$u" | cut -d: -f6)" + if [[ -n "$uuid" ]]; then + cmd="claude --dangerously-skip-permissions --resume $uuid --name \"$sess\"; echo; echo ' claude exited — shell preserved'; exec bash -l" + else + cmd="exec bash -l" + fi + tmux_as "$u" new-session -d -s "$sess" -c "$cwd" "$cmd" \ + && log "restored $u:$sess${uuid:+ (resume ${uuid:0:8})}" \ + || log "WARN: failed to restore $u:$sess" + done < "$f" + done +} + +case "$MODE" in + save) save ;; + restore) restore "${2:-}" ;; + *) echo "usage: tmux-persist save | restore [user]" >&2; exit 1 ;; +esac diff --git a/scripts/update_k8s.sh b/scripts/update_k8s.sh index 6e01d654..19abe7ef 100755 --- a/scripts/update_k8s.sh +++ b/scripts/update_k8s.sh @@ -82,7 +82,10 @@ sudo apt-get install -y "kubeadm=$RELEASE-*" if [[ "$ROLE" == "master" ]]; then echo "==> Master path: kubeadm upgrade plan + apply" - sudo kubeadm upgrade plan + # `plan` runs the same CoreDNS preflight as `apply`, so once master's kubeadm + # is on the new version it 
fails here too (under set -e) — ignore the same + # two CoreDNS checks. See the apply block below for the full rationale. + sudo kubeadm upgrade plan --ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins # The first apply may fail with "static Pod hash for component <X> did # not change after 5m0s" — kubeadm's 5min wait for the kubelet to reload # a static pod is too tight on our cluster (apiserver-to-kubelet status @@ -98,7 +101,20 @@ if [[ "$ROLE" == "master" ]]; then # right version (which is the only case where this timeout fires). attempt=1 extra_flags="" - while ! sudo kubeadm upgrade apply "v$RELEASE" -y $extra_flags; do + # CoreDNS is managed OUTSIDE kubeadm on this cluster: the Corefile is a + # custom split-horizon config owned by the technitium stack, and the image + # is intentionally tracked separately. kubeadm's bundled corefile-migration + # library rejects CoreDNS versions it doesn't know (e.g. 1.12.4 -> "start + # version not supported"), which HARD-FAILS `upgrade apply` at preflight. + # Forcing past preflight with --ignore alone is NOT enough — kubeadm would + # then overwrite our custom Corefile with its default AND downgrade the + # image (verified via `kubeadm upgrade apply --dry-run`, 2026-06-17). So we + # also skip the coredns addon phase entirely: kubeadm leaves CoreDNS 100% + # untouched and only upgrades the control-plane components. (Root fix: keep + # CoreDNS off Keel — keel.sh/policy=never — so it stops drifting ahead of + # kubeadm's migration table.) + coredns_flags="--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins --skip-phases=addon/coredns" + while ! 
sudo kubeadm upgrade apply "v$RELEASE" -y $coredns_flags $extra_flags; do if (( attempt >= 3 )); then echo "ERROR: kubeadm upgrade apply failed after 3 attempts" >&2 exit 1 diff --git a/scripts/upgrade_state.sh b/scripts/upgrade_state.sh index 2e6e7faa..e5722f8b 100755 --- a/scripts/upgrade_state.sh +++ b/scripts/upgrade_state.sh @@ -10,7 +10,7 @@ # keel.sh/policy. Metrics on container :9300/metrics. # 2. OS — unattended-upgrades patches in-release per node; kured # reboots within a daily 02:00-06:00 London window. -# 3. K8s — k8s-version-check CronJob (Sun 12:00 UTC) detects new +# 3. K8s — k8s-version-check CronJob (23:00 UTC nightly) detects new # kubeadm patch/minor releases; Job-chain drains+upgrades # node-by-node. Pushgateway holds k8s_upgrade_* gauges. # @@ -443,7 +443,18 @@ collect_k8s() { fi fi - K8S_NEXT="$(next_daily_noon_utc)" + K8S_NEXT="$(next_scheduled_run_utc)" + + # Failed chain-Job detection. A preflight/phase Job can abort BEFORE pushing + # k8s_upgrade_in_flight=1 (the preflight gates exit pre-metric), so in-flight + # / stalled stay clean while the pipeline is actually wedged: the + # deterministic-name + 7d-TTL Job blocks re-spawn. Surface it directly. + # (2026-06-17: a transient critical alert wedged the 1.34.9 preflight for 5 + # days, invisible to every metric-based check.) + local failed_jobs + failed_jobs=$($KUBECTL -n k8s-upgrade get jobs \ + -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Failed")].status}{"\n"}{end}' 2>/dev/null \ + | awk -F'\t' '$2=="True" && $1 ~ /^k8s-upgrade-/{print $1}' | paste -sd' ' - || true) # Status logic. local stalled=0 @@ -463,6 +474,10 @@ collect_k8s() { K8S_STATUS_ICON="✗"; K8S_STATUS_TEXT="detection stale" K8S_NOTES="last detection >9d ago" raise_exit 2 + elif [[ -n "$failed_jobs" ]]; then + K8S_STATUS_ICON="✗"; K8S_STATUS_TEXT="chain failed" + K8S_NOTES="failed upgrade Job(s): $failed_jobs — pipeline wedged. 
Inspect: kubectl -n k8s-upgrade describe job <name> (the retry-on-failure guard re-spawns on the next detection cycle)" + raise_exit 2 elif [[ "${in_flight:-0}" == "1" ]]; then K8S_STATUS_ICON="…"; K8S_STATUS_TEXT="in-flight" K8S_NOTES="upgrade chain running" @@ -481,15 +496,15 @@ collect_k8s() { fi } -# Next daily 12:00 UTC — pure bash date math, no croniter. Schedule was -# weekly Sunday until 2026-05-18; now `0 12 * * *` in the -# k8s-version-upgrade stack. If we're still before today's 12:00 UTC, -# the next run is today; otherwise it's tomorrow. -next_daily_noon_utc() { +# Next daily 23:00 UTC — pure bash date math, no croniter. Schedule is +# `0 23 * * *` in the k8s-version-upgrade stack (overnight; moved from 12:00 UTC +# on 2026-06-17). If we're still before today's 23:00 UTC the next run is today; +# otherwise tomorrow. +next_scheduled_run_utc() { local hr days_ahead hr=$(date -u +%H) - if [[ "$hr" -lt 12 ]]; then days_ahead=0; else days_ahead=1; fi - date -u -d "+$days_ahead days" +"%a %Y-%m-%d 12:00 UTC" + if [[ "$hr" -lt 23 ]]; then days_ahead=0; else days_ahead=1; fi + date -u -d "+$days_ahead days" +"%a %Y-%m-%d 23:00 UTC" } # --- Renderers --- diff --git a/scripts/vzdump-vms.service b/scripts/vzdump-vms.service new file mode 100644 index 00000000..32ac8f96 --- /dev/null +++ b/scripts/vzdump-vms.service @@ -0,0 +1,16 @@ +[Unit] +Description=vzdump image backup of hand-managed VMs (devvm, …) to /mnt/backup +Documentation=https://forgejo.viktorbarzin.me/viktor/infra/src/branch/main/docs/architecture/backup-dr.md +After=network-online.target +Wants=network-online.target +RequiresMountsFor=/mnt/backup + +[Service] +Type=oneshot +ExecStart=/usr/local/bin/vzdump-vms +# Be gentle on the contended PVE IO domain (sdc) — backup must never starve etcd. +Nice=10 +IOSchedulingClass=idle +# Reading a ~77 GB disk + zstd can run long under IO contention; well above +# normal (~15-30 min) but bounded so a hung run can't wedge the timer forever. 
+TimeoutStartSec=4h diff --git a/scripts/vzdump-vms.sh b/scripts/vzdump-vms.sh new file mode 100644 index 00000000..8954d0e5 --- /dev/null +++ b/scripts/vzdump-vms.sh @@ -0,0 +1,120 @@ +#!/usr/bin/env bash +# vzdump-vms — image-level backup of hand-managed Proxmox VMs (NOT in Terraform). +# Deploy to PVE host at /usr/local/bin/vzdump-vms (strip the .sh). +# Schedule: Daily 01:00 via systemd timer. +# +# WHY: the hand-managed Linux VMs (devvm, …) have NO image backup. nfs-mirror / +# daily-backup / offsite-sync cover cluster PVCs, NFS, pfSense and PVE config — +# but never the VM disks themselves. A lost devvm disk = unrecoverable home dirs +# + local-only git repos (the monorepo root has no remote). This takes a live +# `vzdump --mode snapshot` of each configured VMID to /mnt/backup/vzdump (sda = +# Copy 2). The monthly offsite-sync full pass (days 1-7) mirrors /mnt/backup — +# including this dir — to Synology with --delete (Copy 3), bounded to local +# retention. We deliberately do NOT append to the incremental manifest: it never +# deletes, so daily multi-GB images would accumulate unbounded on Synology. +# +# RESTORE: pick a dump under /mnt/backup/vzdump, then on the PVE host: +# qmrestore /mnt/backup/vzdump/vzdump-qemu-<vmid>-<ts>.vma.zst <new-or-same-vmid> +# (restore to a fresh VMID first if the original still exists, then swap), or use +# the PVE UI (Datacenter → Storage → upload dir → Restore). See backup-dr.md. +set -euo pipefail + +# systemd oneshot units get a minimal PATH (/usr/bin:/bin) — qm and vzdump live +# in /usr/sbin, so set an explicit PATH or the script silently can't find them. +export PATH="/usr/sbin:/usr/bin:/sbin:/bin:${PATH:-}" + +# --- Configuration --- +VMIDS="${VZDUMP_VMIDS:-102}" # space-separated. 102 = devvm. Add VMIDs here. 
+DUMPDIR="${VZDUMP_DUMPDIR:-/mnt/backup/vzdump}" +KEEP="${VZDUMP_KEEP:-3}" # retain N newest dumps per VMID on sda +COMPRESS="${VZDUMP_COMPRESS:-zstd}" +BACKUP_ROOT="/mnt/backup" +PUSHGATEWAY="${VZDUMP_PUSHGATEWAY:-http://10.0.20.100:30091}" +PUSHGATEWAY_JOB="vzdump-backup" +LOCKFILE="/run/vzdump-vms.lock" + +# --- Logging --- +log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; } +warn() { log "WARN: $*" >&2; } + +# --- Metrics (always returns 0 so it never trips set -e) --- +push_metrics() { + local status="${1:-0}" bytes="${2:-0}" now + now=$(date +%s) + { + echo "vzdump_last_run_timestamp ${now}" + echo "vzdump_last_status ${status}" + echo "vzdump_last_bytes ${bytes}" + [ "${status}" -eq 0 ] && echo "vzdump_last_success_timestamp ${now}" + } | curl -s --connect-timeout 5 --max-time 10 --data-binary @- \ + "${PUSHGATEWAY}/metrics/job/${PUSHGATEWAY_JOB}" 2>/dev/null || true + return 0 +} + +# --- Locking (push a non-success metric if systemd kills us mid-run) --- +KILLED="" +cleanup() { + rm -f "${LOCKFILE}" + # NB: must be `if…fi`, NOT `[ … ] && …` — a bash EXIT trap whose LAST command + # returns non-zero overrides the script's `exit 0`, so the `&&` short-circuit + # (when KILLED is empty) would falsely mark a successful backup as failed. + if [ -n "${KILLED}" ]; then push_metrics 2 0; fi +} +trap cleanup EXIT +trap 'KILLED=1; exit 143' TERM INT + +if ! ( set -o noclobber; echo $$ > "${LOCKFILE}" ) 2>/dev/null; then + warn "Another instance running (PID $(cat "${LOCKFILE}" 2>/dev/null || echo unknown)) — exiting" + exit 0 +fi + +# --- Preconditions --- +if ! mountpoint -q "${BACKUP_ROOT}"; then + warn "${BACKUP_ROOT} not mounted — aborting"; push_metrics 1 0; exit 1 +fi +mkdir -p "${DUMPDIR}" + +# --- Main --- +log "=== vzdump-vms starting (VMIDs: ${VMIDS}, keep ${KEEP}) ===" +STATUS=0 +TOTAL_BYTES=0 + +for vmid in ${VMIDS}; do + if ! 
qm status "${vmid}" >/dev/null 2>&1; then + warn "VMID ${vmid} not found on this node — skipping" + STATUS=1 + continue + fi + + log "--- vzdump ${vmid} ($(qm config "${vmid}" 2>/dev/null | sed -n 's/^name: //p')) ---" + if vzdump "${vmid}" \ + --dumpdir "${DUMPDIR}" \ + --mode snapshot \ + --compress "${COMPRESS}" \ + --ionice 7 \ + --quiet 1; then + newest=$(ls -t "${DUMPDIR}"/vzdump-qemu-"${vmid}"-*.vma.* 2>/dev/null | grep -v '\.notes$' | head -1 || true) + if [ -n "${newest}" ]; then + sz=$(stat -c%s "${newest}" 2>/dev/null || echo 0) + TOTAL_BYTES=$((TOTAL_BYTES + sz)) + log " OK: $(basename "${newest}") ($(numfmt --to=iec "${sz}" 2>/dev/null || echo "${sz}B"))" + fi + else + warn "vzdump ${vmid} failed (rc=$?)" + STATUS=1 + fi + + # Retention: keep newest ${KEEP} per VMID (archive + its .log + .notes siblings). + mapfile -t archives < <(ls -t "${DUMPDIR}"/vzdump-qemu-"${vmid}"-*.vma.* 2>/dev/null | grep -v '\.notes$' || true) + if [ "${#archives[@]}" -gt "${KEEP}" ]; then + for old in "${archives[@]:${KEEP}}"; do + prefix="${old%.vma.*}" # …/vzdump-qemu-<vmid>-<YYYY_MM_DD>-<HH_MM_SS> + log " prune: $(basename "${prefix}")" + rm -f "${prefix}".vma.* "${prefix}".log 2>/dev/null || true + done + fi +done + +log "=== vzdump-vms complete (status=${STATUS}, $(numfmt --to=iec "${TOTAL_BYTES}" 2>/dev/null || echo "${TOTAL_BYTES}B")) ===" +push_metrics "${STATUS}" "${TOTAL_BYTES}" +exit "${STATUS}" diff --git a/scripts/vzdump-vms.timer b/scripts/vzdump-vms.timer new file mode 100644 index 00000000..5cfefd92 --- /dev/null +++ b/scripts/vzdump-vms.timer @@ -0,0 +1,14 @@ +[Unit] +Description=Daily vzdump image backup of hand-managed VMs (devvm, …) +Documentation=https://forgejo.viktorbarzin.me/viktor/infra/src/branch/main/docs/architecture/backup-dr.md + +[Timer] +# 01:00 — ahead of nfs-mirror (02:00), lvm-pvc-snapshot (03:00), daily-backup +# (05:00) and offsite-sync (06:00), so the fresh image is on sda before the +# monthly full offsite pass mirrors /mnt/backup to 
Synology. +OnCalendar=*-*-* 01:00:00 +RandomizedDelaySec=10min +Persistent=true + +[Install] +WantedBy=timers.target diff --git a/scripts/workstation/claude-auth-sync.sh b/scripts/workstation/claude-auth-sync.sh new file mode 100755 index 00000000..dc3d780d --- /dev/null +++ b/scripts/workstation/claude-auth-sync.sh @@ -0,0 +1,153 @@ +#!/usr/bin/env bash +# Keep one Workstation user's Claude subscription OAuth credentials recoverable. +# Claude owns access/refresh-token rotation in ~/.claude/.credentials.json. This +# helper validates auth with real inference, stores only the claudeAiOauth object +# in the user's isolated Vault path, and attempts one restore on failure. +set -euo pipefail + +CAS_USER="${CLAUDE_AUTH_USER:-$(id -un)}" +CAS_HOME="${HOME:?HOME must be set}" +CAS_CREDENTIALS="${CLAUDE_CREDENTIALS_FILE:-$CAS_HOME/.claude/.credentials.json}" +CAS_CONFIG_DIR="${CLAUDE_AUTH_CONFIG_DIR:-$CAS_HOME/.config/claude-auth-sync}" +CAS_VAULT_TOKEN_FILE="${CLAUDE_AUTH_VAULT_TOKEN_FILE:-$CAS_CONFIG_DIR/vault-token}" +CAS_VAULT_PATH="${CLAUDE_AUTH_VAULT_PATH:-secret/workstation/claude-users/$CAS_USER}" +CAS_STATE_DIR="${CLAUDE_AUTH_STATE_DIR:-$CAS_HOME/.local/state/claude-auth-sync}" +CAS_LOG="$CAS_STATE_DIR/sync.log" + +cas_log() { + mkdir -p "$CAS_STATE_DIR" + printf '%s %s\n' "$(date -Is)" "$*" >> "$CAS_LOG" + logger -t claude-auth-sync -- "user=$CAS_USER $*" 2>/dev/null || true +} + +# Print the Claude OAuth object, or fail without exposing any token material. +cas_oauth_from_credentials() { + jq -ce '.claudeAiOauth + | select((.accessToken | type) == "string" and (.accessToken | length) > 0) + | select((.refreshToken | type) == "string" and (.refreshToken | length) > 0) + | select((.expiresAt | type) == "number")' "$1" +} + +# Merge a recovered OAuth object while preserving unrelated credentials (MCP OAuth). 
+cas_merge_oauth() { + local credentials="$1" oauth="$2" + jq -ce --argjson oauth "$oauth" '.claudeAiOauth = $oauth' "$credentials" +} + +cas_vault_identity_ok() { + local display_name="$1" policies_csv="$2" + [[ "$display_name" == "token-devvm-claude-auth-$CAS_USER" ]] || return 1 + printf ',%s,' "$policies_csv" | grep -q ",workstation-claude-$CAS_USER," +} + +cas_prepare_vault() { + [[ -s "$CAS_VAULT_TOKEN_FILE" ]] || { + cas_log "FAIL missing scoped Vault token; admin must run workstation provisioning" + return 1 + } + export VAULT_ADDR="${VAULT_ADDR:-https://vault.viktorbarzin.me}" + VAULT_TOKEN="$(<"$CAS_VAULT_TOKEN_FILE")"; export VAULT_TOKEN + + local info display_name policies + info="$(vault token lookup -format=json 2>/dev/null)" || { + cas_log "FAIL scoped Vault token lookup failed" + return 1 + } + display_name="$(jq -r '.data.display_name // ""' <<<"$info")" + policies="$(jq -r '((.data.policies // []) + (.data.identity_policies // [])) | join(",")' <<<"$info")" + cas_vault_identity_ok "$display_name" "$policies" || { + cas_log "FAIL scoped Vault token drift detected; refusing foreign token" + return 1 + } + vault token renew -format=json >/dev/null 2>&1 || { + cas_log "FAIL scoped Vault token renewal failed" + return 1 + } +} + +# auth status is not authoritative: it reported loggedIn=true during a real 401 +# on 2026-06-20. A tiny, non-persistent inference is the feedback loop. +cas_live_auth_ok() { + local out + out="$(timeout 60 claude -p 'Reply with exactly AUTH_OK and nothing else.' 
\ + --model haiku --max-turns 1 --no-session-persistence --tools "" \ + --disable-slash-commands --setting-sources "" 2>/dev/null)" || return 1 + [[ "$out" == "AUTH_OK" ]] +} + +cas_backup() { + local oauth expires + oauth="$(cas_oauth_from_credentials "$CAS_CREDENTIALS")" || { + cas_log "FAIL local Claude OAuth credential is absent or malformed" + return 1 + } + expires="$(jq -r '.expiresAt' <<<"$oauth")" + vault kv put "$CAS_VAULT_PATH" \ + claude_ai_oauth_json="$oauth" \ + credential_expires_at_ms="$expires" \ + backed_up_at="$(date -Is)" >/dev/null || { + cas_log "FAIL Vault credential backup failed" + return 1 + } + cas_log "OK Claude auth valid; refreshed OAuth state backed up to Vault" +} + +cas_restore() { + local oauth base tmp + oauth="$(vault kv get -field=claude_ai_oauth_json "$CAS_VAULT_PATH" 2>/dev/null)" || { + cas_log "FAIL no recoverable Claude OAuth credential in Vault" + return 1 + } + jq -e 'select((.accessToken | type) == "string" and (.accessToken | length) > 0) + | select((.refreshToken | type) == "string" and (.refreshToken | length) > 0) + | select((.expiresAt | type) == "number")' <<<"$oauth" >/dev/null || { + cas_log "FAIL Vault Claude OAuth credential is malformed" + return 1 + } + + mkdir -p "$(dirname "$CAS_CREDENTIALS")" + if jq -e 'type == "object"' "$CAS_CREDENTIALS" >/dev/null 2>&1; then + base="$CAS_CREDENTIALS" + else + base="$(mktemp)"; printf '{}\n' > "$base" + fi + tmp="$(mktemp "${CAS_CREDENTIALS}.XXXXXX")" + if ! 
cas_merge_oauth "$base" "$oauth" > "$tmp"; then + rm -f "$tmp"; [[ "$base" == "$CAS_CREDENTIALS" ]] || rm -f "$base" + cas_log "FAIL could not merge Vault Claude OAuth credential" + return 1 + fi + chmod 0600 "$tmp" + mv "$tmp" "$CAS_CREDENTIALS" + [[ "$base" == "$CAS_CREDENTIALS" ]] || rm -f "$base" + cas_log "RECOVERED restored Claude OAuth state from Vault" +} + +cas_main() { + umask 077 + for bin in jq vault claude timeout flock; do + command -v "$bin" >/dev/null || { cas_log "FAIL missing dependency: $bin"; return 1; } + done + mkdir -p "$CAS_STATE_DIR" + exec 9>"$CAS_STATE_DIR/lock" + flock -n 9 || { cas_log "SKIP another sync is already running"; return 0; } + + cas_prepare_vault || return 1 + if cas_live_auth_ok; then + cas_backup + return + fi + + cas_log "WARN live Claude auth failed; attempting one Vault restore" + cas_restore || return 1 + if cas_live_auth_ok; then + cas_backup + return + fi + cas_log "FAIL Claude auth still invalid after Vault restore; interactive SSO login required" + return 1 +} + +if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then + cas_main "$@" +fi diff --git a/scripts/workstation/claude-hooks/auto-learn.py b/scripts/workstation/claude-hooks/auto-learn.py new file mode 100755 index 00000000..174431f9 --- /dev/null +++ b/scripts/workstation/claude-hooks/auto-learn.py @@ -0,0 +1,184 @@ +#!/usr/bin/env python3 +""" +Stop hook (async): automatic learning extraction via haiku-as-judge. + +After each Claude response, sends the user message + assistant response to +haiku to detect corrections, preferences, decisions, or facts worth storing. +If learning events are detected, stores them via the `homelab memory` CLI — the +only sanctioned memory path on the devvm (no direct HTTP, no local SQLite). + +Runs with async: true — does NOT block the user. +""" + +import io +import json +import logging +import os +import shutil +import subprocess +import sys + +logger = logging.getLogger(__name__) + +JUDGE_PROMPT = """You are a memory extraction judge. 
Analyze this exchange between a user and an AI assistant. + +USER MESSAGE: +{user_message} + +ASSISTANT RESPONSE: +{assistant_response} + +Your job: determine if any of these learning events occurred: +1. USER CORRECTION — user corrected the assistant's mistake or misunderstanding +2. PREFERENCE — user stated a preference, habit, or "I like/prefer/want" statement +3. DECISION — a decision was reached about how to do something +4. FACT — user shared a durable fact about themselves, their team, tools, or environment + +If ANY learning event occurred, return JSON: +{{"events": [{{"type": "correction|preference|decision|fact", "content": "concise fact to remember (one sentence)", "importance": 0.7, "expanded_keywords": "space-separated semantically related search terms for recall (minimum 5 words)", "supersedes": null}}]}} + +If NO learning event occurred, return: +{{"events": []}} + +Rules: +- Only extract DURABLE facts, not transient task details +- Corrections are highest value (0.8-0.9) +- Be conservative — false negatives are better than false positives +- "expanded_keywords" should include synonyms, related concepts, and adjacent topics that would help find this memory later +- "supersedes" should be a search query to find the old outdated memory, or null +- Return ONLY valid JSON, no other text""" + + +def _store_via_homelab_cli(content, category, tags, importance, expanded_keywords): + """Store one memory via the homelab CLI — the only sanctioned memory path on + the devvm (no direct HTTP, no local SQLite). The CLI defaults the API URL and + reads CLAUDE_MEMORY_API_KEY / MEMORY_API_KEY from the environment; if neither + is set (e.g. 
a user without a minted key) it no-ops silently.""" + homelab = shutil.which("homelab") or "/usr/local/bin/homelab" + if not os.path.exists(homelab): + return + if not (os.environ.get("CLAUDE_MEMORY_API_KEY") or os.environ.get("MEMORY_API_KEY")): + return + cmd = [ + homelab, "memory", "store", content, + "--category", category, + "--tags", tags, + "--importance", str(importance), + ] + if expanded_keywords: + # CLI wants comma-separated keywords; the judge emits space-separated terms. + keywords = ",".join(expanded_keywords.replace(",", " ").split()) + if keywords: + cmd += ["--keywords", keywords] + subprocess.run(cmd, capture_output=True, text=True, timeout=15, env=os.environ) + + +def main() -> None: + # Graceful exit if claude CLI is not available + if not shutil.which("claude"): + return + + try: + hook_input = json.load(sys.stdin) + except (json.JSONDecodeError, EOFError): + return + + if isinstance(hook_input, dict) and hook_input.get("stop_hook_active", False): + return + + transcript_path = "" + if isinstance(hook_input, dict): + transcript_path = hook_input.get("transcript_path", "") + + if not transcript_path or not os.path.exists(transcript_path): + return + + user_message = "" + assistant_response = "" + try: + MAX_TAIL_BYTES = 50_000 + with open(transcript_path, "rb") as f: + f.seek(0, io.SEEK_END) + size = f.tell() + f.seek(max(0, size - MAX_TAIL_BYTES)) + tail = f.read().decode("utf-8", errors="replace") + lines = tail.split("\n") + + for line in reversed(lines): + line = line.strip() + if not line: + continue + try: + entry = json.loads(line) + except json.JSONDecodeError: + continue + role = entry.get("role", "") + content = entry.get("content", "") + if isinstance(content, list): + content = " ".join( + b.get("text", "") for b in content + if isinstance(b, dict) and b.get("type") == "text" + ) + content = str(content)[:2000] + if role == "assistant" and not assistant_response: + assistant_response = content + elif role == "user" and not 
user_message: + user_message = content + if user_message and assistant_response: + break + except Exception: + return + + if not user_message or len(user_message.strip()) < 10: + return + + prompt = JUDGE_PROMPT.format( + user_message=user_message, + assistant_response=assistant_response[:1000], + ) + + try: + result = subprocess.run( + ["claude", "-p", prompt, "--model", "haiku"], + capture_output=True, text=True, timeout=30, + env={**os.environ, "CLAUDECODE": ""}, + ) + if result.returncode != 0: + return + response_text = result.stdout.strip() + if response_text.startswith("```"): + lines = response_text.split("\n") + lines = [l for l in lines if not l.strip().startswith("```")] + response_text = "\n".join(lines).strip() + judge_result = json.loads(response_text) + events = judge_result.get("events", []) + if not events: + return + except (subprocess.TimeoutExpired, json.JSONDecodeError, OSError): + return + + category_map = { + "correction": "preferences", + "preference": "preferences", + "decision": "decisions", + "fact": "facts", + } + + for event in events: + # Parse and store inside one try: a malformed judge event (non-dict entry, + # non-numeric importance) must not crash the async hook either. + try: + content = event.get("content", "") + if not content: + continue + event_type = event.get("type", "fact") + importance = max(0.0, min(1.0, float(event.get("importance", 0.7)))) + category = category_map.get(event_type, "facts") + tags = f"auto-learned,{event_type}" + expanded_keywords = event.get("expanded_keywords", "") + _store_via_homelab_cli(content, category, tags, importance, expanded_keywords) + except Exception: + pass # Never crash the async hook + + +if __name__ == "__main__": + main() diff --git a/scripts/workstation/claude-hooks/homelab-memory-recall.py b/scripts/workstation/claude-hooks/homelab-memory-recall.py new file mode 100755 index 00000000..7315f116 --- /dev/null +++ b/scripts/workstation/claude-hooks/homelab-memory-recall.py @@ -0,0 +1,70 @@ +#!/usr/bin/env python3 +"""UserPromptSubmit hook: inject relevant memories via `homelab memory recall`.
+ +Replaces the claude-memory MCP recall path. Instead of instructing the model to +call the memory_recall MCP tool, this hook runs the homelab CLI (a direct client +to the same claude-memory HTTP API) and injects the ACTUAL results as context — +so recall is automatic, needs no model tool-call, and works with the MCP +uninstalled. Best-effort: any failure exits 0 silently (recall just doesn't +happen that turn, exactly like the MCP being unavailable). + +Wizard-only trial of the MCP deprecation (2026-06-20). Reversible: restore the +plugin command in ~/.claude/settings.json (backup: settings.json.bak-pre-homelab-memory). +""" + +import json +import os +import shutil +import subprocess +import sys + + +def main() -> None: + try: + hook_input = json.load(sys.stdin) + except (json.JSONDecodeError, EOFError): + return + + prompt = "" + if isinstance(hook_input, dict): + prompt = hook_input.get("prompt") or hook_input.get("user_prompt") or "" + if not prompt and isinstance(hook_input.get("content"), str): + prompt = hook_input["content"] + prompt = (prompt or "").strip() + + # Same gates as the original recall hook: skip short prompts, code/JSON/XML blobs. 
+ if len(prompt) < 10 or prompt[0] in "`{<": + return + + homelab = shutil.which("homelab") or "/usr/local/bin/homelab" + if not os.path.exists(homelab): + return + if not (os.environ.get("CLAUDE_MEMORY_API_KEY") or os.environ.get("MEMORY_API_KEY")): + return + + try: + res = subprocess.run( + [homelab, "memory", "recall", prompt, "--limit", "5"], + capture_output=True, text=True, timeout=4, env=os.environ, + ) + except (subprocess.TimeoutExpired, OSError): + return + + out = (res.stdout or "").strip() + if res.returncode != 0 or not out: + return + + context = ( + "Relevant stored memories (via `homelab memory recall`) — incorporate " + "naturally if useful; do NOT mention this lookup to the user:\n\n" + out + ) + print(json.dumps({ + "hookSpecificOutput": { + "hookEventName": "UserPromptSubmit", + "additionalContext": context, + } + })) + + +if __name__ == "__main__": + main() diff --git a/scripts/workstation/claude-hooks/post-compact-recovery.sh b/scripts/workstation/claude-hooks/post-compact-recovery.sh new file mode 100755 index 00000000..4687d951 --- /dev/null +++ b/scripts/workstation/claude-hooks/post-compact-recovery.sh @@ -0,0 +1,64 @@ +#!/bin/bash +# UserPromptSubmit hook: Inject recovery context after compaction +# This hook runs on each user prompt, but only injects context once after compaction. + +# Read hook input from stdin +INPUT=$(cat) + +# Extract session ID +SESSION_ID=$(echo "$INPUT" | jq -r '.session_id // .sessionId // "unknown"') + +# Define marker path +MEMORY_HOME="${MEMORY_HOME:-$HOME/.claude/claude-memory}" +MARKER_DIR="${MEMORY_HOME}/state/compaction-markers" +MARKER_FILE="${MARKER_DIR}/${SESSION_ID}.json" + +# Fast path: no marker means no recent compaction, exit immediately +if [ ! -f "$MARKER_FILE" ]; then + exit 0 +fi + +# Read marker contents +MARKER=$(cat "$MARKER_FILE") + +# Validate JSON before processing +if ! echo "$MARKER" | jq -e . 
>/dev/null 2>&1; then + rm -f "$MARKER_FILE" + exit 0 +fi + +# Extract data from marker +COMPACTED_AT=$(echo "$MARKER" | jq -r '.compactedAt // "unknown"') +PERSONALITY=$(echo "$MARKER" | jq -r '.personalityReminder // ""') + +# Build remembered facts summary (first 10 facts only) +FACTS_SUMMARY=$(echo "$MARKER" | jq -r ' + .rememberedFacts[:10] | + map("- [\(.category // "fact")] \(.content)") | + join("\n") +' 2>/dev/null || echo "") + +# Build recovery context (kept under 1000 tokens) +RECOVERY_CONTEXT="[Claude Memory Recovery - Context compacted at ${COMPACTED_AT}] + +${PERSONALITY} + +Key memories from before compaction: +${FACTS_SUMMARY} + +Run homelab memory recall '<query>' if you need more context about past conversations." + +# Output JSON with additional context for injection +cat << EOF +{ + "hookSpecificOutput": { + "hookEventName": "UserPromptSubmit", + "additionalContext": $(echo "$RECOVERY_CONTEXT" | jq -Rs .) + } +} +EOF + +# Delete marker file (one-time injection) +rm -f "$MARKER_FILE" + +exit 0 diff --git a/scripts/workstation/claude-hooks/pre-compact-backup.sh b/scripts/workstation/claude-hooks/pre-compact-backup.sh new file mode 100755 index 00000000..1194b12d --- /dev/null +++ b/scripts/workstation/claude-hooks/pre-compact-backup.sh @@ -0,0 +1,43 @@ +#!/bin/bash +# PreCompact hook: Save key memories before compaction +set -e + +INPUT=$(cat) +SESSION_ID=$(echo "$INPUT" | jq -r '.session_id // .sessionId // "unknown"') + +MEMORY_HOME="${MEMORY_HOME:-$HOME/.claude/claude-memory}" +MARKER_DIR="${MEMORY_HOME}/state/compaction-markers" +MEMORY_DB="${MEMORY_HOME}/memory/memory.db" +MARKER_FILE="${MARKER_DIR}/${SESSION_ID}.json" + +mkdir -p "$MARKER_DIR" + +TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ") + +# Try API first, fall back to SQLite +REMEMBERED_FACTS="[]" +if [ -n "${MEMORY_API_KEY:-${CLAUDE_MEMORY_API_KEY:-}}" ]; then + API_KEY="${MEMORY_API_KEY:-${CLAUDE_MEMORY_API_KEY:-}}" + API_URL="${MEMORY_API_URL:-${CLAUDE_MEMORY_API_URL:-}}" + if [ -n
"$API_URL" ]; then + REMEMBERED_FACTS=$(curl -sf -H "Authorization: Bearer ${API_KEY}" \ + "${API_URL}/api/memories?limit=20" 2>/dev/null | \ + jq '[.memories[] | {content, category, importance}]' 2>/dev/null || echo "[]") + fi +elif [ -f "$MEMORY_DB" ]; then + REMEMBERED_FACTS=$(sqlite3 -json "$MEMORY_DB" \ + "SELECT content, category, importance FROM memories ORDER BY importance DESC, created_at DESC LIMIT 20" 2>/dev/null || echo "[]") +fi + +if ! echo "$REMEMBERED_FACTS" | jq empty 2>/dev/null; then + REMEMBERED_FACTS="[]" +fi + +jq -n \ + --arg sid "$SESSION_ID" \ + --arg ts "$TIMESTAMP" \ + --argjson facts "$REMEMBERED_FACTS" \ + '{sessionId: $sid, compactedAt: $ts, rememberedFacts: $facts}' \ + > "$MARKER_FILE" + +exit 0 diff --git a/scripts/workstation/claude-hooks/wire-memory-hooks.py b/scripts/workstation/claude-hooks/wire-memory-hooks.py new file mode 100644 index 00000000..c33b504c --- /dev/null +++ b/scripts/workstation/claude-hooks/wire-memory-hooks.py @@ -0,0 +1,90 @@ +#!/usr/bin/env python3 +"""Wire the homelab-memory hooks into a user's ~/.claude/settings.json. + +Part of the claude-memory MCP -> homelab CLI migration (all-users rollout). +Two passes, idempotent, never touching `env` (the per-user MEMORY_API_KEY) or any +other setting: + (0) PRUNE any hook command still pointing at the retired claude-memory plugin + (`plugins/claude-memory/hooks/`). install_memory() rm -rf's that dir, so + those entries are dangling — and a missing UserPromptSubmit hook exits 2, + a BLOCKING error that erases the prompt and freezes the session (devvm emo + incident 2026-06-22). Must run BEFORE the additive pass: the plugin shares + basenames with the homelab hooks, so without pruning, the "already present" + check below matches the dead plugin path and skips the real install. + (1) ADD each homelab hook group when no existing command references its script. 
+ +Usage: wire-memory-hooks.py <home_dir> +Exit 0 on success (changed or already-present); 1 only on an unreadable settings file. +""" +import json +import os +import sys + +home = sys.argv[1] +settings = os.path.join(home, ".claude", "settings.json") +hooks_dir = os.path.join(home, ".claude", "hooks") + +# (event, script-basename used for the if-absent check, full command, extra fields) +WANT = [ + ("PreCompact", "pre-compact-backup.sh", f"{hooks_dir}/pre-compact-backup.sh", {"timeout": 30}), + ("UserPromptSubmit", "post-compact-recovery.sh", f"{hooks_dir}/post-compact-recovery.sh", {"timeout": 10}), + ("UserPromptSubmit", "homelab-memory-recall.py", f"python3 {hooks_dir}/homelab-memory-recall.py", {"timeout": 8}), + ("Stop", "auto-learn.py", f"python3 {hooks_dir}/auto-learn.py", {"async": True}), +] + +try: + if os.path.exists(settings) and os.path.getsize(settings) > 0: + with open(settings) as fh: + data = json.load(fh) + else: + data = {} +except (json.JSONDecodeError, OSError) as e: + print(f"ERROR: cannot read {settings}: {e}", file=sys.stderr) + sys.exit(1) + +hooks = data.setdefault("hooks", {}) +changed = False + +# (0) Prune dead claude-memory plugin hooks (see module docstring). Must precede +# the additive pass so shared basenames don't mask a needed install. +DEAD_REF = "plugins/claude-memory/hooks/" +for event in list(hooks.keys()): + new_groups = [] + removed_any = False + for g in (hooks.get(event) or []): + original = g.get("hooks") or [] + kept = [h for h in original if DEAD_REF not in (h.get("command", "") or "")] + if len(kept) != len(original): + removed_any = True + if kept: + new_groups.append({**g, "hooks": kept}) + if removed_any: + changed = True + if new_groups: + hooks[event] = new_groups + else: + del hooks[event] + +# (1) Additively wire each homelab hook, if no command already references it. 
+for event, basename, command, extra in WANT: + groups = hooks.setdefault(event, []) + already = any( + basename in (h.get("command", "") or "") + for g in groups + for h in (g.get("hooks", []) or []) + ) + if already: + continue + entry = {"type": "command", "command": command} + entry.update(extra) + groups.append({"hooks": [entry]}) + changed = True + +if changed: + tmp = settings + ".tmp" + with open(tmp, "w") as fh: + json.dump(data, fh, indent=2) + os.replace(tmp, settings) + print(f"wired memory hooks -> {settings}") +else: + print(f"memory hooks already present -> {settings} (no change)") diff --git a/scripts/workstation/claude-skills/README.md b/scripts/workstation/claude-skills/README.md new file mode 100644 index 00000000..816cbcb7 --- /dev/null +++ b/scripts/workstation/claude-skills/README.md @@ -0,0 +1,31 @@ +# claude-skills — vendored agent-skill snapshot + +Point-in-time snapshot of the admin's (`wizard`) Claude Code agent skills, deployed +per-user by `install_skills()` in `../../t3-provision-users.sh` (scoped to the +`SKILL_USERS` allowlist). Each subdirectory is one skill (`SKILL.md` + any bundled +references). The provisioner copies a skill into `~/.agents/skills/<name>/` (owned by +the user) and symlinks `~/.claude/skills/<name> -> ../../.agents/skills/<name>` — the +layout the `skills` CLI's `-g` install produces; Claude Code reads `~/.claude/skills/`. + +## Why vendored (not `npx skills add` at provision time) + +Upstream drifted from this set: on `mattpocock/skills` master, `diagnose` → +`diagnosing-bugs` and `write-a-skill` → `writing-great-skills` were renamed, and +`caveman` + `zoom-out` are no longer published — so `npx skills` cannot reproduce this +exact set. Vendoring is also offline/deterministic and keeps GitHub-clone + +unpinned-CLI dependencies out of the hourly **root** reconcile. 
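The install layout described at the top of this README (copy a skill into `~/.agents/skills/<name>/`, then point a relative symlink at it from `~/.claude/skills/<name>`) can be sketched roughly as below. This is an illustrative Python sketch only, not the provisioner's code: the real `install_skills()` is a shell function that also handles ownership and the `SKILL_USERS` allowlist, and the name `install_skill` plus the replace-on-reinstall behaviour are assumptions made for the example.

```python
import shutil
from pathlib import Path


def install_skill(snapshot_dir: Path, home: Path, name: str) -> Path:
    """Copy one vendored skill under `home` and link it for Claude Code.

    Hypothetical helper: mirrors the layout described in this README, nothing more.
    """
    src = snapshot_dir / name
    dest = home / ".agents" / "skills" / name
    link = home / ".claude" / "skills" / name

    dest.parent.mkdir(parents=True, exist_ok=True)
    link.parent.mkdir(parents=True, exist_ok=True)

    # Assumption: a re-install replaces the previous copy with the snapshot.
    if dest.exists():
        shutil.rmtree(dest)
    shutil.copytree(src, dest)

    # Relative target, resolved from ~/.claude/skills/, as in the README.
    target = Path("..") / ".." / ".agents" / "skills" / name
    if link.is_symlink():
        link.unlink()
    link.symlink_to(target, target_is_directory=True)
    return link
```

A relative target keeps the link valid if the home directory is moved or bind-mounted at a different path, which is presumably why the `skills` CLI layout uses one.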
+ +## Sources + +- `mattpocock/skills` (https://github.com/mattpocock/skills) — all except `find-skills` +- `vercel-labs/skills` (https://github.com/vercel-labs/skills) — `find-skills` + +## Refreshing + +Re-snapshot from a current install and commit the diff: + +```sh +cp -a ~/.agents/skills/. scripts/workstation/claude-skills/ +``` + +Snapshot taken 2026-06-23. diff --git a/scripts/workstation/claude-skills/caveman/SKILL.md b/scripts/workstation/claude-skills/caveman/SKILL.md new file mode 100644 index 00000000..85770a38 --- /dev/null +++ b/scripts/workstation/claude-skills/caveman/SKILL.md @@ -0,0 +1,49 @@ +--- +name: caveman +description: > + Ultra-compressed communication mode. Cuts token usage ~75% by dropping + filler, articles, and pleasantries while keeping full technical accuracy. + Use when user says "caveman mode", "talk like caveman", "use caveman", + "less tokens", "be brief", or invokes /caveman. +--- + +Respond terse like smart caveman. All technical substance stay. Only fluff die. + +## Persistence + +ACTIVE EVERY RESPONSE once triggered. No revert after many turns. No filler drift. Still active if unsure. Off only when user says "stop caveman" or "normal mode". + +## Rules + +Drop: articles (a/an/the), filler (just/really/basically/actually/simply), pleasantries (sure/certainly/of course/happy to), hedging. Fragments OK. Short synonyms (big not extensive, fix not "implement a solution for"). Abbreviate common terms (DB/auth/config/req/res/fn/impl). Strip conjunctions. Use arrows for causality (X -> Y). One word when one word enough. + +Technical terms stay exact. Code blocks unchanged. Errors quoted exact. + +Pattern: `[thing] [action] [reason]. [next step].` + +Not: "Sure! I'd be happy to help you with that. The issue you're experiencing is likely caused by..." +Yes: "Bug in auth middleware. Token expiry check use `<` not `<=`. Fix:" + +### Examples + +**"Why React component re-render?"** + +> Inline obj prop -> new ref -> re-render. `useMemo`. 
+ +**"Explain database connection pooling."** + +> Pool = reuse DB conn. Skip handshake -> fast under load. + +## Auto-Clarity Exception + +Drop caveman temporarily for: security warnings, irreversible action confirmations, multi-step sequences where fragment order risks misread, user asks to clarify or repeats question. Resume caveman after clear part done. + +Example -- destructive op: + +> **Warning:** This will permanently delete all rows in the `users` table and cannot be undone. +> +> ```sql +> DROP TABLE users; +> ``` +> +> Caveman resume. Verify backup exist first. diff --git a/scripts/workstation/claude-skills/diagnose/SKILL.md b/scripts/workstation/claude-skills/diagnose/SKILL.md new file mode 100644 index 00000000..ed55bda2 --- /dev/null +++ b/scripts/workstation/claude-skills/diagnose/SKILL.md @@ -0,0 +1,117 @@ +--- +name: diagnose +description: Disciplined diagnosis loop for hard bugs and performance regressions. Reproduce → minimise → hypothesise → instrument → fix → regression-test. Use when user says "diagnose this" / "debug this", reports a bug, says something is broken/throwing/failing, or describes a performance regression. +--- + +# Diagnose + +A discipline for hard bugs. Skip phases only when explicitly justified. + +When exploring the codebase, use the project's domain glossary to get a clear mental model of the relevant modules, and check ADRs in the area you're touching. + +## Phase 1 — Build a feedback loop + +**This is the skill.** Everything else is mechanical. If you have a fast, deterministic, agent-runnable pass/fail signal for the bug, you will find the cause — bisection, hypothesis-testing, and instrumentation all just consume that signal. If you don't have one, no amount of staring at code will save you. + +Spend disproportionate effort here. **Be aggressive. Be creative. Refuse to give up.** + +### Ways to construct one — try them in roughly this order + +1. 
**Failing test** at whatever seam reaches the bug — unit, integration, e2e. +2. **Curl / HTTP script** against a running dev server. +3. **CLI invocation** with a fixture input, diffing stdout against a known-good snapshot. +4. **Headless browser script** (Playwright / Puppeteer) — drives the UI, asserts on DOM/console/network. +5. **Replay a captured trace.** Save a real network request / payload / event log to disk; replay it through the code path in isolation. +6. **Throwaway harness.** Spin up a minimal subset of the system (one service, mocked deps) that exercises the bug code path with a single function call. +7. **Property / fuzz loop.** If the bug is "sometimes wrong output", run 1000 random inputs and look for the failure mode. +8. **Bisection harness.** If the bug appeared between two known states (commit, dataset, version), automate "boot at state X, check, repeat" so you can `git bisect run` it. +9. **Differential loop.** Run the same input through old-version vs new-version (or two configs) and diff outputs. +10. **HITL bash script.** Last resort. If a human must click, drive _them_ with `scripts/hitl-loop.template.sh` so the loop is still structured. Captured output feeds back to you. + +Build the right feedback loop, and the bug is 90% fixed. + +### Iterate on the loop itself + +Treat the loop as a product. Once you have _a_ loop, ask: + +- Can I make it faster? (Cache setup, skip unrelated init, narrow the test scope.) +- Can I make the signal sharper? (Assert on the specific symptom, not "didn't crash".) +- Can I make it more deterministic? (Pin time, seed RNG, isolate filesystem, freeze network.) + +A 30-second flaky loop is barely better than no loop. A 2-second deterministic loop is a debugging superpower. + +### Non-deterministic bugs + +The goal is not a clean repro but a **higher reproduction rate**. Loop the trigger 100×, parallelise, add stress, narrow timing windows, inject sleeps. 
A 50%-flake bug is debuggable; 1% is not — keep raising the rate until it's debuggable. + +### When you genuinely cannot build a loop + +Stop and say so explicitly. List what you tried. Ask the user for: (a) access to whatever environment reproduces it, (b) a captured artifact (HAR file, log dump, core dump, screen recording with timestamps), or (c) permission to add temporary production instrumentation. Do **not** proceed to hypothesise without a loop. + +Do not proceed to Phase 2 until you have a loop you believe in. + +## Phase 2 — Reproduce + +Run the loop. Watch the bug appear. + +Confirm: + +- [ ] The loop produces the failure mode the **user** described — not a different failure that happens to be nearby. Wrong bug = wrong fix. +- [ ] The failure is reproducible across multiple runs (or, for non-deterministic bugs, reproducible at a high enough rate to debug against). +- [ ] You have captured the exact symptom (error message, wrong output, slow timing) so later phases can verify the fix actually addresses it. + +Do not proceed until you reproduce the bug. + +## Phase 3 — Hypothesise + +Generate **3–5 ranked hypotheses** before testing any of them. Single-hypothesis generation anchors on the first plausible idea. + +Each hypothesis must be **falsifiable**: state the prediction it makes. + +> Format: "If <X> is the cause, then <changing Y> will make the bug disappear / <changing Z> will make it worse." + +If you cannot state the prediction, the hypothesis is a vibe — discard or sharpen it. + +**Show the ranked list to the user before testing.** They often have domain knowledge that re-ranks instantly ("we just deployed a change to #3"), or know hypotheses they've already ruled out. Cheap checkpoint, big time saver. Don't block on it — proceed with your ranking if the user is AFK. + +## Phase 4 — Instrument + +Each probe must map to a specific prediction from Phase 3. **Change one variable at a time.** + +Tool preference: + +1. 
**Debugger / REPL inspection** if the env supports it. One breakpoint beats ten logs. +2. **Targeted logs** at the boundaries that distinguish hypotheses. +3. Never "log everything and grep". + +**Tag every debug log** with a unique prefix, e.g. `[DEBUG-a4f2]`. Cleanup at the end becomes a single grep. Untagged logs survive; tagged logs die. + +**Perf branch.** For performance regressions, logs are usually wrong. Instead: establish a baseline measurement (timing harness, `performance.now()`, profiler, query plan), then bisect. Measure first, fix second. + +## Phase 5 — Fix + regression test + +Write the regression test **before the fix** — but only if there is a **correct seam** for it. + +A correct seam is one where the test exercises the **real bug pattern** as it occurs at the call site. If the only available seam is too shallow (single-caller test when the bug needs multiple callers, unit test that can't replicate the chain that triggered the bug), a regression test there gives false confidence. + +**If no correct seam exists, that itself is the finding.** Note it. The codebase architecture is preventing the bug from being locked down. Flag this for the next phase. + +If a correct seam exists: + +1. Turn the minimised repro into a failing test at that seam. +2. Watch it fail. +3. Apply the fix. +4. Watch it pass. +5. Re-run the Phase 1 feedback loop against the original (un-minimised) scenario. 
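
Steps 1 to 4 above can be sketched as follows. This is a minimal Python illustration built around a hypothetical token-expiry bug; `is_expired_buggy`, `is_expired_fixed` and `regression_test` are made-up names for the example, not part of any real codebase.

```python
from typing import Callable

ExpiryCheck = Callable[[int, int], bool]


def is_expired_buggy(expires_at: int, now: int) -> bool:
    # Hypothetical bug: the strict comparison keeps a token valid at the
    # exact expiry instant.
    return now > expires_at


def is_expired_fixed(expires_at: int, now: int) -> bool:
    # The fix: a token counts as expired from the expiry instant onwards.
    return now >= expires_at


def regression_test(is_expired: ExpiryCheck) -> bool:
    """Return True when the implementation handles the minimised repro."""
    at_boundary = is_expired(100, 100)   # the reported symptom: must be True
    before_expiry = is_expired(100, 99)  # guard against over-correcting: must be False
    return at_boundary and not before_expiry


if __name__ == "__main__":
    # Step 2: watch it fail. Step 4: watch it pass.
    print("before fix:", regression_test(is_expired_buggy))
    print("after fix: ", regression_test(is_expired_fixed))
```

In a real project the check would live in the test suite at the seam that reaches the bug, written and seen failing before the fix is applied.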
+ +## Phase 6 — Cleanup + post-mortem + +Required before declaring done: + +- [ ] Original repro no longer reproduces (re-run the Phase 1 loop) +- [ ] Regression test passes (or absence of seam is documented) +- [ ] All `[DEBUG-...]` instrumentation removed (`grep` the prefix) +- [ ] Throwaway prototypes deleted (or moved to a clearly-marked debug location) +- [ ] The hypothesis that turned out correct is stated in the commit / PR message — so the next debugger learns + +**Then ask: what would have prevented this bug?** If the answer involves architectural change (no good test seam, tangled callers, hidden coupling) hand off to the `/improve-codebase-architecture` skill with the specifics. Make the recommendation **after** the fix is in, not before — you have more information now than when you started. diff --git a/scripts/workstation/claude-skills/diagnose/scripts/hitl-loop.template.sh b/scripts/workstation/claude-skills/diagnose/scripts/hitl-loop.template.sh new file mode 100644 index 00000000..40afc465 --- /dev/null +++ b/scripts/workstation/claude-skills/diagnose/scripts/hitl-loop.template.sh @@ -0,0 +1,41 @@ +#!/usr/bin/env bash +# Human-in-the-loop reproduction loop. +# Copy this file, edit the steps below, and run it. +# The agent runs the script; the user follows prompts in their terminal. +# +# Usage: +# bash hitl-loop.template.sh +# +# Two helpers: +# step "<instruction>" → show instruction, wait for Enter +# capture VAR "<question>" → show question, read response into VAR +# +# At the end, captured values are printed as KEY=VALUE for the agent to parse. + +set -euo pipefail + +step() { + printf '\n>>> %s\n' "$1" + read -r -p " [Enter when done] " _ +} + +capture() { + local var="$1" question="$2" answer + printf '\n>>> %s\n' "$question" + read -r -p " > " answer + printf -v "$var" '%s' "$answer" +} + +# --- edit below --------------------------------------------------------- + +step "Open the app at http://localhost:3000 and sign in." 
+ +capture ERRORED "Click the 'Export' button. Did it throw an error? (y/n)" + +capture ERROR_MSG "Paste the error message (or 'none'):" + +# --- edit above --------------------------------------------------------- + +printf '\n--- Captured ---\n' +printf 'ERRORED=%s\n' "$ERRORED" +printf 'ERROR_MSG=%s\n' "$ERROR_MSG" diff --git a/scripts/workstation/claude-skills/find-skills/SKILL.md b/scripts/workstation/claude-skills/find-skills/SKILL.md new file mode 100644 index 00000000..114c6637 --- /dev/null +++ b/scripts/workstation/claude-skills/find-skills/SKILL.md @@ -0,0 +1,142 @@ +--- +name: find-skills +description: Helps users discover and install agent skills when they ask questions like "how do I do X", "find a skill for X", "is there a skill that can...", or express interest in extending capabilities. This skill should be used when the user is looking for functionality that might exist as an installable skill. +--- + +# Find Skills + +This skill helps you discover and install skills from the open agent skills ecosystem. + +## When to Use This Skill + +Use this skill when the user: + +- Asks "how do I do X" where X might be a common task with an existing skill +- Says "find a skill for X" or "is there a skill for X" +- Asks "can you do X" where X is a specialized capability +- Expresses interest in extending agent capabilities +- Wants to search for tools, templates, or workflows +- Mentions they wish they had help with a specific domain (design, testing, deployment, etc.) + +## What is the Skills CLI? + +The Skills CLI (`npx skills`) is the package manager for the open agent skills ecosystem. Skills are modular packages that extend agent capabilities with specialized knowledge, workflows, and tools. 
+ +**Key commands:** + +- `npx skills find [query]` - Search for skills interactively or by keyword +- `npx skills add <package>` - Install a skill from GitHub or other sources +- `npx skills check` - Check for skill updates +- `npx skills update` - Update all installed skills + +**Browse skills at:** https://skills.sh/ + +## How to Help Users Find Skills + +### Step 1: Understand What They Need + +When a user asks for help with something, identify: + +1. The domain (e.g., React, testing, design, deployment) +2. The specific task (e.g., writing tests, creating animations, reviewing PRs) +3. Whether this is a common enough task that a skill likely exists + +### Step 2: Check the Leaderboard First + +Before running a CLI search, check the [skills.sh leaderboard](https://skills.sh/) to see if a well-known skill already exists for the domain. The leaderboard ranks skills by total installs, surfacing the most popular and battle-tested options. + +For example, top skills for web development include: +- `vercel-labs/agent-skills` — React, Next.js, web design (100K+ installs each) +- `anthropics/skills` — Frontend design, document processing (100K+ installs) + +### Step 3: Search for Skills + +If the leaderboard doesn't cover the user's need, run the find command: + +```bash +npx skills find [query] +``` + +For example: + +- User asks "how do I make my React app faster?" → `npx skills find react performance` +- User asks "can you help me with PR reviews?" → `npx skills find pr review` +- User asks "I need to create a changelog" → `npx skills find changelog` + +### Step 4: Verify Quality Before Recommending + +**Do not recommend a skill based solely on search results.** Always verify: + +1. **Install count** — Prefer skills with 1K+ installs. Be cautious with anything under 100. +2. **Source reputation** — Official sources (`vercel-labs`, `anthropics`, `microsoft`) are more trustworthy than unknown authors. +3. **GitHub stars** — Check the source repository. 
A skill from a repo with <100 stars should be treated with skepticism. + +### Step 5: Present Options to the User + +When you find relevant skills, present them to the user with: + +1. The skill name and what it does +2. The install count and source +3. The install command they can run +4. A link to learn more at skills.sh + +Example response: + +``` +I found a skill that might help! The "react-best-practices" skill provides +React and Next.js performance optimization guidelines from Vercel Engineering. +(185K installs) + +To install it: +npx skills add vercel-labs/agent-skills@react-best-practices + +Learn more: https://skills.sh/vercel-labs/agent-skills/react-best-practices +``` + +### Step 6: Offer to Install + +If the user wants to proceed, you can install the skill for them: + +```bash +npx skills add <owner/repo@skill> -g -y +``` + +The `-g` flag installs globally (user-level) and `-y` skips confirmation prompts. + +## Common Skill Categories + +When searching, consider these common categories: + +| Category | Example Queries | +| --------------- | ---------------------------------------- | +| Web Development | react, nextjs, typescript, css, tailwind | +| Testing | testing, jest, playwright, e2e | +| DevOps | deploy, docker, kubernetes, ci-cd | +| Documentation | docs, readme, changelog, api-docs | +| Code Quality | review, lint, refactor, best-practices | +| Design | ui, ux, design-system, accessibility | +| Productivity | workflow, automation, git | + +## Tips for Effective Searches + +1. **Use specific keywords**: "react testing" is better than just "testing" +2. **Try alternative terms**: If "deploy" doesn't work, try "deployment" or "ci-cd" +3. **Check popular sources**: Many skills come from `vercel-labs/agent-skills` or `ComposioHQ/awesome-claude-skills` + +## When No Skills Are Found + +If no relevant skills exist: + +1. Acknowledge that no existing skill was found +2. Offer to help with the task directly using your general capabilities +3. 
Suggest the user could create their own skill with `npx skills init` + +Example: + +``` +I searched for skills related to "xyz" but didn't find any matches. +I can still help you with this task directly! Would you like me to proceed? + +If this is something you do often, you could create your own skill: +npx skills init my-xyz-skill +``` diff --git a/scripts/workstation/claude-skills/grill-me/SKILL.md b/scripts/workstation/claude-skills/grill-me/SKILL.md new file mode 100644 index 00000000..bd04394c --- /dev/null +++ b/scripts/workstation/claude-skills/grill-me/SKILL.md @@ -0,0 +1,10 @@ +--- +name: grill-me +description: Interview the user relentlessly about a plan or design until reaching shared understanding, resolving each branch of the decision tree. Use when user wants to stress-test a plan, get grilled on their design, or mentions "grill me". +--- + +Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer. + +Ask the questions one at a time. + +If a question can be answered by exploring the codebase, explore the codebase instead. diff --git a/scripts/workstation/claude-skills/grill-with-docs/ADR-FORMAT.md b/scripts/workstation/claude-skills/grill-with-docs/ADR-FORMAT.md new file mode 100644 index 00000000..da7e78ec --- /dev/null +++ b/scripts/workstation/claude-skills/grill-with-docs/ADR-FORMAT.md @@ -0,0 +1,47 @@ +# ADR Format + +ADRs live in `docs/adr/` and use sequential numbering: `0001-slug.md`, `0002-slug.md`, etc. + +Create the `docs/adr/` directory lazily — only when the first ADR is needed. + +## Template + +```md +# {Short title of the decision} + +{1-3 sentences: what's the context, what did we decide, and why.} +``` + +That's it. An ADR can be a single paragraph. The value is in recording *that* a decision was made and *why* — not in filling out sections. 
+ +## Optional sections + +Only include these when they add genuine value. Most ADRs won't need them. + +- **Status** frontmatter (`proposed | accepted | deprecated | superseded by ADR-NNNN`) — useful when decisions are revisited +- **Considered Options** — only when the rejected alternatives are worth remembering +- **Consequences** — only when non-obvious downstream effects need to be called out + +## Numbering + +Scan `docs/adr/` for the highest existing number and increment by one. + +## When to offer an ADR + +All three of these must be true: + +1. **Hard to reverse** — the cost of changing your mind later is meaningful +2. **Surprising without context** — a future reader will look at the code and wonder "why on earth did they do it this way?" +3. **The result of a real trade-off** — there were genuine alternatives and you picked one for specific reasons + +If a decision is easy to reverse, skip it — you'll just reverse it. If it's not surprising, nobody will wonder why. If there was no real alternative, there's nothing to record beyond "we did the obvious thing." + +### What qualifies + +- **Architectural shape.** "We're using a monorepo." "The write model is event-sourced, the read model is projected into Postgres." +- **Integration patterns between contexts.** "Ordering and Billing communicate via domain events, not synchronous HTTP." +- **Technology choices that carry lock-in.** Database, message bus, auth provider, deployment target. Not every library — just the ones that would take a quarter to swap out. +- **Boundary and scope decisions.** "Customer data is owned by the Customer context; other contexts reference it by ID only." The explicit no-s are as valuable as the yes-s. +- **Deliberate deviations from the obvious path.** "We're using manual SQL instead of an ORM because X." Anything where a reasonable reader would assume the opposite. These stop the next engineer from "fixing" something that was deliberate. 
+- **Constraints not visible in the code.** "We can't use AWS because of compliance requirements." "Response times must be under 200ms because of the partner API contract." +- **Rejected alternatives when the rejection is non-obvious.** If you considered GraphQL and picked REST for subtle reasons, record it — otherwise someone will suggest GraphQL again in six months. diff --git a/scripts/workstation/claude-skills/grill-with-docs/CONTEXT-FORMAT.md b/scripts/workstation/claude-skills/grill-with-docs/CONTEXT-FORMAT.md new file mode 100644 index 00000000..eaf2a185 --- /dev/null +++ b/scripts/workstation/claude-skills/grill-with-docs/CONTEXT-FORMAT.md @@ -0,0 +1,60 @@ +# CONTEXT.md Format + +## Structure + +```md +# {Context Name} + +{One or two sentence description of what this context is and why it exists.} + +## Language + +**Order**: +{A one or two sentence description of the term} +_Avoid_: Purchase, transaction + +**Invoice**: +A request for payment sent to a customer after delivery. +_Avoid_: Bill, payment request + +**Customer**: +A person or organization that places orders. +_Avoid_: Client, buyer, account +``` + +## Rules + +- **Be opinionated.** When multiple words exist for the same concept, pick the best one and list the others under `_Avoid_`. +- **Keep definitions tight.** One or two sentences max. Define what it IS, not what it does. +- **Only include terms specific to this project's context.** General programming concepts (timeouts, error types, utility patterns) don't belong even if the project uses them extensively. Before adding a term, ask: is this a concept unique to this context, or a general programming concept? Only the former belongs. +- **Group terms under subheadings** when natural clusters emerge. If all terms belong to a single cohesive area, a flat list is fine. + +## Single vs multi-context repos + +**Single context (most repos):** One `CONTEXT.md` at the repo root. 
+ +**Multiple contexts:** A `CONTEXT-MAP.md` at the repo root lists the contexts, where they live, and how they relate to each other: + +```md +# Context Map + +## Contexts + +- [Ordering](./src/ordering/CONTEXT.md) — receives and tracks customer orders +- [Billing](./src/billing/CONTEXT.md) — generates invoices and processes payments +- [Fulfillment](./src/fulfillment/CONTEXT.md) — manages warehouse picking and shipping + +## Relationships + +- **Ordering → Fulfillment**: Ordering emits `OrderPlaced` events; Fulfillment consumes them to start picking +- **Fulfillment → Billing**: Fulfillment emits `ShipmentDispatched` events; Billing consumes them to generate invoices +- **Ordering ↔ Billing**: Shared types for `CustomerId` and `Money` +``` + +The skill infers which structure applies: + +- If `CONTEXT-MAP.md` exists, read it to find contexts +- If only a root `CONTEXT.md` exists, single context +- If neither exists, create a root `CONTEXT.md` lazily when the first term is resolved + +When multiple contexts exist, infer which one the current topic relates to. If unclear, ask. diff --git a/scripts/workstation/claude-skills/grill-with-docs/SKILL.md b/scripts/workstation/claude-skills/grill-with-docs/SKILL.md new file mode 100644 index 00000000..5ea0aa91 --- /dev/null +++ b/scripts/workstation/claude-skills/grill-with-docs/SKILL.md @@ -0,0 +1,88 @@ +--- +name: grill-with-docs +description: Grilling session that challenges your plan against the existing domain model, sharpens terminology, and updates documentation (CONTEXT.md, ADRs) inline as decisions crystallise. Use when user wants to stress-test a plan against their project's language and documented decisions. +--- + +<what-to-do> + +Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer. 
+ +Ask the questions one at a time, waiting for feedback on each question before continuing. + +If a question can be answered by exploring the codebase, explore the codebase instead. + +</what-to-do> + +<supporting-info> + +## Domain awareness + +During codebase exploration, also look for existing documentation: + +### File structure + +Most repos have a single context: + +``` +/ +├── CONTEXT.md +├── docs/ +│ └── adr/ +│ ├── 0001-event-sourced-orders.md +│ └── 0002-postgres-for-write-model.md +└── src/ +``` + +If a `CONTEXT-MAP.md` exists at the root, the repo has multiple contexts. The map points to where each one lives: + +``` +/ +├── CONTEXT-MAP.md +├── docs/ +│ └── adr/ ← system-wide decisions +├── src/ +│ ├── ordering/ +│ │ ├── CONTEXT.md +│ │ └── docs/adr/ ← context-specific decisions +│ └── billing/ +│ ├── CONTEXT.md +│ └── docs/adr/ +``` + +Create files lazily — only when you have something to write. If no `CONTEXT.md` exists, create one when the first term is resolved. If no `docs/adr/` exists, create it when the first ADR is needed. + +## During the session + +### Challenge against the glossary + +When the user uses a term that conflicts with the existing language in `CONTEXT.md`, call it out immediately. "Your glossary defines 'cancellation' as X, but you seem to mean Y — which is it?" + +### Sharpen fuzzy language + +When the user uses vague or overloaded terms, propose a precise canonical term. "You're saying 'account' — do you mean the Customer or the User? Those are different things." + +### Discuss concrete scenarios + +When domain relationships are being discussed, stress-test them with specific scenarios. Invent scenarios that probe edge cases and force the user to be precise about the boundaries between concepts. + +### Cross-reference with code + +When the user states how something works, check whether the code agrees. 
If you find a contradiction, surface it: "Your code cancels entire Orders, but you just said partial cancellation is possible — which is right?" + +### Update CONTEXT.md inline + +When a term is resolved, update `CONTEXT.md` right there. Don't batch these up — capture them as they happen. Use the format in [CONTEXT-FORMAT.md](./CONTEXT-FORMAT.md). + +`CONTEXT.md` should be totally devoid of implementation details. Do not treat `CONTEXT.md` as a spec, a scratch pad, or a repository for implementation decisions. It is a glossary and nothing else. + +### Offer ADRs sparingly + +Only offer to create an ADR when all three are true: + +1. **Hard to reverse** — the cost of changing your mind later is meaningful +2. **Surprising without context** — a future reader will wonder "why did they do it this way?" +3. **The result of a real trade-off** — there were genuine alternatives and you picked one for specific reasons + +If any of the three is missing, skip the ADR. Use the format in [ADR-FORMAT.md](./ADR-FORMAT.md). + +</supporting-info> diff --git a/scripts/workstation/claude-skills/handoff/SKILL.md b/scripts/workstation/claude-skills/handoff/SKILL.md new file mode 100644 index 00000000..28bfb3ab --- /dev/null +++ b/scripts/workstation/claude-skills/handoff/SKILL.md @@ -0,0 +1,13 @@ +--- +name: handoff +description: Compact the current conversation into a handoff document for another agent to pick up. +argument-hint: "What will the next session be used for?" +--- + +Write a handoff document summarising the current conversation so a fresh agent can continue the work. Save it to a path produced by `mktemp -t handoff-XXXXXX.md` (read the file before you write to it). + +Suggest the skills to be used, if any, by the next session. + +Do not duplicate content already captured in other artifacts (PRDs, plans, ADRs, issues, commits, diffs). Reference them by path or URL instead. 
+ +If the user passed arguments, treat them as a description of what the next session will focus on and tailor the doc accordingly. diff --git a/scripts/workstation/claude-skills/improve-codebase-architecture/DEEPENING.md b/scripts/workstation/claude-skills/improve-codebase-architecture/DEEPENING.md new file mode 100644 index 00000000..ecaf5d7d --- /dev/null +++ b/scripts/workstation/claude-skills/improve-codebase-architecture/DEEPENING.md @@ -0,0 +1,37 @@ +# Deepening + +How to deepen a cluster of shallow modules safely, given its dependencies. Assumes the vocabulary in [LANGUAGE.md](LANGUAGE.md) — **module**, **interface**, **seam**, **adapter**. + +## Dependency categories + +When assessing a candidate for deepening, classify its dependencies. The category determines how the deepened module is tested across its seam. + +### 1. In-process + +Pure computation, in-memory state, no I/O. Always deepenable — merge the modules and test through the new interface directly. No adapter needed. + +### 2. Local-substitutable + +Dependencies that have local test stand-ins (PGLite for Postgres, in-memory filesystem). Deepenable if the stand-in exists. The deepened module is tested with the stand-in running in the test suite. The seam is internal; no port at the module's external interface. + +### 3. Remote but owned (Ports & Adapters) + +Your own services across a network boundary (microservices, internal APIs). Define a **port** (interface) at the seam. The deep module owns the logic; the transport is injected as an **adapter**. Tests use an in-memory adapter. Production uses an HTTP/gRPC/queue adapter. + +Recommendation shape: *"Define a port at the seam, implement an HTTP adapter for production and an in-memory adapter for testing, so the logic sits in one deep module even though it's deployed across a network."* + +### 4. True external (Mock) + +Third-party services (Stripe, Twilio, etc.) you don't control. 
The deepened module takes the external dependency as an injected port; tests provide a mock adapter. + +## Seam discipline + +- **One adapter means a hypothetical seam. Two adapters means a real one.** Don't introduce a port unless at least two adapters are justified (typically production + test). A single-adapter seam is just indirection. +- **Internal seams vs external seams.** A deep module can have internal seams (private to its implementation, used by its own tests) as well as the external seam at its interface. Don't expose internal seams through the interface just because tests use them. + +## Testing strategy: replace, don't layer + +- Old unit tests on shallow modules become waste once tests at the deepened module's interface exist — delete them. +- Write new tests at the deepened module's interface. The **interface is the test surface**. +- Tests assert on observable outcomes through the interface, not internal state. +- Tests should survive internal refactors — they describe behaviour, not implementation. If a test has to change when the implementation changes, it's testing past the interface. diff --git a/scripts/workstation/claude-skills/improve-codebase-architecture/HTML-REPORT.md b/scripts/workstation/claude-skills/improve-codebase-architecture/HTML-REPORT.md new file mode 100644 index 00000000..8adc368f --- /dev/null +++ b/scripts/workstation/claude-skills/improve-codebase-architecture/HTML-REPORT.md @@ -0,0 +1,123 @@ +# HTML Report Format + +The architectural review is rendered as a single self-contained HTML file in the OS temp directory. Tailwind and Mermaid both come from CDNs. Mermaid handles graph-shaped diagrams reliably; hand-built divs and inline SVG handle the more editorial visuals (mass diagrams, cross-sections). Mix the two — don't lean on Mermaid for everything, it'll start to look generic. 
+ +## Scaffold + +```html +<!doctype html> +<html lang="en"> + <head> + <meta charset="utf-8" /> + <title>Architecture review — {{repo name}} + + + + + +
+
...
+
...
+
...
+
+ + +``` + +## Header + +Repo name, date, and a compact legend: solid box = module, dashed line = seam, red arrow = leakage, thick dark box = deep module. No introduction paragraph — straight into the candidates. + +## Candidate card + +The diagrams carry the weight. Prose is sparse, plain, and uses the glossary terms ([LANGUAGE.md](LANGUAGE.md)) without ceremony. + +Each candidate is one card: + +- **Title** — short, names the deepening (e.g. "Collapse the Order intake pipeline"). +- **Badge row** — recommendation strength (`Strong` = emerald, `Worth exploring` = amber, `Speculative` = slate), plus a tag for the dependency category (`in-process`, `local-substitutable`, `ports & adapters`, `mock`). +- **Files** — monospaced list, `font-mono text-sm`. +- **Before / After diagram** — the centrepiece. Two columns, side by side. See patterns below. +- **Problem** — one sentence. What hurts. +- **Solution** — one sentence. What changes. +- **Wins** — bullets, ≤6 words each. e.g. "Tests hit one interface", "Pricing logic stops leaking", "Delete 4 shallow wrappers". +- **ADR callout** (if applicable) — one line in an amber-tinted box. + +No paragraphs of explanation. If the diagram needs a paragraph to be understood, redraw the diagram. + +## Diagram patterns + +Pick the pattern that fits the candidate. Mix them. Don't make every diagram look the same — variety is part of the point. + +### Mermaid graph (the workhorse for dependencies / call flow) + +Use a Mermaid `flowchart` or `graph` when the point is "X calls Y calls Z, and look at the mess." Wrap it in a Tailwind-styled card so it doesn't feel parachuted in. Style with classDef to colour leakage edges red and the deep module dark. Sequence diagrams work well for "before: 6 round-trips; after: 1." + +```html +
+
+    flowchart LR
+      A[OrderHandler] --> B[OrderValidator]
+      B --> C[OrderRepo]
+      C -.leak.-> D[PricingClient]
+      classDef leak stroke:#dc2626,stroke-width:2px;
+      class C,D leak
+  
+
+``` + +### Hand-built boxes-and-arrows (when Mermaid's layout fights you) + +Modules as divs with borders and labels. Arrows as inline SVG elements positioned absolutely over a relative container. Reach for this when you want the "after" diagram to feel like one thick-bordered deep module with greyed-out internals — Mermaid won't render that with the right weight. + +### Cross-section (good for layered shallowness) + +Stack horizontal bands (`h-12 border-l-4`) to show layers a call passes through. Before: 6 thin layers each doing nothing. After: 1 thick band labelled with the consolidated responsibility. + +### Mass diagram (good for "interface as wide as implementation") + +Two rectangles per module — one for interface surface area, one for implementation. Before: interface rectangle is nearly as tall as the implementation rectangle (shallow). After: interface rectangle is short, implementation rectangle is tall (deep). + +### Call-graph collapse + +Before: a tree of function calls rendered as nested boxes. After: the same tree collapsed into one box, with the now-internal calls shown faded inside it. + +## Style guidance + +- Lean editorial, not corporate-dashboard. Generous whitespace. Serif optional for headings (`font-serif` works well with stone/slate). +- Colour sparingly: one accent (emerald or indigo) plus red for leakage and amber for warnings. +- Keep diagrams ~320px tall so before/after sits comfortably side by side without scrolling. +- Use `text-xs uppercase tracking-wider` for module labels inside diagrams — they should read as schematic, not as UI. +- The only scripts are the Tailwind CDN and the Mermaid ESM import. The report is otherwise static — no app code, no interactivity beyond Mermaid's own rendering. + +## Top recommendation section + +One larger card. Candidate name, one sentence on why, anchor link to its card. That's it. + +## Tone + +Plain English, concise — but the architectural nouns and verbs come straight from [LANGUAGE.md](LANGUAGE.md). Concision is not an excuse to drift. 
+ +**Use exactly:** module, interface, implementation, depth, deep, shallow, seam, adapter, leverage, locality. + +**Never substitute:** component, service, unit (for module) · API, signature (for interface) · boundary (for seam) · layer, wrapper (for module, when you mean module). + +**Phrasings that fit the style:** + +- "Order intake module is shallow — interface nearly matches the implementation." +- "Pricing leaks across the seam." +- "Deepen: one interface, one place to test." +- "Two adapters justify the seam: HTTP in prod, in-memory in tests." + +**Wins bullets** name the gain in glossary terms: *"locality: bugs concentrate in one module"*, *"leverage: one interface, N call sites"*, *"interface shrinks; implementation absorbs the wrappers"*. Don't write *"easier to maintain"* or *"cleaner code"* — those terms aren't in the glossary and don't earn their place. + +No hedging, no throat-clearing, no "it's worth noting that…". If a sentence could be a bullet, make it a bullet. If a bullet could be cut, cut it. If a term isn't in [LANGUAGE.md](LANGUAGE.md), reach for one that is before inventing a new one. diff --git a/scripts/workstation/claude-skills/improve-codebase-architecture/INTERFACE-DESIGN.md b/scripts/workstation/claude-skills/improve-codebase-architecture/INTERFACE-DESIGN.md new file mode 100644 index 00000000..3197723a --- /dev/null +++ b/scripts/workstation/claude-skills/improve-codebase-architecture/INTERFACE-DESIGN.md @@ -0,0 +1,44 @@ +# Interface Design + +When the user wants to explore alternative interfaces for a chosen deepening candidate, use this parallel sub-agent pattern. Based on "Design It Twice" (Ousterhout) — your first idea is unlikely to be the best. + +Uses the vocabulary in [LANGUAGE.md](LANGUAGE.md) — **module**, **interface**, **seam**, **adapter**, **leverage**. + +## Process + +### 1. 
Frame the problem space
+
+Before spawning sub-agents, write a user-facing explanation of the problem space for the chosen candidate:
+
+- The constraints any new interface would need to satisfy
+- The dependencies it would rely on, and which category they fall into (see [DEEPENING.md](DEEPENING.md))
+- A rough illustrative code sketch to ground the constraints — not a proposal, just a way to make the constraints concrete
+
+Show this to the user, then immediately proceed to Step 2. The user reads and thinks while the sub-agents work in parallel.
+
+### 2. Spawn sub-agents
+
+Spawn 3+ sub-agents in parallel using the Agent tool. Each must produce a **radically different** interface for the deepened module.
+
+Prompt each sub-agent with a separate technical brief (file paths, coupling details, dependency category from [DEEPENING.md](DEEPENING.md), what sits behind the seam). The brief is independent of the user-facing problem-space explanation in Step 1. Give each agent a different design constraint:
+
+- Agent 1: "Minimize the interface — aim for 1–3 entry points max. Maximise leverage per entry point."
+- Agent 2: "Maximise flexibility — support many use cases and extension."
+- Agent 3: "Optimise for the most common caller — make the default case trivial."
+- Agent 4 (if applicable): "Design around ports & adapters for cross-seam dependencies."
+
+Include both [LANGUAGE.md](LANGUAGE.md) vocabulary and CONTEXT.md vocabulary in the brief so each sub-agent names things consistently with the architecture language and the project's domain language.
+
+Each sub-agent outputs:
+
+1. Interface (types, methods, params — plus invariants, ordering, error modes)
+2. Usage example showing how callers use it
+3. What the implementation hides behind the seam
+4. Dependency strategy and adapters (see [DEEPENING.md](DEEPENING.md))
+5. Trade-offs — where leverage is high, where it's thin
+
+### 3. Present and compare
+
+Present designs sequentially so the user can absorb each one, then compare them in prose. Contrast by **depth** (leverage at the interface), **locality** (where change concentrates), and **seam placement**.
+
+After comparing, give your own recommendation: which design you think is strongest and why. If elements from different designs would combine well, propose a hybrid. Be opinionated — the user wants a strong read, not a menu.
diff --git a/scripts/workstation/claude-skills/improve-codebase-architecture/LANGUAGE.md b/scripts/workstation/claude-skills/improve-codebase-architecture/LANGUAGE.md
new file mode 100644
index 00000000..530c2763
--- /dev/null
+++ b/scripts/workstation/claude-skills/improve-codebase-architecture/LANGUAGE.md
@@ -0,0 +1,53 @@
+# Language
+
+Shared vocabulary for every suggestion this skill makes. Use these terms exactly — don't substitute "component," "service," "API," or "boundary." Consistent language is the whole point.
+
+## Terms
+
+**Module**
+Anything with an interface and an implementation. Deliberately scale-agnostic — applies equally to a function, class, package, or tier-spanning slice.
+_Avoid_: unit, component, service.
+
+**Interface**
+Everything a caller must know to use the module correctly. Includes the type signature, but also invariants, ordering constraints, error modes, required configuration, and performance characteristics.
+_Avoid_: API, signature (too narrow — those refer only to the type-level surface).
+
+**Implementation**
+What's inside a module — its body of code. Distinct from **Adapter**: a thing can be a small adapter with a large implementation (a Postgres repo) or a large adapter with a small implementation (an in-memory fake). Reach for "adapter" when the seam is the topic; "implementation" otherwise.
+
+**Depth**
+Leverage at the interface — the amount of behaviour a caller (or test) can exercise per unit of interface they have to learn. A module is **deep** when a large amount of behaviour sits behind a small interface. A module is **shallow** when the interface is nearly as complex as the implementation.
+
+**Seam** _(from Michael Feathers)_
+A place where you can alter behaviour without editing in that place. The *location* at which a module's interface lives. Choosing where to put the seam is its own design decision, distinct from what goes behind it.
+_Avoid_: boundary (overloaded with DDD's bounded context).
+
+**Adapter**
+A concrete thing that satisfies an interface at a seam. Describes *role* (what slot it fills), not substance (what's inside).
+
+**Leverage**
+What callers get from depth. More capability per unit of interface they have to learn. One implementation pays back across N call sites and M tests.
+
+**Locality**
+What maintainers get from depth. Change, bugs, knowledge, and verification concentrate at one place rather than spreading across callers. Fix once, fixed everywhere.
+
+## Principles
+
+- **Depth is a property of the interface, not the implementation.** A deep module can be internally composed of small, mockable, swappable parts — they just aren't part of the interface. A module can have **internal seams** (private to its implementation, used by its own tests) as well as the **external seam** at its interface.
+- **The deletion test.** Imagine deleting the module. If complexity vanishes, the module wasn't hiding anything (it was a pass-through). If complexity reappears across N callers, the module was earning its keep.
+- **The interface is the test surface.** Callers and tests cross the same seam. If you want to test *past* the interface, the module is probably the wrong shape.
+- **One adapter means a hypothetical seam. Two adapters means a real one.** Don't introduce a seam unless something actually varies across it.
+
+## Relationships
+
+- A **Module** has exactly one **Interface** (the surface it presents to callers and tests).
+- **Depth** is a property of a **Module**, measured against its **Interface**.
+- A **Seam** is where a **Module**'s **Interface** lives.
+- An **Adapter** sits at a **Seam** and satisfies the **Interface**.
+- **Depth** produces **Leverage** for callers and **Locality** for maintainers.
+
+## Rejected framings
+
+- **Depth as ratio of implementation-lines to interface-lines** (Ousterhout): rewards padding the implementation. We use depth-as-leverage instead.
+- **"Interface" as the TypeScript `interface` keyword or a class's public methods**: too narrow — interface here includes every fact a caller must know.
+- **"Boundary"**: overloaded with DDD's bounded context. Say **seam** or **interface**.
diff --git a/scripts/workstation/claude-skills/improve-codebase-architecture/SKILL.md b/scripts/workstation/claude-skills/improve-codebase-architecture/SKILL.md
new file mode 100644
index 00000000..c12b263b
--- /dev/null
+++ b/scripts/workstation/claude-skills/improve-codebase-architecture/SKILL.md
@@ -0,0 +1,81 @@
+---
+name: improve-codebase-architecture
+description: Find deepening opportunities in a codebase, informed by the domain language in CONTEXT.md and the decisions in docs/adr/. Use when the user wants to improve architecture, find refactoring opportunities, consolidate tightly-coupled modules, or make a codebase more testable and AI-navigable.
+---
+
+# Improve Codebase Architecture
+
+Surface architectural friction and propose **deepening opportunities** — refactors that turn shallow modules into deep ones. The aim is testability and AI-navigability.
+
+## Glossary
+
+Use these terms exactly in every suggestion. Consistent language is the point — don't drift into "component," "service," "API," or "boundary." Full definitions in [LANGUAGE.md](LANGUAGE.md).
+
+- **Module** — anything with an interface and an implementation (function, class, package, slice).
+- **Interface** — everything a caller must know to use the module: types, invariants, error modes, ordering, config. Not just the type signature.
+- **Implementation** — the code inside.
+- **Depth** — leverage at the interface: a lot of behaviour behind a small interface. **Deep** = high leverage. **Shallow** = interface nearly as complex as the implementation.
+- **Seam** — where an interface lives; a place behaviour can be altered without editing in place. (Use this, not "boundary.")
+- **Adapter** — a concrete thing satisfying an interface at a seam.
+- **Leverage** — what callers get from depth.
+- **Locality** — what maintainers get from depth: change, bugs, knowledge concentrated in one place.
+
+Key principles (see [LANGUAGE.md](LANGUAGE.md) for the full list):
+
+- **Deletion test**: imagine deleting the module. If complexity vanishes, it was a pass-through. If complexity reappears across N callers, it was earning its keep.
+- **The interface is the test surface.**
+- **One adapter = hypothetical seam. Two adapters = real seam.**
+
+This skill is _informed_ by the project's domain model. The domain language gives names to good seams; ADRs record decisions the skill should not re-litigate.
+
+## Process
+
+### 1. Explore
+
+Read the project's domain glossary and any ADRs in the area you're touching first.
+
+Then use the Agent tool with `subagent_type=Explore` to walk the codebase. Don't follow rigid heuristics — explore organically and note where you experience friction:
+
+- Where does understanding one concept require bouncing between many small modules?
+- Where are modules **shallow** — interface nearly as complex as the implementation?
+- Where have pure functions been extracted just for testability, but the real bugs hide in how they're called (no **locality**)?
+- Where do tightly-coupled modules leak across their seams?
+- Which parts of the codebase are untested, or hard to test through their current interface?
+
+Apply the **deletion test** to anything you suspect is shallow: would deleting it concentrate complexity, or just move it? A "yes, concentrates" is the signal you want.
+
+### 2. Present candidates as an HTML report
+
+Write a self-contained HTML file to the OS temp directory so nothing lands in the repo. Resolve the temp dir from `$TMPDIR`, falling back to `/tmp` (or `%TEMP%` on Windows), and write to `/architecture-review-.html` so each run gets a fresh file. Open it for the user — `xdg-open ` on Linux, `open ` on macOS, `start ` on Windows — and tell them the absolute path.
+
+The report uses **Tailwind via CDN** for layout and styling, and **Mermaid via CDN** for diagrams where a graph/flow/sequence reliably communicates the structure. Mix Mermaid with hand-crafted CSS/SVG visuals — use Mermaid when relationships are graph-shaped (call graphs, dependencies, sequences), and hand-built divs/SVG when you want something more editorial (mass diagrams, cross-sections, collapse animations). Each candidate gets a **before/after visualisation**. Be visual.
+
+For each candidate, the same template as before, but rendered as a card:
+
+- **Files** — which files/modules are involved
+- **Problem** — why the current architecture is causing friction
+- **Solution** — plain English description of what would change
+- **Benefits** — explained in terms of locality and leverage, and how tests would improve
+- **Before / After diagram** — side-by-side, custom-drawn, illustrating the shallowness and the deepening
+- **Recommendation strength** — one of `Strong`, `Worth exploring`, `Speculative`, rendered as a badge
+
+End the report with a **Top recommendation** section: which candidate you'd tackle first and why.
+
+**Use CONTEXT.md vocabulary for the domain, and [LANGUAGE.md](LANGUAGE.md) vocabulary for the architecture.** If `CONTEXT.md` defines "Order," talk about "the Order intake module" — not "the FooBarHandler," and not "the Order service."
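
The temp-file handling described in step 2 can be sketched roughly as follows. This is a minimal sketch, not part of the skill: `resolveTempDir`, `reportPath`, `openCommand`, and the `stamp` suffix are hypothetical names chosen for illustration, and the environment is passed in as a plain object so the logic stays testable.

```typescript
// Hypothetical helpers illustrating step 2; none of these names come from the skill.
type Env = Record<string, string | undefined>;
type Platform = "linux" | "darwin" | "win32";

// $TMPDIR wins when set; otherwise %TEMP% on Windows, and /tmp as the last resort.
function resolveTempDir(env: Env, platform: Platform): string {
  if (env.TMPDIR) return env.TMPDIR;
  if (platform === "win32" && env.TEMP) return env.TEMP;
  return "/tmp";
}

// Build a per-run file path. The unique suffix (`stamp`) is an assumption:
// anything that differs between runs, such as a timestamp, would do.
function reportPath(tempDir: string, stamp: string, platform: Platform): string {
  const sep = platform === "win32" ? "\\" : "/";
  const base = tempDir.endsWith(sep) ? tempDir.slice(0, -sep.length) : tempDir;
  return `${base}${sep}architecture-review-${stamp}.html`;
}

// The opener command the text names for each platform.
function openCommand(platform: Platform): string {
  switch (platform) {
    case "linux":
      return "xdg-open";
    case "darwin":
      return "open";
    case "win32":
      return "start";
  }
}

const dir = resolveTempDir({ TMPDIR: "/var/tmp/" }, "darwin");
const file = reportPath(dir, "20260101-120000", "darwin");
console.log(`${openCommand("darwin")} ${file}`);
// prints "open /var/tmp/architecture-review-20260101-120000.html"
```

In a real run the caller would pass the process environment and the host platform (narrowed to these three values) instead of the literals used here.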
+
+**ADR conflicts**: if a candidate contradicts an existing ADR, only surface it when the friction is real enough to warrant revisiting the ADR. Mark it clearly in the card (e.g. a warning callout: _"contradicts ADR-0007 — but worth reopening because…"_). Don't list every theoretical refactor an ADR forbids.
+
+See [HTML-REPORT.md](HTML-REPORT.md) for the full HTML scaffold, diagram patterns, and styling guidance.
+
+Do NOT propose interfaces yet. After the file is written, ask the user: "Which of these would you like to explore?"
+
+### 3. Grilling loop
+
+Once the user picks a candidate, drop into a grilling conversation. Walk the design tree with them — constraints, dependencies, the shape of the deepened module, what sits behind the seam, what tests survive.
+
+Side effects happen inline as decisions crystallize:
+
+- **Naming a deepened module after a concept not in `CONTEXT.md`?** Add the term to `CONTEXT.md` — same discipline as `/grill-with-docs` (see [CONTEXT-FORMAT.md](../grill-with-docs/CONTEXT-FORMAT.md)). Create the file lazily if it doesn't exist.
+- **Sharpening a fuzzy term during the conversation?** Update `CONTEXT.md` right there.
+- **User rejects the candidate with a load-bearing reason?** Offer an ADR, framed as: _"Want me to record this as an ADR so future architecture reviews don't re-suggest it?"_ Only offer when the reason would actually be needed by a future explorer to avoid re-suggesting the same thing — skip ephemeral reasons ("not worth it right now") and self-evident ones. See [ADR-FORMAT.md](../grill-with-docs/ADR-FORMAT.md).
+- **Want to explore alternative interfaces for the deepened module?** See [INTERFACE-DESIGN.md](INTERFACE-DESIGN.md).
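
To make the glossary terms concrete, here is one possible sketch of a deep module with a real seam. Everything in it is hypothetical: the "Order intake" name is borrowed from the vocabulary example in step 2, and `OrderStore`, `createOrderIntake`, and both adapters are invented for illustration rather than taken from any project.

```typescript
// Illustrative only: all names are hypothetical.
type Order = { sku: string; quantity: number; totalCents: number };
type RawOrder = { sku: string; quantity: number; unitPriceCents: number };
type IntakeResult = { ok: true; order: Order } | { ok: false; reason: string };

// The seam: storage is the thing that varies across it.
interface OrderStore {
  save(order: Order): void;
  count(): number;
}

// Adapter 1: an in-memory store, the kind a test would plug in.
class InMemoryOrderStore implements OrderStore {
  private readonly orders: Order[] = [];
  save(order: Order): void {
    this.orders.push(order);
  }
  count(): number {
    return this.orders.length;
  }
}

// Adapter 2: serialises each order to a JSON line, standing in for a persistent store.
class JsonLinesOrderStore implements OrderStore {
  readonly lines: string[] = [];
  save(order: Order): void {
    this.lines.push(JSON.stringify(order));
  }
  count(): number {
    return this.lines.length;
  }
}

// The deep module: one entry point, with normalisation, validation,
// pricing, and persistence all sitting behind it.
function createOrderIntake(store: OrderStore) {
  return {
    submit(raw: RawOrder): IntakeResult {
      const sku = raw.sku.trim().toUpperCase();
      if (sku === "") return { ok: false, reason: "sku is required" };
      if (!Number.isInteger(raw.quantity) || raw.quantity <= 0) {
        return { ok: false, reason: "quantity must be a positive integer" };
      }
      const order: Order = { sku, quantity: raw.quantity, totalCents: raw.quantity * raw.unitPriceCents };
      store.save(order);
      return { ok: true, order };
    },
  };
}

// Callers and tests cross the same seam, whichever adapter is plugged in.
const memory = new InMemoryOrderStore();
const accepted = createOrderIntake(memory).submit({ sku: " ab-1 ", quantity: 2, unitPriceCents: 150 });
const jsonl = new JsonLinesOrderStore();
const rejected = createOrderIntake(jsonl).submit({ sku: "ab-1", quantity: 0, unitPriceCents: 150 });
console.log(accepted.ok, memory.count(), rejected.ok, jsonl.count());
// prints "true 1 false 0"
```

Because two adapters satisfy `OrderStore`, the seam counts as real under the "two adapters" principle; with only the in-memory one it would still be hypothetical.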
diff --git a/scripts/workstation/claude-skills/prototype/LOGIC.md b/scripts/workstation/claude-skills/prototype/LOGIC.md
new file mode 100644
index 00000000..526ecb18
--- /dev/null
+++ b/scripts/workstation/claude-skills/prototype/LOGIC.md
@@ -0,0 +1,79 @@
+# Logic Prototype
+
+A tiny interactive terminal app that lets the user drive a state model by hand. Use this when the question is about **business logic, state transitions, or data shape** — the kind of thing that looks reasonable on paper but only feels wrong once you push it through real cases.
+
+## When this is the right shape
+
+- "I'm not sure if this state machine handles the edge case where X then Y."
+- "Does this data model actually let me represent the case where..."
+- "I want to feel out what the API should look like before writing it."
+- Anything where the user wants to **press buttons and watch state change**.
+
+If the question is "what should this look like" — wrong branch. Use [UI.md](UI.md).
+
+## Process
+
+### 1. State the question
+
+Before writing code, write down what state model and what question you're prototyping. One paragraph, in the prototype's README or a comment at the top of the file. A logic prototype that answers the wrong question is pure waste — make the question explicit so it can be checked later, whether the user is watching now or returning to it AFK.
+
+### 2. Pick the language
+
+Use whatever the host project uses. If the project has no obvious runtime (e.g. a docs repo), ask.
+
+Match the project's existing conventions for tooling — don't add a new package manager or runtime just for the prototype.
+
+### 3. Isolate the logic in a portable module
+
+Put the actual logic — the bit that's answering the question — behind a small, pure interface that could be lifted out and dropped into the real codebase later. The TUI around it is throwaway; the logic module shouldn't be.
+
+The right shape depends on the question:
+
+- **A pure reducer** — `(state, action) => state`. Good when actions are discrete events and state is a single value.
+- **A state machine** — explicit states and transitions. Good when "which actions are even legal right now" is part of the question.
+- **A small set of pure functions** over a plain data type. Good when there's no implicit current state — just transformations.
+- **A class or module with a clear method surface** when the logic genuinely owns ongoing internal state.
+
+Pick whichever shape best fits the question being asked, *not* whichever is easiest to wire to a TUI. Keep it pure: no I/O, no terminal code, no `console.log` for control flow. The TUI imports it and calls into it; nothing flows the other direction.
+
+This is what makes the prototype useful past its own lifetime. When the question's been answered, the validated reducer / machine / function set can be lifted into the real module — the TUI shell gets deleted.
+
+### 4. Build the smallest TUI that exposes the state
+
+Build it as a **lightweight TUI** — on every tick, clear the screen (`console.clear()` / `print("\033[2J\033[H")` / equivalent) and re-render the whole frame. The user should always see one stable view, not an ever-growing scrollback.
+
+Each frame has two parts, in this order:
+
+1. **Current state**, pretty-printed and diff-friendly (one field per line, or formatted JSON). Use **bold** for field names or section headers and **dim** for less important context (timestamps, IDs, derived values). Native ANSI escape codes are fine — `\x1b[1m` bold, `\x1b[2m` dim, `\x1b[0m` reset. No need to pull in a styling library unless one is already in the project.
+2. **Keyboard shortcuts**, listed at the bottom: `[a] add user [d] delete user [t] tick clock [q] quit`. Bold the key, dim the description, or vice-versa — whatever reads cleanly.
+
+Behaviour:
+
+1. **Initialise state** — a single in-memory object/struct. Render the first frame on start.
+2. **Read one keystroke (or one line)** at a time, dispatch to a handler that mutates state.
+3. **Re-render** the full frame after every action — don't append, replace.
+4. **Loop until quit.**
+
+The whole frame should fit on one screen.
+
+### 5. Make it runnable in one command
+
+Add a script to the project's existing task runner (`package.json` scripts, `Makefile`, `justfile`, `pyproject.toml`). The user should run `pnpm run ` or equivalent — never need to remember a path.
+
+If the host project has no task runner, just put the command at the top of the prototype's README.
+
+### 6. Hand it over
+
+Give the user the run command. They'll drive it themselves; the interesting moments are when they say "wait, that shouldn't be possible" or "huh, I assumed X would be different" — those are the bugs in the _idea_, which is the whole point. If they want new actions added, add them. Prototypes evolve.
+
+### 7. Capture the answer
+
+When the prototype has done its job, the answer to the question is the only thing worth keeping. If the user is around, ask what it taught them. If not, leave a `NOTES.md` next to the prototype so the answer can be filled in (or filled in by you, if you've watched the session) before the prototype gets deleted.
+
+## Anti-patterns
+
+- **Don't add tests.** A prototype that needs tests is no longer a prototype.
+- **Don't wire it to the real database.** Use an in-memory store unless the question is specifically about persistence.
+- **Don't generalise.** No "what if we wanted to support X later." The prototype answers one question.
+- **Don't blur the logic and the TUI together.** If the reducer / state machine references `console.log`, prompts, or terminal escape codes, it's no longer portable. Keep the TUI as a thin shell over a pure module.
+- **Don't ship the TUI shell into production.** The shell is optimised for being driven by hand from a terminal. The logic module behind it is the bit worth keeping.
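
The split between a pure logic module and a throwaway shell, as laid out in steps 3 and 4, might look like the sketch below. It assumes a made-up state model (`users` plus a `clock`) matching the example shortcut line; `reduce`, `renderFrame`, `keyToAction`, and `drive` are hypothetical names, and `drive` replays a scripted list of keys where a real shell would read keystrokes from stdin.

```typescript
// Hypothetical state model; the shortcuts mirror the example frame in step 4.
type State = { users: string[]; clock: number };
type Action = { type: "addUser"; name: string } | { type: "deleteUser" } | { type: "tick" };

// Pure logic module: no I/O, no terminal codes. This is the part worth lifting out later.
function reduce(state: State, action: Action): State {
  switch (action.type) {
    case "addUser":
      return { ...state, users: [...state.users, action.name] };
    case "deleteUser":
      return { ...state, users: state.users.slice(0, -1) };
    case "tick":
      return { ...state, clock: state.clock + 1 };
  }
}

// Throwaway shell: ANSI styling and key handling live only here.
const BOLD = "\x1b[1m";
const DIM = "\x1b[2m";
const RESET = "\x1b[0m";
const CLEAR = "\x1b[2J\x1b[H"; // clear the screen, then move the cursor home

// Build one full frame: current state first, shortcuts at the bottom.
function renderFrame(state: State): string {
  const shortcuts = [["a", "add user"], ["d", "delete user"], ["t", "tick clock"], ["q", "quit"]]
    .map(([key, label]) => `${BOLD}[${key}]${RESET} ${DIM}${label}${RESET}`)
    .join("  ");
  return [
    `${CLEAR}${BOLD}users${RESET}: ${state.users.join(", ") || "(none)"}`,
    `${BOLD}clock${RESET}: ${DIM}${state.clock}${RESET}`,
    "",
    shortcuts,
  ].join("\n");
}

// Map a keystroke to an action; null means the key has no action (quit is handled by the loop).
function keyToAction(key: string, state: State): Action | null {
  if (key === "a") return { type: "addUser", name: `user${state.users.length + 1}` };
  if (key === "d") return { type: "deleteUser" };
  if (key === "t") return { type: "tick" };
  return null;
}

// Scripted stand-in for the keystroke loop.
function drive(keys: string[]): State {
  let state: State = { users: [], clock: 0 };
  for (const key of keys) {
    if (key === "q") break;
    const action = keyToAction(key, state);
    if (action) state = reduce(state, action);
  }
  return state;
}

const finalState = drive(["a", "a", "t", "d", "q", "a"]);
console.log(renderFrame(finalState));
```

Only `reduce` and its two types would survive into the real codebase; the rendering and key mapping are the shell that gets deleted once the question is answered.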
diff --git a/scripts/workstation/claude-skills/prototype/SKILL.md b/scripts/workstation/claude-skills/prototype/SKILL.md
new file mode 100644
index 00000000..64f3e611
--- /dev/null
+++ b/scripts/workstation/claude-skills/prototype/SKILL.md
@@ -0,0 +1,30 @@
+---
+name: prototype
+description: Build a throwaway prototype to flesh out a design before committing to it. Routes between two branches — a runnable terminal app for state/business-logic questions, or several radically different UI variations toggleable from one route. Use when the user wants to prototype, sanity-check a data model or state machine, mock up a UI, explore design options, or says "prototype this", "let me play with it", "try a few designs".
+---
+
+# Prototype
+
+A prototype is **throwaway code that answers a question**. The question decides the shape.
+
+## Pick a branch
+
+Identify which question is being answered — from the user's prompt, the surrounding code, or by asking if the user is around:
+
+- **"Does this logic / state model feel right?"** → [LOGIC.md](LOGIC.md). Build a tiny interactive terminal app that pushes the state machine through cases that are hard to reason about on paper.
+- **"What should this look like?"** → [UI.md](UI.md). Generate several radically different UI variations on a single route, switchable via a URL search param and a floating bottom bar.
+
+The two branches produce very different artifacts — getting this wrong wastes the whole prototype. If the question is genuinely ambiguous and the user isn't reachable, default to whichever branch better matches the surrounding code (a backend module → logic; a page or component → UI) and state the assumption at the top of the prototype.
+
+## Rules that apply to both
+
+1. **Throwaway from day one, and clearly marked as such.** Locate the prototype code close to where it will actually be used (next to the module or page it's prototyping for) so context is obvious — but name it so a casual reader can see it's a prototype, not production. For throwaway UI routes, obey whatever routing convention the project already uses; don't invent a new top-level structure.
+2. **One command to run.** Whatever the project's existing task runner supports — `pnpm `, `python `, `bun `, etc. The user must be able to start it without thinking.
+3. **No persistence by default.** State lives in memory. Persistence is the thing the prototype is _checking_, not something it should depend on. If the question explicitly involves a database, hit a scratch DB or a local file with a clear "PROTOTYPE — wipe me" name.
+4. **Skip the polish.** No tests, no error handling beyond what makes the prototype _runnable_, no abstractions. The point is to learn something fast and then delete it.
+5. **Surface the state.** After every action (logic) or on every variant switch (UI), print or render the full relevant state so the user can see what changed.
+6. **Delete or absorb when done.** When the prototype has answered its question, either delete it or fold the validated decision into the real code — don't leave it rotting in the repo.
+
+## When done
+
+The _answer_ is the only thing worth keeping from a prototype. Capture it somewhere durable (commit message, ADR, issue, or a `NOTES.md` next to the prototype) along with the question it was answering. If the user is around, that capture is a quick conversation; if not, leave the placeholder so they (or you, on the next pass) can fill in the verdict before deleting the prototype.
diff --git a/scripts/workstation/claude-skills/prototype/UI.md b/scripts/workstation/claude-skills/prototype/UI.md
new file mode 100644
index 00000000..f3b6e640
--- /dev/null
+++ b/scripts/workstation/claude-skills/prototype/UI.md
@@ -0,0 +1,112 @@
+# UI Prototype
+
+Generate **several radically different UI variations** on a single route, switchable from a floating bottom bar. The user flips between variants in the browser, picks one (or steals bits from each), then throws the rest away.
+
+If the question is about logic/state rather than what something looks like — wrong branch. Use [LOGIC.md](LOGIC.md).
+
+## When this is the right shape
+
+- "What should this page look like?"
+- "I want to see a few options for this dashboard before committing."
+- "Try a different layout for the settings screen."
+- Any time the user would otherwise spend a day picking between three vague mockups in their head.
+
+## Two sub-shapes — strongly prefer sub-shape A
+
+A UI prototype is much easier to judge when it's **butting up against the rest of the app** — real header, real sidebar, real data, real density. A throwaway route on its own is a vacuum: every variant looks fine in isolation. Default to sub-shape A whenever there's a plausible existing page to host the variants. Only reach for sub-shape B if the prototype genuinely has no nearby home.
+
+### Sub-shape A — adjustment to an existing page (preferred)
+
+The route already exists. Variants are rendered **on the same route**, gated by a `?variant=` URL search param. The existing data fetching, params, and auth all stay — only the rendering swaps. This is the default; pick it unless there's a specific reason not to.
+
+If the prototype is for something that doesn't yet have a page but *would naturally live inside one* (a new section of the dashboard, a new card on the settings screen, a new step in an existing flow) — that's still sub-shape A. Mount the variants inside the host page.
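
Gating on the `?variant=` search param can be kept framework-independent with two small helpers, sketched below. The names `resolveVariant` and `withVariant`, the `A`/`B`/`C` keys, and the fallback to `A` are assumptions made for illustration; `URLSearchParams` is the standard Web API available in browsers and Node.

```typescript
// Hypothetical helpers; the variant keys and the "A" default are assumptions.
const VARIANTS = ["A", "B", "C"] as const;
type Variant = (typeof VARIANTS)[number];

// Read ?variant= from a search string, falling back to "A" for missing or unknown values.
function resolveVariant(search: string): Variant {
  const raw = new URLSearchParams(search).get("variant");
  return VARIANTS.find((v) => v === raw) ?? "A";
}

// Return a new search string with the variant set, leaving other params untouched.
function withVariant(search: string, variant: Variant): string {
  const params = new URLSearchParams(search);
  params.set("variant", variant);
  return `?${params.toString()}`;
}

console.log(resolveVariant("?tab=billing&variant=B"), withVariant("?tab=billing", "C"));
// prints "B ?tab=billing&variant=C"
```

Falling back on unknown values keeps a stale or mistyped link rendering something rather than a blank page, and preserving the other params is what lets the host page's existing data fetching stay untouched.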
+
+### Sub-shape B — a new page (last resort)
+
+Only use this when the thing being prototyped genuinely has no existing page to live inside — e.g. an entirely new top-level surface, or a flow that can't be embedded anywhere sensible.
+
+Create a **throwaway route** following whatever routing convention the project already uses — don't invent a new top-level structure. Name it so it's obviously a prototype (e.g. include the word `prototype` in the path or filename). Same `?variant=` pattern.
+
+Before committing to sub-shape B, sanity-check: is there really no existing page this could be embedded in? An empty route hides design problems that a populated one would expose.
+
+In both sub-shapes the floating bottom bar is identical.
+
+## Process
+
+### 1. State the question and pick N
+
+Default to **3 variants**. More than 5 stops being radically different and starts being noise — cap there.
+
+Write down the plan in one line, in the prototype's location or a top-of-file comment:
+
+> "Three variants of the settings page, switchable via `?variant=`, on the existing `/settings` route."
+
+This works whether the user is here to push back or not.
+
+### 2. Generate radically different variants
+
+Draft each variant. Hold each one to:
+
+- The page's purpose and the data it has access to.
+- The project's component library / styling system (TailwindCSS, shadcn, MUI, plain CSS, whatever).
+- A clear exported component name, e.g. `VariantA`, `VariantB`, `VariantC`.
+
+Variants must be **structurally different** — different layout, different information hierarchy, different primary affordance, not just different colours. Three slightly-tweaked card grids isn't a UI prototype, it's wallpaper. If two drafts come out too similar, redo one with explicit "do not use a card grid" guidance.
+
+### 3. Wire them together
+
+Create a single switcher component on the route:
+
+```tsx
+// pseudo-code — adapt to the project's framework
+const variant = searchParams.get('variant') ?? 'A';
+return (
+  <>
+    {variant === 'A' && <VariantA />}
+    {variant === 'B' && <VariantB />}
+    {variant === 'C' && <VariantC />}
+    {/* the floating switcher bar from step 4 mounts here */}
+  </>
+);
+```
+
+For sub-shape A (existing page): keep all the existing data fetching above the switcher; only the rendered subtree changes per variant.
+
+For sub-shape B (new page): the throwaway route under `/prototype/` mounts the same switcher.
+
+### 4. Build the floating switcher
+
+A small fixed-position bar at the bottom-centre of the screen with three pieces:
+
+- **Left arrow** — cycles to the previous variant (wraps around).
+- **Variant label** — shows the current variant key and, if the variant exports a name, that name too. e.g. `B — Sidebar layout`.
+- **Right arrow** — cycles forward (wraps around).
+
+Behaviour:
+
+- Clicking an arrow updates the URL search param (use the framework's router — `router.replace` on Next, `navigate` on React Router, etc) so the variant is shareable and reload-stable.
+- Keyboard: `←` and `→` arrow keys also cycle. Don't intercept arrow keys when an ``, `