From 6e4db0ddc61b854aef6611bc10d6e945d8dbe0e8 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sat, 13 Jun 2026 12:36:10 +0000 Subject: [PATCH 01/36] openclaw + f1-stream: last forgejo image refs -> ghcr (ADR-0002 #32 prep) openclaw's install-nextcloud-todos-plugin init still pulled forgejo nextcloud-todos (would ImagePullBackOff on restart once the forgejo registry is wiped) -> ghcr:latest. f1-stream stack base (KEEL_IGNORE'd, live already ghcr via set-image) repointed for fresh-create correctness. Clears the last LIVE forgejo viktor/* refs before the registry reclaim. Co-Authored-By: Claude Fable 5 --- stacks/f1-stream/main.tf | 2 +- stacks/openclaw/main.tf | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/stacks/f1-stream/main.tf b/stacks/f1-stream/main.tf index 7666458d..11ff8cd4 100644 --- a/stacks/f1-stream/main.tf +++ b/stacks/f1-stream/main.tf @@ -128,7 +128,7 @@ resource "kubernetes_deployment" "f1-stream" { } spec { container { - image = "forgejo.viktorbarzin.me/viktor/f1-stream:${var.image_tag}" + image = "ghcr.io/viktorbarzin/f1-stream:${var.image_tag}" image_pull_policy = "Always" name = "f1-stream" # Right-sized 2026-06-05: was 1Gi (bundled-Chromium era). The image is diff --git a/stacks/openclaw/main.tf b/stacks/openclaw/main.tf index 5a142af2..925c6675 100644 --- a/stacks/openclaw/main.tf +++ b/stacks/openclaw/main.tf @@ -553,7 +553,7 @@ resource "kubernetes_deployment" "openclaw" { # IfNotPresent: a cached stale :latest meant the plugin manifest # (configSchema fix) never got pulled. An uncached SHA forces the # pull. Bump this when the openclaw plugin in nextcloud-todos changes. 
- image = "forgejo.viktorbarzin.me/viktor/nextcloud-todos:f85c6de1" + image = "ghcr.io/viktorbarzin/nextcloud-todos:latest" image_pull_policy = "Always" command = ["sh", "-c", <<-EOT set -eu From 3e82c64a7659e1aa8d2a108bc2edb806f41605c5 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sat, 13 Jun 2026 12:55:49 +0000 Subject: [PATCH 02/36] docs: sync CI/CD docs to ADR-0002 final state (ghcr + Woodpecker deploy-only) [ci skip] MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ADR-0002 is fully landed (issues #11-#32 closed): every owned image now builds on GitHub Actions and pushes to ghcr.io/viktorbarzin/, with Woodpecker reduced to deploy-only. The Forgejo container registry is frozen and emptied; there are no in-cluster image builds or CI test runs anywhere. The docs still described the old hybrid topology (DockerHub builds, Woodpecker-native owned-app builds, the per-pattern migration lists, the tripit-only pilot framing), which would mislead future sessions and incident response. This brings the docs to the completed reality (closes #33): - docs/architecture/ci-cd.md: full rewrite as the canonical CI/CD reference — the fleet GHA->ghcr->Woodpecker-deploy pattern, public/private ghcr package split, infra-owned image workflows (incl. infra-ci on ghcr), the frozen Forgejo registry, what Woodpecker still runs, and the #31 decommissions. - .claude/CLAUDE.md: rewrite the "CI/CD Architecture" section to the fleet-wide final state; FIX the stale claim that claude-memory-mcp builds to DockerHub (it is GHA->ghcr); note owned images now live on ghcr and the Forgejo registry is frozen/break-glass near the image-registry bullet. - .claude/reference/service-catalog.md: f1-stream is GHA->ghcr + Woodpecker deploy-only (was "Woodpecker-native build->deploy"). - stacks/{tuya-bridge,android-emulator}/variables.tf + stacks/terminal/main.tf: cosmetic description/comment updates (forgejo -> ghcr; terminal-lobby has no CI pipeline). 
Description/comment text only — no stack logic changed. Historical records (docs/post-mortems/*, docs/plans/*) and ADR-0002 itself are left untouched as point-in-time records. Co-Authored-By: Claude Fable 5 --- .claude/CLAUDE.md | 131 ++++--- .claude/reference/service-catalog.md | 2 +- docs/architecture/ci-cd.md | 530 ++++++++++++++------------- stacks/android-emulator/variables.tf | 2 +- stacks/terminal/main.tf | 7 +- stacks/tuya-bridge/variables.tf | 2 +- 6 files changed, 379 insertions(+), 295 deletions(-) diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index d0bc9444..37ab99f3 100755 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -38,7 +38,7 @@ Violations cause state drift, which causes future applies to break or silently r - **DNS**: `dns_type = "proxied"` (Cloudflare CDN) or `"non-proxied"` (direct A/AAAA). DNS records are auto-created — no need to edit `config.tfvars`. Smoke-test target: `echo.viktorbarzin.me` (auth=public, header-reflecting backend). - **Anubis PoW challenge** (`modules/kubernetes/anubis_instance/`): per-site reverse proxy that issues a 30-day JWT cookie after a tiny PoW solve. Use for **public, content-bearing sites without app-level auth** (blog, docs, wikis, static landing pages). Pattern: declare `module "anubis" { source = "../../modules/kubernetes/anubis_instance"; name = "X"; namespace = ...; target_url = "http://..svc.cluster.local" }`, then in `ingress_factory` set `service_name = module.anubis.service_name`, `port = module.anubis.service_port`, `anti_ai_scraping = false`. Shared ed25519 key in Vault `secret/viktor` -> `anubis_ed25519_key`; cookie scoped to `viktorbarzin.me` so one solve covers all Anubis-fronted subdomains. **DO NOT put Anubis in front of Git/API/WebDAV/CLI endpoints** — clients without JS can't solve PoW. **Replicas default to 1** because Anubis stores in-flight challenges in process memory; a challenge issued by pod A and solved against pod B errors with `store: key not found` (HTTP 500). 
Bumping replicas requires wiring a shared Redis store (TODO). For path-level carve-outs (e.g. wrongmove has `/` behind Anubis but `/api` direct, blog has `/net-diag.sh` direct), declare a second `ingress_factory` with `ingress_path = ["/"]` pointing at the bare backend service. Active on: blog (except `/net-diag.sh`), www, kms, travel, f1, cc, json, pb (privatebin), home (homepage), wrongmove (UI only). See `.claude/reference/patterns.md` "Anti-AI Scraping" for full layering. - **Docker images**: Always build for `linux/amd64`. SHA-tag rule is being phased out — see `docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md`. New model: CI pushes `:latest` (optionally also `:<8-char-sha>` for traceability), Keel polls and triggers rollouts. Cache-staleness concern from the old rule is resolved at the nginx layer (URL-split — manifests pass through, blobs cached). Until Phase 1 of the migration completes (per the plan), follow the SHA-tag rule for new services to match existing pattern. -- **Private registry**: `forgejo.viktorbarzin.me/viktor/` (Forgejo packages, OAuth-style PAT auth). Use `image: forgejo.viktorbarzin.me/viktor/:` + `imagePullSecrets: [{name: registry-credentials}]`. Kyverno auto-syncs the Secret to all namespaces. **Kubelet pulls** are kept off the hairpin **at the resolver, with zero node-side DNS config**: pfSense Unbound carries a domain override forwarding the whole `viktorbarzin.me` zone to Technitium (added 2026-06-10, `docs/runbooks/pfsense-unbound.md`), whose split-horizon zone CNAMEs every ingress host (auto-synced hourly by `technitium-ingress-dns-sync`) to the zone apex whose A record tracks the **live** Traefik LB IP (canary: `viktorbarzin-apex-probe`, alerts ViktorBarzinApexDrift). Nodes are stock — link DNS `10.0.20.1 94.140.14.14` via `qm set --nameserver`, no `/etc/hosts` pins, no resolved drop-ins (two same-day interim approaches on 2026-06-10 were removed the same day). 
The containerd `hosts.toml` mirror (`[host."https://10.0.20.203"]`, `skip_verify = true`) still exists but is **vestigial** — it can NOT keep pulls internal on its own: Traefik routes by Host/SNI and 404s the mirror's bare-IP requests, and the registry's Bearer auth realm is the absolute `https://forgejo.viktorbarzin.me/v2/token` URL fetched outside the mirror — without internal DNS every fresh pull degrades to public DNS → hairpin → intermittent `dial tcp 176.12.22.76:443: i/o timeout` ImagePullBackOff (tuya-bridge 7.5h outage 2026-06-10, tripit 2026-06-09; see `docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md`). **In-cluster pods are ordinary internal clients too** (since 2026-06-10 evening) — CoreDNS's dedicated `viktorbarzin.me:53` block (Corefile in `stacks/technitium/modules/technitium/main.tf`) forwards to the Technitium ClusterIP `10.96.0.53`, so pods get the same split-horizon answers as everyone else; forgejo stays pinned to Traefik's **ClusterIP** in that block (TF-interpolated from the live Service) so CI pushes survive a Technitium outage. This relies on a k8s-1.34 behavior verified 2026-06-10: **pods CAN reach the ETP=Local Traefik LB IP** (kube-proxy short-circuits in-cluster traffic to LB IPs via the cluster path) — re-verify after major k8s upgrades; canary = the uptime-kuma `[External]` fleet going red. (The block briefly forwarded to `8.8.8.8/1.1.1.1` earlier that day, which kept pods on the WAN IP and the broken TP-Link NAT loopback — 27 non-proxied `[External]` monitors dark; beads code-yh33.) **Was `.200` until 2026-06-01** — Traefik's 2026-05-30 move to its dedicated `.203` left the mirror pointing at the now-dead `.200:443`, silently breaking every *fresh* forgejo pull; a future LB renumber is now handled by DNS (apex record + drift probe) — only the vestigial hosts.toml literal would go stale. 
Mirror source lives in `modules/create-template-vm/k8s-node-containerd-setup.sh` (new nodes) and `scripts/setup-forgejo-containerd-mirror.sh` (existing nodes; also cleans up the legacy 2026-06-10 node-DNS customization). Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest` + any buildkit `*cache*` tag — **REVERTED to DRY_RUN 2026-06-10 after its first live run orphaned OCI index children** (multi-arch/attestation children are separate *untagged* sha256 versions that sort outside the newest-10 window while their parent index is kept; broke `kms-website:latest`+`:dfc83fb`, caught by the integrity probe, healed by re-tagging latest→a794d1a + deleting the corrupt version; see `docs/post-mortems/2026-06-10-forgejo-retention-orphaned-indexes.md`). Do NOT re-enable deletes until the keep-set resolves kept indexes' child digests (or skips untagged versions, or moves to Forgejo's native container-aware cleanup rules). The registry PVC remains at its 50Gi autoresize ceiling on the HDD (we did NOT move it to SSD, see beads code-oflt), so a container-aware retention is still needed. Integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07. 
+- **Image registry**: **Owned images now live on `ghcr.io/viktorbarzin/`** (ADR-0002, built by GHA — see the CI/CD Architecture section). The **Forgejo container registry is FROZEN + emptied** (break-glass only — `docs/runbooks/forgejo-registry-breakglass.md`); nothing pushes to it. The rest of this bullet documents the **still-live forgejo-pull DNS/mirror machinery** (it remains in place for the break-glass path + because `registry-credentials` is still Kyverno-synced; the hairpin lessons apply to any internal-registry pull). Historical usage was `image: forgejo.viktorbarzin.me/viktor/:` + `imagePullSecrets: [{name: registry-credentials}]`. **Kubelet pulls** are kept off the hairpin **at the resolver, with zero node-side DNS config**: pfSense Unbound carries a domain override forwarding the whole `viktorbarzin.me` zone to Technitium (added 2026-06-10, `docs/runbooks/pfsense-unbound.md`), whose split-horizon zone CNAMEs every ingress host (auto-synced hourly by `technitium-ingress-dns-sync`) to the zone apex whose A record tracks the **live** Traefik LB IP (canary: `viktorbarzin-apex-probe`, alerts ViktorBarzinApexDrift). Nodes are stock — link DNS `10.0.20.1 94.140.14.14` via `qm set --nameserver`, no `/etc/hosts` pins, no resolved drop-ins (two same-day interim approaches on 2026-06-10 were removed the same day). The containerd `hosts.toml` mirror (`[host."https://10.0.20.203"]`, `skip_verify = true`) still exists but is **vestigial** — it can NOT keep pulls internal on its own: Traefik routes by Host/SNI and 404s the mirror's bare-IP requests, and the registry's Bearer auth realm is the absolute `https://forgejo.viktorbarzin.me/v2/token` URL fetched outside the mirror — without internal DNS every fresh pull degrades to public DNS → hairpin → intermittent `dial tcp 176.12.22.76:443: i/o timeout` ImagePullBackOff (tuya-bridge 7.5h outage 2026-06-10, tripit 2026-06-09; see `docs/post-mortems/2026-06-10-tuya-bridge-forgejo-pull-hairpin.md`). 
**In-cluster pods are ordinary internal clients too** (since 2026-06-10 evening) — CoreDNS's dedicated `viktorbarzin.me:53` block (Corefile in `stacks/technitium/modules/technitium/main.tf`) forwards to the Technitium ClusterIP `10.96.0.53`, so pods get the same split-horizon answers as everyone else; forgejo stays pinned to Traefik's **ClusterIP** in that block (TF-interpolated from the live Service) so CI pushes survive a Technitium outage. This relies on a k8s-1.34 behavior verified 2026-06-10: **pods CAN reach the ETP=Local Traefik LB IP** (kube-proxy short-circuits in-cluster traffic to LB IPs via the cluster path) — re-verify after major k8s upgrades; canary = the uptime-kuma `[External]` fleet going red. (The block briefly forwarded to `8.8.8.8/1.1.1.1` earlier that day, which kept pods on the WAN IP and the broken TP-Link NAT loopback — 27 non-proxied `[External]` monitors dark; beads code-yh33.) **Was `.200` until 2026-06-01** — Traefik's 2026-05-30 move to its dedicated `.203` left the mirror pointing at the now-dead `.200:443`, silently breaking every *fresh* forgejo pull; a future LB renumber is now handled by DNS (apex record + drift probe) — only the vestigial hosts.toml literal would go stale. Mirror source lives in `modules/create-template-vm/k8s-node-containerd-setup.sh` (new nodes) and `scripts/setup-forgejo-containerd-mirror.sh` (existing nodes; also cleans up the legacy 2026-06-10 node-DNS customization). Push-side: viktor PAT in Vault `secret/ci/global/forgejo_push_token` (Forgejo container packages are scoped per-user; only the package owner can push, ci-pusher cannot write to viktor/*). Pull-side: cluster-puller PAT in Vault `secret/viktor/forgejo_pull_token`. 
Retention CronJob (`forgejo-cleanup` in `forgejo` ns, daily 04:00) keeps newest 10 versions + always `:latest` + any buildkit `*cache*` tag — **REVERTED to DRY_RUN 2026-06-10 after its first live run orphaned OCI index children** (multi-arch/attestation children are separate *untagged* sha256 versions that sort outside the newest-10 window while their parent index is kept; broke `kms-website:latest`+`:dfc83fb`, caught by the integrity probe, healed by re-tagging latest→a794d1a + deleting the corrupt version; see `docs/post-mortems/2026-06-10-forgejo-retention-orphaned-indexes.md`). Do NOT re-enable deletes until the keep-set resolves kept indexes' child digests (or skips untagged versions, or moves to Forgejo's native container-aware cleanup rules). The registry PVC remains at its 50Gi autoresize ceiling on the HDD (we did NOT move it to SSD, see beads code-oflt), so a container-aware retention is still needed. Integrity probed every 15min by `forgejo-integrity-probe` in `monitoring` ns (catalog walk + manifest HEAD on every blob). See `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md` for the migration history. Pull-through caches for upstream registries (DockerHub, GHCR, Quay, k8s.gcr, Kyverno) stay on the registry VM at `10.0.20.10` ports 5000/5010/5020/5030/5040 — the old port-5050 R/W private registry was decommissioned 2026-05-07. - **LinuxServer.io containers**: `DOCKER_MODS` runs apt-get on every start — bake slow mods into a custom image (`RUN /docker-mods || true` then `ENV DOCKER_MODS=`). Set `NO_CHOWN=true` to skip recursive chown that hangs on NFS mounts. - **Node memory changes**: When changing VM memory on any k8s node, update kubelet `systemReserved`, `kubeReserved`, and eviction thresholds accordingly. Config: `/var/lib/kubelet/config.yaml`. Template: `stacks/infra/main.tf`. Current values: systemReserved=512Mi, kubeReserved=512Mi, evictionHard=500Mi, evictionSoft=1Gi. 
- **Node OS disk tuning** (in `stacks/infra/main.tf`): kubelet `imageGCHighThresholdPercent=70` (was 85), `imageGCLowThresholdPercent=60` (was 80), ext4 `commit=60` in fstab (was default 5s), journald `SystemMaxUse=200M` + `MaxRetentionSec=3day`. @@ -87,62 +87,103 @@ Violations cause state drift, which causes future applies to break or silently r - **Pin database versions**: Disable Diun (image update monitoring) for MySQL, PostgreSQL, Redis. - **Quarterly right-sizing**: Run `krr` (Dockerized, against Prometheus) for recommendations; compare to current requests and adjust in TF. (Goldilocks dashboard removed 2026-06-12.) -## CI/CD Architecture — GHA Builds + Woodpecker Deploy +## CI/CD Architecture — GHA Builds → ghcr + Woodpecker Deploy -**Doctrine (ADR-0002): leverage external infra for ALL CI compute.** Builds, -tests, lint, and release jobs run on GitHub Actions hosted runners (public -repos: unlimited free; private: 2000 free min/mo) — never on cluster nodes. -In-cluster pipelines are reserved for cluster-touching steps only: Woodpecker -deploys (`kubectl set image`), terragrunt applies, certbot. Do not -(re)introduce in-cluster image builds or CI test runs — the fallback-build -pattern was deliberately removed (clean cut). **Watch what you trigger**: -after any push that fires a build chain, monitor it to completion (GHA run → -Woodpecker deploy → `rollout status`) and fix failures immediately; verify -via live state, not the checkmark. Fleet migration: PRD infra#10 (ADR-0002). +**Doctrine (ADR-0002, fleet-wide as of 2026-06-13): ALL image builds + CI +compute run OFF-infra.** Every owned image is built/linted/tested on GitHub +Actions (public repos: free; private: 2000 free min/mo) and pushed to +`ghcr.io/viktorbarzin/`. **No in-cluster image builds or CI test runs +exist anywhere** — the in-cluster Woodpecker buildkit and the fallback-build +pattern were removed (clean cut). Woodpecker is **deploy-only** (plus infra +applies + maintenance crons). 
Canonical CI/CD reference: +`docs/architecture/ci-cd.md`; decision: `docs/adr/0002-all-image-builds-off-infra-gha-ghcr.md`. +**Watch what you trigger**: after a push that fires a build chain, follow it to +completion (GHA run → Woodpecker deploy → `rollout status`) and fix failures; +verify via live state, not the checkmark. -**Owned-app deploy model (build triggers the rollout — 2026-06-02):** For -self-hosted apps **we build** (Forgejo `viktor/` + Dockerfile + -`.woodpecker.yml`), the build pipeline ALSO drives the rollout — atomic + -deterministic, no wait for Keel's poll. Pattern (`build-and-push` tags `latest` -+ `${CI_COMMIT_SHA:0:8}`, then a `deploy` step): `kubectl set image -deployment/ =:${CI_COMMIT_SHA:0:8} -n ` + -`kubectl rollout status ... --timeout=300s`. The `woodpecker-agent` SA is -`cluster-admin`, so the `bitnami/kubectl` step needs no kubeconfig/RBAC (uses -its in-cluster SA). **Keel stays enrolled in parallel** as a redundant net -(finds the deployed SHA already running → no-op). Requires the Deployment to -have `ignore_changes` on `…container[0].image` (KEEL_IGNORE_IMAGE) so CI -`set image` doesn't fight `terragrunt apply`. CronJobs in owned apps use -`:latest` + `imagePullPolicy: Always` (fresh pod each run) instead of a deploy -step. **Never** `set image`/`rollout restart` operator-managed StatefulSets -(memory id=740). Reference impls: `tuya_bridge/.woodpecker.yml`, -`job-hunter`, `f1-stream` (viktor/f1-stream, extracted from this monorepo -2026-06-05). This reverses decision #12 of -`docs/plans/2026-05-16-auto-upgrade-apps-design.md` for owned (not upstream) -images. +**The fleet pattern (every owned app):** Forgejo `viktor/` (canonical) +push-mirrors (`sync_on_commit`) → GitHub `ViktorBarzin/` → GHA +`.github/workflows/build.yml` (committed on Forgejo, mirrors over): `on: push: +branches:[master]` ONLY (feature branches mirror but build/deploy nothing — the +safety valve). 
The `build` job: lint/test → `svu` cuts the next `vX.Y.Z` tag to +CANONICAL Forgejo (GHA secret `FORGEJO_GIT_TOKEN` = write:repository PAT) + bakes +`VERSION` → `buildx` `linux/amd64` `provenance:false` (single-manifest, dodges +the orphaned-index-children class) → push `ghcr.io/viktorbarzin/:` + +`:latest` → `delete-package-versions` keep-10. The `deploy` job POSTs +`ci.viktorbarzin.me/api/repos//pipelines` (the GitHub-mirror's Woodpecker +registration, github-forge; GHA secret `WOODPECKER_TOKEN`) with `IMAGE_TAG` + +`IMAGE_NAME` → `.woodpecker/deploy.yml` (event:**manual** ONLY, so the raw +Forgejo→GitHub mirror pushes don't fire a tag-less deploy) runs `kubectl set +image deployment/ …` in-cluster (woodpecker-agent SA = cluster-admin, no +kubeconfig). Deployment image is `ignore_changes`/KEEL_IGNORE_IMAGE so the SHA +sticks vs `terragrunt apply`; CronJobs track `:latest` + `imagePullPolicy: +Always`. **Keel stays enrolled** as a redundant net (sees the SHA already +running → no-op). **Never** `set image`/`rollout restart` operator-managed +StatefulSets (memory id=740). Onboarding tool: `scripts/offinfra-onboard` + +`scripts/offinfra-templates/`; mirror + workflow commits via the Forgejo API over +the internal Traefik LB (`curl --resolve forgejo.viktorbarzin.me:443:10.0.20.203`). +Reference impls: tripit (the original pilot), f1-stream, job-hunter, tuya_bridge. -**Flow (GHA-migrated apps)**: `git push → GHA build+push DockerHub (8-char SHA) → POST Woodpecker API → kubectl set image` +**Migrated apps (issues #13–#27):** f1-stream, job-hunter, tuya_bridge, +beadboard, nextcloud-todos, claude-agent-service, **claude-memory-mcp** (GHA → +ghcr, NOT DockerHub), kms-website, Freedify, instagram-poster, payslip-ingest, +broker-sync (image `wealthfolio-sync`), fire-planner, recruiter-responder, +x402-gateway — plus tripit. 
Earlier public-repo apps already on GHA (Website, +k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify, +audiobook-search, council-complaints) now also land on ghcr. +- **PUBLIC ghcr packages:** beadboard, nextcloud-todos, claude-agent-service, + claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, + chrome-service-novnc, android-emulator. +- **PRIVATE ghcr:** f1-stream, job-hunter, instagram-poster, payslip-ingest, + wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, + infra-ci. Pulled via the Kyverno-synced `ghcr-credentials` allowlist + (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`; NOT cluster-wide; cred + = Vault `secret/viktor/ghcr_pull_token`, an alias of the admin `github_pat` — + GitHub has no token-mint API, swap the alias value if a scoped token is ever + UI-minted). -**Migrated to GHA** (9): Website, k8s-portal, claude-memory-mcp, apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints -**Woodpecker-native owned-app build** (Forgejo registry, build->deploy in one `.woodpecker.yml`): tuya_bridge, job-hunter, f1-stream (extracted to viktor/f1-stream 2026-06-05; Woodpecker repo id 166; the old github source is archived + its GHA repo-id-10 deactivated) -**Woodpecker-only**: travel_blog (1.4GB content too large for GHA), infra pipelines (terragrunt apply, certbot, build-cli — need cluster access) -**Private Forgejo repo → off-infra GHA → GHCR** (NEW 2026-06-09 — gentler builds: keeps build IO **and** the registry push OFF the homelab/sdc; replaces in-cluster Woodpecker buildkit for private repos): **tripit** is the pilot. Forgejo `viktor/tripit` (canonical) push-mirrors → PRIVATE `ViktorBarzin/tripit` GitHub repo (`sync_on_commit`); `.github/workflows/build.yml` (committed on Forgejo, mirrors over) builds + pushes `ghcr.io/viktorbarzin/tripit:+latest` on GHA (free, ~2min, GHA-native cache). 
Cluster pulls of PRIVATE ghcr images use the `ghcr-credentials` dockerconfigjson, cloned by the kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit ALLOWLIST of private-ghcr namespaces only (ADR-0002; source `stacks/kyverno/modules/kyverno/ghcr-credentials.tf`; cred = Vault `secret/viktor/ghcr_pull_token`, currently an alias of the admin `github_pat` — GitHub has no token-mint API, swap the alias value if a scoped token is ever UI-minted). **Auto-deploy** (verified 2026-06-09): the GHA `deploy` job POSTs `ci.viktorbarzin.me/api/repos/167/pipelines` (Woodpecker repo **167** = the GitHub mirror, registered github-forge; GHA secret `WOODPECKER_TOKEN`) with `IMAGE_TAG`+`IMAGE_NAME` → `.woodpecker/deploy.yml` (event:**manual** ONLY, so the Forgejo→GitHub mirror's raw pushes don't fire a tag-less deploy) runs `kubectl set image deployment/tripit tripit=… alembic-migrate=…` in-cluster (woodpecker-agent SA = cluster-admin, no kubeconfig). Image is KEEL_IGNORE_IMAGE so the SHA tag sticks; worker CronJobs track `:latest`. **Semver** (parallel layer): the GHA `build` job runs `svu` v3.4.1 over conventional commits, auto-cuts the next `vX.Y.Z` git tag pushed to CANONICAL Forgejo (GHA secret `FORGEJO_GIT_TOKEN` = write:repository PAT, NOT the package-scoped push token) and bakes `VERSION` → app reports it at `/api/version` (verified 0.2.1). Deploy tag stays the 8-char SHA. The old in-cluster `.woodpecker/build.yml` was DELETED (only `.woodpecker/deploy.yml` remains). GitHub default branch must be `master`. **Replicate to f1-stream, tuya_bridge, job-hunter** (currently Woodpecker-native in-cluster builds). Mirror + workflow-file commits are done via the Forgejo API over the internal Traefik LB (`curl --resolve forgejo.viktorbarzin.me:443:10.0.20.203`) since the devvm can't reach forgejo's public hairpin. 
+**Infra-owned images (issues #29/#30)** build on GHA workflows IN the infra +repo's own `.github/workflows/` (added to the GitHub lineage via PR; the +github↔forgejo divergence was deliberately NOT reconciled): +`build-chrome-service-novnc.yml` + `build-android-emulator.yml` → public ghcr; +`build-cli.yml` → DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli`; +`build-infra-ci.yml` → `ghcr.io/viktorbarzin/infra-ci`. **infra-ci** is the image +the `.woodpecker/default.yml` apply step + `drift-detection.yml` run in (proven +by pipelines 165/166). chatterbox-tts is already built by tripit's GHA → ghcr. +The Woodpecker `build-ci-image.yml` + `build-cli.yml` pipelines were REMOVED; +infra-ci break-glass is a manual `.woodpecker/breakglass-infra-ci.yml` (ghcr +pull-and-save to the registry VM). -**Per-project files**: -- `.github/workflows/build-and-deploy.yml` — GHA: checkout, build, push DockerHub, POST Woodpecker API -- `.woodpecker/deploy.yml` — Woodpecker: `kubectl set image` + Slack notify (event: `[manual, push]`) -- `.woodpecker/build-fallback.yml` — Old full build pipeline preserved (event: `deployment` — never auto-fires) +**Forgejo container registry: FROZEN + emptied** (issue #32 wiped all `viktor/*` +container packages). Break-glass-only now; nothing pushes. `forgejo-cleanup` +stays DRY_RUN. Pull-through caches on `10.0.20.10` are unchanged. Runbook: +`docs/runbooks/forgejo-registry-breakglass.md`. -**Woodpecker API**: Uses **numeric repo IDs** (`/api/repos/2/pipelines`), NOT owner/name paths (those return HTML). 
-Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handler=6, audiblez-web=9, plotting-book=43, claude-memory-mcp=78, infra-onboarding=79, council-complaints=TBD (f1-stream's old GHA-era github repo id 10 is deactivated; it's now a Woodpecker-native Forgejo build at repo id 166) +**Woodpecker now runs only:** per-app `deploy.yml` (manual, `kubectl set +image`), `default.yml` (terragrunt apply), `renew-tls.yml` (certbot), +maintenance crons (drift-detection, provision-user, registry-config-sync, +pve-nfs-exports-sync, issue-automation, postmortem-todos, k8s-portal), and the +manual `breakglass-infra-ci.yml`. **No build/test pipeline on any repo — do not +(re)introduce one.** + +**Decommissioned (issue #31):** travel_blog (stack destroyed + dir removed), 6 +dead builders' pipelines (terminal-lobby, webhook-handler, hmrc-sync, +trading-bot, travel-agent, trip-planner), and all `build-fallback.yml` files +(only Website had one). + +**Woodpecker API**: numeric repo IDs (`/api/repos//pipelines`), NOT +owner/name (those return HTML). The deploy registration for each app is the +**GitHub mirror** repo (github-forge). Infra: Forgejo forge = repo 82, legacy +GitHub forge = repo 1. **Woodpecker YAML gotchas**: - Commands with `${VAR}:${VAR}` must be **quoted** — unquoted `:` triggers YAML map parsing when vars are empty - Use `bitnami/kubectl:latest` (not pinned versions — entrypoint compatibility issues) - Global secrets must have `manual` in their events list for API-triggered pipelines -**GitHub repo secrets** (set on all repos): `DOCKERHUB_USERNAME`, `DOCKERHUB_TOKEN`, `WOODPECKER_TOKEN` - -**Infra pipelines unchanged**: `default.yml` (terragrunt apply), `renew-tls.yml` (certbot cron), `build-cli.yml` (dual registry push), `k8s-portal.yml` (path-filtered build), `provision-user.yml` — all stay on Woodpecker. 
+**GitHub repo secrets** (per repo): `WOODPECKER_TOKEN` (POST deploy pipeline), +`FORGEJO_GIT_TOKEN` (write:repository PAT for the svu tag push). ghcr push uses +the workflow's built-in `GITHUB_TOKEN` (`packages: write`). ## Database Host diff --git a/.claude/reference/service-catalog.md b/.claude/reference/service-catalog.md index 632505c0..ec78beac 100644 --- a/.claude/reference/service-catalog.md +++ b/.claude/reference/service-catalog.md @@ -47,7 +47,7 @@ | nextcloud | File sync/share | nextcloud | | calibre | E-book management (may be merged into ebooks stack) | calibre | | onlyoffice | Document editing | onlyoffice | -| f1-stream | F1 streaming (uses chrome-service for hmembeds verifier); source in own repo `viktor/f1-stream` (Forgejo, extracted 2026-06-05), Woodpecker-native build->deploy (repo id 166) | f1-stream | +| f1-stream | F1 streaming (uses chrome-service for hmembeds verifier); canonical source in own repo `viktor/f1-stream` (Forgejo, extracted 2026-06-05); GHA-built → `ghcr.io/viktorbarzin/f1-stream` (private), Woodpecker deploy-only (ADR-0002) | f1-stream | | chrome-service | Headed Chromium over CDP (`http://chrome-service.chrome-service.svc:9222`, `connect_over_cdp`; legacy `:3000/` WS pool removed 2026-06-04) for sibling services driving anti-bot pages — snapshot-harvester CronJob + tripit fare scrape | chrome-service | | rybbit | Analytics | rybbit | | isponsorblocktv | SponsorBlock for TV | isponsorblocktv | diff --git a/docs/architecture/ci-cd.md b/docs/architecture/ci-cd.md index e44df43d..c4493f86 100644 --- a/docs/architecture/ci-cd.md +++ b/docs/architecture/ci-cd.md @@ -2,334 +2,374 @@ ## Overview -The CI/CD pipeline uses a hybrid approach: GitHub Actions for building Docker images (providing free compute for public repos) and Woodpecker CI for deployments (leveraging cluster-internal access). 
Git pushes trigger GHA builds that produce Docker images with 8-character SHA tags, push to DockerHub, then POST to Woodpecker's API to trigger deployments that update Kubernetes workloads via `kubectl set image`. +**Doctrine (ADR-0002): all image builds and CI compute run OFF-infra.** Every +owned image is built, tested, and linted on **GitHub Actions** (free on public +repos; 2000 free min/mo on private) and pushed to **`ghcr.io/viktorbarzin/`**. +Woodpecker is **deploy-only** — a GHA job POSTs its API with the freshly-built +image tag and Woodpecker runs `kubectl set image` from inside the cluster. +There are **no in-cluster image builds or CI test runs anywhere** — the +in-cluster Woodpecker buildkit and the fallback-build pattern were removed as a +clean cut (ADR-0002, 2026-06-13). The Forgejo container registry is **frozen +and emptied** — break-glass only. + +This breaks the old circular dependency (images needed to repair the cluster +used to be built and stored *inside* it) and keeps build IO + registry pushes +off the homelab spindle. ## Architecture Diagram ```mermaid graph LR - A[Git Push] --> B[GitHub Actions] - B --> C[Build Docker Image
linux/amd64, 8-char SHA tag] - C --> D[Push to DockerHub] - D --> E[POST Woodpecker API] - E --> F[Woodpecker Pipeline] - F --> G[Vault K8s Auth
SA JWT] - G --> H[kubectl set image] - H --> I[K8s Deployment] - I --> J[Pull from DockerHub
or Pull-Through Cache] + A[git push Forgejo
viktor/<repo> canonical] --> B[push-mirror sync_on_commit] + B --> C[GitHub mirror
ViktorBarzin/<repo>] + C --> D[GitHub Actions
.github/workflows/build.yml] + D --> E[lint / test] + E --> F[buildx linux/amd64
provenance:false] + F --> G[push ghcr.io/viktorbarzin/<name>
:sha8 + :latest] + G --> H[svu tag -> Forgejo canonical] + G --> I[POST Woodpecker deploy repo] + I --> J[.woodpecker/deploy.yml
event: manual] + J --> K[kubectl set image
in-cluster SA cluster-admin] + K --> L[K8s Deployment
pulls from ghcr] - K[Pull-Through Cache
10.0.20.10] -.-> J - L[forgejo.viktorbarzin.me
Private Registry on Forgejo] -.-> J - - style B fill:#2088ff - style F fill:#4c9e47 - style K fill:#f39c12 + style D fill:#2088ff + style J fill:#4c9e47 + style G fill:#f39c12 ``` ## Components -| Component | Version | Location | Purpose | -|-----------|---------|----------|---------| -| GitHub Actions | Cloud | `.github/workflows/build-and-deploy.yml` | Build Docker images, push to DockerHub | -| Woodpecker CI | Self-hosted | `ci.viktorbarzin.me` | Deploy to Kubernetes cluster | -| DockerHub | Cloud | `viktorbarzin/*` | Public image registry | -| Private Registry | Forgejo Packages | `forgejo.viktorbarzin.me/viktor` | Private container images (PAT auth, retention CronJob) — migrated from registry.viktorbarzin.me 2026-05-07 | -| Pull-Through Cache | Custom | `10.0.20.10:5000` (docker.io)
`10.0.20.10:5010` (ghcr.io) | LAN cache for remote registries | -| Kyverno | Cluster | `kyverno` namespace | Auto-sync registry credentials to all namespaces | -| Vault | Cluster | `vault.viktorbarzin.me` | K8s auth for Woodpecker pipelines | +| Component | Location | Purpose | +|-----------|----------|---------| +| GitHub Actions | `.github/workflows/build.yml` (per repo) | Build + lint + test + push image; trigger deploy; cut semver tag | +| ghcr.io | `ghcr.io/viktorbarzin/*` | Container registry for ALL owned images (public + private packages) | +| Woodpecker CI | `ci.viktorbarzin.me` | **Deploy-only** — `kubectl set image` in-cluster; plus infra applies + maintenance crons | +| Forgejo | `forgejo.viktorbarzin.me/viktor/` | **Canonical** git source (push-mirrors to GitHub). Container registry **FROZEN** (break-glass only) | +| Pull-Through Cache | `10.0.20.10:5000/5010/5020/5030/5040` | LAN cache for upstream registries (DockerHub, ghcr, Quay, k8s.gcr, Kyverno) | +| Kyverno | `kyverno` namespace | Syncs `ghcr-credentials` (private-ghcr allowlist) + `registry-credentials` to namespaces | +| Vault | `vault.viktorbarzin.me` | K8s auth for Woodpecker deploy pipelines; CI tokens in `secret/ci/global` + `secret/viktor` | ## How It Works -### Build Flow (GitHub Actions) +### The fleet pattern (every owned app) -1. **Trigger**: Git push to main/master branch -2. **Build**: GHA builds Docker image for `linux/amd64` platform only -3. **Tag**: Image tagged with 8-character commit SHA (e.g., `viktorbarzin/app:a1b2c3d4`) - - `:latest` tags are **never used** to prevent stale pull-through cache issues -4. **Push**: Image pushed to DockerHub public registry -5. **Trigger Deploy**: POST request to Woodpecker API with repo ID and commit SHA +1. **Canonical source = Forgejo** `viktor/`. A **push-mirror** + (`sync_on_commit`) pushes every commit to the GitHub mirror + `ViktorBarzin/`. The `.github/workflows/build.yml` is committed on + Forgejo and mirrors over. +2. 
**GHA `build` job** (triggers `on: push: branches: [master]` ONLY — feature + branches mirror but build/deploy nothing, the safety valve): + - lint + test + - `svu` computes the next `vX.Y.Z` from conventional commits and pushes the + tag back to **canonical Forgejo** (GHA secret `FORGEJO_GIT_TOKEN` = + write:repository PAT); `VERSION` is baked into the image + - `docker buildx` `linux/amd64`, **`provenance: false`** (single-manifest — + avoids the orphaned-index-children failure class), push + `ghcr.io/viktorbarzin/:` + `:latest` + - `delete-package-versions` keeps the newest ~10 ghcr versions +3. **GHA `deploy` job** POSTs `ci.viktorbarzin.me/api/repos//pipelines` + (the Woodpecker registration for the **GitHub mirror**, github-forge; GHA + secret `WOODPECKER_TOKEN`) with `IMAGE_TAG` + `IMAGE_NAME`. +4. **`.woodpecker/deploy.yml`** (event: **manual** only, so the raw + Forgejo→GitHub mirror pushes don't fire a tag-less deploy) runs `kubectl set + image deployment/ =` in-cluster. The `woodpecker-agent` + SA is `cluster-admin`, so the `bitnami/kubectl` step needs no + kubeconfig/RBAC. The Deployment image is in `lifecycle.ignore_changes` + (`KEEL_IGNORE_IMAGE`) so the SHA tag sticks and `terragrunt apply` doesn't + fight it. CronJobs in owned apps track `:latest` + `imagePullPolicy: Always` + instead of a deploy step. -### Deploy Flow (Woodpecker CI) +**Keel stays enrolled** as a redundant net (finds the deployed SHA already +running → no-op). -1. **Receive Webhook**: Woodpecker API receives deployment trigger from GHA -2. **Authenticate**: Pipeline uses Kubernetes ServiceAccount JWT to authenticate with Vault via K8s auth -3. **Deploy**: `kubectl set image deployment/ =viktorbarzin/:` -4. **Notify**: Slack notification on success/failure +**Tooling**: `infra/scripts/offinfra-onboard` + `infra/scripts/offinfra-templates/` +scaffold a repo onto this pattern (mirror, workflow, Woodpecker deploy repo, +old-pipeline removal, default-branch flip). 
Mirror + workflow commits go via +the Forgejo API over the internal Traefik LB +(`curl --resolve forgejo.viktorbarzin.me:443:10.0.20.203`) since the devvm +can't reach Forgejo's public hairpin. -### Project Migration Status +### ghcr package visibility -**Migrated to GHA (8 projects)**: -- Website -- k8s-portal -- claude-memory-mcp -- apple-health-data -- audiblez-web -- plotting-book -- insta2spotify -- book-search (audiobook-search) +| Visibility | Packages | Pull mechanism | +|------------|----------|----------------| +| **Public** | beadboard, nextcloud-todos, claude-agent-service, claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, chrome-service-novnc, android-emulator | Anonymous | +| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci | `ghcr-credentials` dockerconfigjson | -**Woodpecker-native owned-app builds** (build + push to the Forgejo private -registry + `kubectl set image` rollout, all in one `.woodpecker.yml`; Keel -stays enrolled as a redundant net): `tuya_bridge`, `job-hunter`, `f1-stream`. -`f1-stream` was extracted from this monorepo to `viktor/f1-stream` on -2026-06-05 (Woodpecker repo id 166); the old github source is archived and its -GHA-era Woodpecker repo (id 10) is deactivated. +Private-image pulls use the `ghcr-credentials` dockerconfigjson, cloned by the +kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit +**ALLOWLIST** of private-ghcr namespaces only (NOT cluster-wide; source +`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`). Cred = Vault +`secret/viktor/ghcr_pull_token` (an alias of the admin `github_pat` — GitHub +has no token-mint API; swap the alias value if a scoped token is ever +UI-minted). 
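
For orientation, the payload behind a `kubernetes.io/dockerconfigjson` Secret such as `ghcr-credentials` can be sketched as below. This is a minimal illustration of the standard format only. The function name, the username argument, and the sample values are hypothetical; the real Secret is cloned by the Kyverno policy from the Vault value, not generated by a script like this.

```shell
#!/bin/sh
# Sketch only: prints the standard .dockerconfigjson layout for ghcr.io.
# ghcr_dockerconfig and its arguments are hypothetical names, not taken
# from the infra repo.

# Usage: ghcr_dockerconfig <registry-username> <pull-token>
ghcr_dockerconfig() {
  # The "auth" field is base64("user:token"); strip any line wrapping
  # that base64 may add for long inputs.
  auth="$(printf '%s:%s' "$1" "$2" | base64 | tr -d '\n')"
  printf '{"auths":{"ghcr.io":{"auth":"%s"}}}\n' "$auth"
}

# To compare against what a namespace actually holds (namespace elided):
#   kubectl get secret ghcr-credentials -n <namespace> \
#     -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
```

If the decoded Secret has no `ghcr.io` entry under `auths`, the kubelet has nothing to present for private packages, which shows up as `ImagePullBackOff`.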
-**Woodpecker-only (infra + large apps)**: -- `travel_blog`: 5.7GB content directory exceeds GHA limits -- Infra pipelines: require cluster access (terragrunt apply, certbot, build-cli) +### Migrated apps (issues #13–#27) -### Woodpecker Pipeline Files +f1-stream, job-hunter, tuya_bridge, beadboard, nextcloud-todos, +claude-agent-service, claude-memory-mcp, kms-website, Freedify, +instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`), +fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original +pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website, +k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify, +audiobook-search, council-complaints) now also land on ghcr. -Each project contains: -- `.woodpecker/deploy.yml`: kubectl set image + Slack notification -- `.woodpecker/build-fallback.yml`: Legacy full build pipeline (event: deployment, never auto-fires) +### Infra-owned images (issues #29 / #30) -### Woodpecker Repository IDs +Images owned by the infra repo build on GHA workflows **in the infra repo's own +`.github/workflows/`** (the github↔forgejo divergence was deliberately NOT +reconciled — the workflows were added to the GitHub lineage via PR): -Woodpecker API uses numeric IDs (not owner/name): +| Image | Workflow | Destination | +|-------|----------|-------------| +| chrome-service-novnc | `build-chrome-service-novnc.yml` | public `ghcr.io/viktorbarzin/chrome-service-novnc` | +| android-emulator | `build-android-emulator.yml` | public `ghcr.io/viktorbarzin/android-emulator` | +| infra CLI | `build-cli.yml` | DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli` | +| infra-ci | `build-infra-ci.yml` | private `ghcr.io/viktorbarzin/infra-ci` | -| Repo | ID | -|------|------| -| infra | 1 | -| Website | 2 | -| finance | 3 | -| health | 4 | -| travel_blog | 5 | -| webhook-handler | 6 | -| audiblez-web | 9 | -| plotting-book | 43 | -| claude-memory-mcp | 78 | -| 
infra-onboarding | 79 | +**`infra-ci`** is the image the `.woodpecker/default.yml` apply step and +`drift-detection.yml` run in (proven by pipelines 165/166). `chatterbox-tts` is +already built by tripit's GHA → ghcr. -### Image Registry Flow +The Woodpecker `build-ci-image.yml` and `build-cli.yml` pipelines were +**REMOVED**. Break-glass for infra-ci is now a manual +`.woodpecker/breakglass-infra-ci.yml` (ghcr pull-and-save to the registry VM). -1. **Containerd hosts.toml** redirects pulls from docker.io and ghcr.io to pull-through cache at `10.0.20.10` -2. **Pull-through cache** serves cached images from LAN, fetches from upstream on cache miss -3. **Kyverno ClusterPolicy** auto-syncs `registry-credentials` Secret to all namespaces for private registry access -4. **Private registry** has been Forgejo's built-in OCI registry at `forgejo.viktorbarzin.me/viktor/` since 2026-05-07. Auth via PAT (Vault `secret/ci/global/forgejo_push_token` for push, `secret/viktor/forgejo_pull_token` for pull). The pre-migration `registry:2.8.3`-based private registry on `registry.viktorbarzin.me:5050` was the root cause of three orphan-index incidents in three weeks (2026-04-13, 2026-04-19, 2026-05-04 — see `docs/post-mortems/2026-04-19-registry-orphan-index.md` and the full migration writeup at `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md`). The five pull-through caches on `10.0.20.10` (ports 5000/5010/5020/5030/5040) stay in place for upstream registries. -5. **Integrity probe** (`registry-integrity-probe` CronJob in `monitoring` ns, every 15m) walks `/v2/_catalog` → tags → indexes → child manifests via HEAD and pushes `registry_manifest_integrity_failures` to Pushgateway; alerts `RegistryManifestIntegrityFailure` / `RegistryIntegrityProbeStale` / `RegistryCatalogInaccessible` page on broken state. Authoritative check (HTTP API, not filesystem). 
+### Forgejo container registry — FROZEN -### Infra Pipelines (Woodpecker-only) +Issue #32 wiped all `viktor/*` container packages (~19G reclaimed, `/data` +58%→20%). The registry is **break-glass-only** now; nothing pushes to it. The +`forgejo-cleanup` CronJob stays in `DRY_RUN` (nothing to clean). Pull-through +caches on the registry VM (`10.0.20.10`) are unchanged. See +`docs/runbooks/forgejo-registry-breakglass.md`. + +### Image registry / pull path + +1. **Containerd `hosts.toml`** redirects pulls from docker.io and ghcr.io to the + pull-through cache at `10.0.20.10` (5000 = docker.io, 5010 = ghcr.io). +2. **Pull-through cache** serves cached images from the LAN, fetches upstream on + a miss. +3. **Kyverno ClusterPolicies** sync `ghcr-credentials` (private-ghcr allowlist) + and `registry-credentials` to namespaces. + +## Woodpecker — what it still runs + +Woodpecker is **deploy + cluster-touching steps only**: | Pipeline | File | Purpose | |----------|------|---------| -| default | `.woodpecker/default.yml` | Terragrunt apply on push | -| renew-tls | `.woodpecker/renew-tls.yml` | Certbot renewal cron | -| build-cli | `.woodpecker/build-cli.yml` | Build and push to dual registries | -| build-ci-image | `.woodpecker/build-ci-image.yml` | Build `infra-ci` tooling image (triggered by `ci/Dockerfile` change or manual); post-push HEADs every blob via `verify-integrity` step to catch orphan-index pushes | -| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered build for k8s-portal subdirectory | -| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` to `/opt/registry/` on `10.0.20.10` when any managed file changes; bounces containers + nginx per `docs/runbooks/registry-vm.md` | -| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE host | -| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new `docs/post-mortems/*.md` 
via headless Claude agent | -| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift detection | -| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues | +| per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) | +| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) | +| certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron | +| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) | | provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec | +| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` → `10.0.20.10` on change | +| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE | +| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues | +| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new post-mortems | +| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered deploy for the portal | +| breakglass-infra-ci | `.woodpecker/breakglass-infra-ci.yml` | **Manual** ghcr pull-and-save of infra-ci to the registry VM | + +**No build/test pipeline exists on any repo.** Do not (re)introduce one. + +### Woodpecker API + +Uses **numeric repo IDs** (`/api/repos//pipelines`), NOT owner/name paths +(those return HTML). The deploy registration for each app is the **GitHub +mirror** repo (registered github-forge). IDs are stable across renames and must +be looked up from the Woodpecker UI/DB. + +### Woodpecker YAML gotchas + +- Commands with `${VAR}:${VAR}` must be **quoted** — an unquoted `:` triggers + YAML map parsing when the vars are empty. 
+- Use `bitnami/kubectl:latest` (not pinned versions — entrypoint compatibility). +- Global secrets must include `manual` in their events list for API-triggered + pipelines. + +### GitHub repo secrets + +Per repo: `WOODPECKER_TOKEN` (POST the deploy pipeline), `FORGEJO_GIT_TOKEN` +(write:repository PAT for the `svu` tag push). ghcr push uses the workflow's +built-in `GITHUB_TOKEN` (`packages: write`). + +## Infra repo CI topology + +The infra repo runs on Woodpecker via **two** forge registrations: the Forgejo +forge (repo id 82, registered 2026-06-08) and the legacy GitHub forge (repo id +1). Pushes to **Forgejo** `master` fire `.woodpecker/default.yml` +(changed-stacks terragrunt apply, in `infra-ci`) plus the `notify-nonadmin-push` +Slack audit step. Operational facts (2026-06-10): + +- **Webhook URL is the IN-CLUSTER service**: + `http://woodpecker-server.woodpecker.svc.cluster.local/api/hook?...` (PATCHed + via the Forgejo API). The Woodpecker default (`https://ci.viktorbarzin.me/...`) + resolves to the non-proxied public A record from pods → NAT hairpin → + intermittent `context deadline exceeded`, silently dropping push events. If + Woodpecker "repairs" the repo it rewrites the hook back to `ci.viktorbarzin.me` + — re-apply the in-cluster URL. +- **Repo-scoped secrets must exist on BOTH repos**: pipelines reference + repo-level secrets (`registry_ssh_key`, `pve_ssh_key`, `CLOUDFLARE_TOKEN`, …). + When registering a new forge repo for infra, clone the secret set too. +- **Empty commits defeat path filters**: a commit with no changed files makes + Woodpecker include ALL workflow files (path conditions can't exclude), so every + repo secret must resolve. Normal commits with real files only compile the + matching workflows. 
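
The webhook point above lends itself to a quick check. The sketch below assumes the hook URL has already been fetched (for example from the Forgejo repo hooks API, which in Gitea-compatible servers is `GET /api/v1/repos/{owner}/{repo}/hooks`); the helper name is hypothetical and the check is only a prefix match on the in-cluster service hostname quoted above.

```shell
#!/bin/sh
# Sketch only: hook_is_in_cluster is a hypothetical helper, not part of
# the infra repo. It succeeds when a webhook URL points at the in-cluster
# Woodpecker service rather than the public ci.viktorbarzin.me name.

# Usage: hook_is_in_cluster <webhook-url>
hook_is_in_cluster() {
  case "$1" in
    http://woodpecker-server.woodpecker.svc.cluster.local/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (URL value is illustrative):
#   if ! hook_is_in_cluster "$HOOK_URL"; then
#     echo "webhook was rewritten; re-apply the in-cluster URL" >&2
#   fi
```

Running a check like this after any Woodpecker "repair" would surface the silent rewrite back to `ci.viktorbarzin.me` before push events start getting dropped.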
+ +The Forgejo trigger is not fully dependable — land infra changes by pushing +Forgejo master (as viktor), use `[ci skip]` for docs/no-op commits, and verify +deploys via `scripts/tg` + live cluster state rather than trusting the CI +checkmark. The two remotes have **diverged** (parallel histories under +different SHAs); expect github pushes to reject non-fast-forward and leave them +— never force-push. ## Configuration -### GitHub Actions - -**File**: `.github/workflows/build-and-deploy.yml` +### GitHub Actions (per-app `.github/workflows/build.yml`) ```yaml -name: Build and Deploy +name: build on: push: - branches: [main, master] + branches: [master] jobs: build: runs-on: ubuntu-latest + permissions: + contents: write # svu tag push + packages: write # ghcr push steps: - - name: Build Docker image - run: docker build --platform linux/amd64 -t viktorbarzin/app:${SHORT_SHA} . - - name: Push to DockerHub - run: docker push viktorbarzin/app:${SHORT_SHA} - - name: Trigger Woodpecker Deploy + - uses: actions/checkout@v4 + - name: lint + test + run: make lint test + - name: svu tag -> Forgejo run: | - curl -X POST https://ci.viktorbarzin.me/api/repos//pipelines \ - -H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}" + VERSION=$(svu next) + # ... 
push tag to canonical Forgejo with FORGEJO_GIT_TOKEN + - uses: docker/setup-buildx-action@v3 + - uses: docker/build-push-action@v6 + with: + platforms: linux/amd64 + provenance: false + push: true + tags: | + ghcr.io/viktorbarzin/:${{ github.sha }} + ghcr.io/viktorbarzin/:latest + deploy: + needs: build + runs-on: ubuntu-latest + steps: + - name: Trigger Woodpecker deploy + run: | + curl -X POST https://ci.viktorbarzin.me/api/repos//pipelines \ + -H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}" \ + -d '{"branch":"master","variables":{"IMAGE_TAG":"...","IMAGE_NAME":"..."}}' ``` -**Required GitHub Secrets**: -- `DOCKERHUB_USERNAME` -- `DOCKERHUB_TOKEN` -- `WOODPECKER_TOKEN` - -### Woodpecker Deploy Pipeline - -**File**: `.woodpecker/deploy.yml` +### Woodpecker deploy pipeline (per-app `.woodpecker/deploy.yml`) ```yaml when: - event: [deployment] + event: manual steps: deploy: - image: bitnami/kubectl:latest + image: bitnami/kubectl:latest # uses the in-cluster woodpecker-agent SA (cluster-admin) commands: - - kubectl set image deployment/app app=viktorbarzin/app:${CI_COMMIT_SHA:0:8} - secrets: [k8s_token] - + - "kubectl set image deployment/app app=${IMAGE_NAME}:${IMAGE_TAG} -n " + - "kubectl rollout status deployment/app -n --timeout=300s" notify: image: plugins/slack - settings: - webhook: ${SLACK_WEBHOOK} when: status: [success, failure] ``` -**YAML Gotchas**: -- Commands with `${VAR}:${VAR}` syntax must be quoted to prevent YAML map parsing when vars are empty -- Use `bitnami/kubectl:latest` (not pinned versions) -- Global secrets must be manually added to `secrets:` list in pipeline +### CI/CD secrets sync -### Vault Configuration - -**K8s Auth for Woodpecker**: -- Woodpecker pipelines authenticate using ServiceAccount JWT -- Vault K8s auth mount validates JWT and issues token -- Policies grant access to secrets and dynamic credentials - -### CI/CD Secrets Sync - -**CronJob**: Pushes `secret/ci/global` from Vault → Woodpecker API every 6 hours -- Keeps 
Woodpecker global secrets in sync with Vault -- Runs in `woodpecker` namespace - -## Infra repo CI (Woodpecker repo 82 — Forgejo forge) - -The infra repo itself runs on Woodpecker via the **Forgejo** forge (repo id 82, -registered 2026-06-08; the GitHub-side repo id 1 also remains registered). -Pushes to `master` fire `.woodpecker/default.yml` (changed-stacks terragrunt -apply) plus the `notify-nonadmin-push` Slack audit step (allow-then-audit -contribution model — see `multi-tenancy.md`). Operational facts (2026-06-10): - -- **Webhook URL is the IN-CLUSTER service**: `http://woodpecker-server.woodpecker.svc.cluster.local/api/hook?...` - (PATCHed via the Forgejo API). The Woodpecker-generated default - (`https://ci.viktorbarzin.me/...`) resolves to the non-proxied public A - record from pods → NAT hairpin → intermittent `context deadline exceeded`, - silently dropping push events (found when a push produced no pipeline). - If Woodpecker ever "repairs" the repo it will rewrite the hook back to - `ci.viktorbarzin.me` — re-apply the in-cluster URL (or pin `ci.viktorbarzin.me` - in the CoreDNS pod carve-out alongside forgejo). -- **Repo-scoped secrets must exist on BOTH repos**: pipelines reference - repo-level secrets (`registry_ssh_key`, `pve_ssh_key`, `CLOUDFLARE_TOKEN`, - …). Repo 82 was registered without them and every all-workflow compile - errored with `secret "registry_ssh_key" not found`. Fixed by cloning repo-1 - rows to repo 82 in the Woodpecker DB (`insert into secrets … select … where - repo_id=1`). When registering a new forge repo for infra, clone the secret - set too. -- **Empty commits defeat path filters**: a commit with no changed files makes - Woodpecker include ALL workflow files (path conditions can't exclude), so - every repo secret must resolve. Normal commits with real files only compile - the matching workflows. 
+A CronJob in the `woodpecker` namespace pushes `secret/ci/global` from Vault → +the Woodpecker API every 6h, keeping global secrets in sync. Woodpecker deploy +pipelines authenticate to the cluster via the in-cluster `woodpecker-agent` SA +(cluster-admin); Vault K8s auth backs any secret reads. ## Decisions & Rationale -### Why GitHub Actions + Woodpecker? +### Why all builds off-infra (ADR-0002)? -**Alternatives considered**: -1. **Woodpecker-only**: Simple, but wastes cluster resources on builds -2. **GHA-only**: No cluster access, requires kubectl from outside (security risk) -3. **Hybrid (chosen)**: GHA for compute-heavy builds (free), Woodpecker for privileged deployments (secure cluster access) +- **Breaks the circular dependency** — the images needed to repair the cluster + no longer live inside it (they're on ghcr, an external registry). +- **Removes build IO + registry push load** from the contended homelab spindle. +- GHA is free on public repos and generous on private; buildx provenance:false + sidesteps the orphaned-index-children failure class that plagued the + in-cluster registry. +- **Clean cut** — no in-cluster fallback builds anywhere; one pattern, + fleet-wide. -**Benefits**: -- Free compute for builds on public repos -- Cluster access stays internal (Woodpecker has direct K8s access) -- Separation of concerns: build vs deploy +### Why ghcr (not push back to Forgejo)? -### Why 8-Character SHA Tags (Not :latest)? +Forgejo's container registry repeatedly orphaned OCI index children +(2026-04-13/19, 2026-05-04, 2026-06-10) and its retention is not container-aware. +ghcr is external (DR-safe), free for this scale, and has native multi-arch +handling. The Forgejo registry was frozen + emptied (issue #32). 
-- Pull-through cache serves stale `:latest` tags indefinitely -- SHA tags ensure every deployment pulls the correct image -- 8 characters provide sufficient collision resistance (16^8 = 4.3 billion combinations) +### Why Woodpecker stays for deploy? -### Why Numeric Repo IDs for Woodpecker API? +`kubectl set image` needs in-cluster privileged access; doing it from GHA would +mean exposing kube-apiserver or a long-lived kubeconfig. Woodpecker's +`woodpecker-agent` SA is already cluster-admin in-cluster — the deploy step +needs no credentials. -- Woodpecker API requires numeric IDs (not owner/name slugs) -- IDs are stable across repo renames -- Must be manually looked up from Woodpecker UI or database +### Why `event: manual` on deploy.yml? -### Why linux/amd64 Only? +The Forgejo→GitHub push-mirror sends raw, tag-less pushes to the GitHub mirror. +If `deploy.yml` fired on `push`, every mirror sync would trigger a deploy with no +image tag. `manual` means only the GHA `deploy` job's explicit API POST (with +`IMAGE_TAG`) deploys. -- Cluster runs on x86_64 nodes only -- ARM builds would waste time and storage -- Multi-arch images add complexity without benefit +### Why linux/amd64 only? + +The cluster runs on x86_64 nodes only; ARM builds waste time and storage. ## Troubleshooting -### GHA Build Fails: "denied: requested access to the resource is denied" +### GHA build fails: ghcr push "denied" -**Cause**: DockerHub credentials expired or incorrect +The workflow `GITHUB_TOKEN` needs `packages: write` permission and the package +must allow the repo to push. Check the workflow `permissions:` block and the +package's "Manage Actions access" settings. 
+ +### Image pull fails: "ErrImagePull" / "ImagePullBackOff" -**Fix**: ```bash -# Regenerate DockerHub token -# Update GitHub repo secrets: DOCKERHUB_USERNAME, DOCKERHUB_TOKEN +# Public image — check the pull-through cache is up +curl http://10.0.20.10:5010/v2/_catalog + +# Private image — verify the ghcr-credentials Secret exists in the namespace +kubectl get secret ghcr-credentials -n +# It's Kyverno-synced to an allowlist; if missing, the namespace isn't on the +# allowlist in stacks/kyverno/modules/kyverno/ghcr-credentials.tf ``` -### Woodpecker Deploy Fails: "Unauthorized" +If the cause is the internal-DNS hairpin (fresh pulls timing out on the public +Forgejo path), see the CoreDNS `viktorbarzin.me` carve-out in +`docs/architecture/networking.md` and `docs/runbooks/registry-vm.md`. -**Cause**: Vault K8s auth token expired or invalid +### Deploy didn't happen after a push -**Fix**: -```bash -# Restart Woodpecker pipeline (token auto-renewed) -# Check Vault K8s auth role exists: vault read auth/kubernetes/role/woodpecker-deployer -``` +Confirm the push was to **master** (feature branches build/deploy nothing). +Check the GHA run completed the `deploy` job, then check Woodpecker received the +manual pipeline (`ci.viktorbarzin.me`, the GitHub-mirror deploy repo). Verify +live with `kubectl rollout status` — not the CI checkmark. 
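
A minimal sketch of the "verify live" step, assuming the deployed tag is the 8-character commit SHA as described for the fleet pattern. The helper names are hypothetical, and the tag parsing is deliberately naive: it assumes a `name:tag` reference and does not handle digest references or untagged images behind a registry port.

```shell
#!/bin/sh
# Sketch only: helpers for comparing a running image tag with a commit.
# sha8, image_tag and deployed_matches are hypothetical names.

# First 8 characters of a commit SHA.
sha8() { printf '%.8s\n' "$1"; }

# Everything after the last ':' of an image reference.
image_tag() { printf '%s\n' "${1##*:}"; }

# Usage: deployed_matches <image-ref> <full-commit-sha>
# Succeeds when the image tag equals the short SHA of the commit.
deployed_matches() { [ "$(image_tag "$1")" = "$(sha8 "$2")" ]; }

# Example (deployment name and namespace elided):
#   image="$(kubectl get deployment <name> -n <namespace> \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')"
#   deployed_matches "$image" "$(git rev-parse HEAD)" || echo "not deployed yet"
```

A mismatch here with a green GHA run points at the Woodpecker side (pipeline never received, or `kubectl set image` failed) rather than at the build.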
-### Image Pull Fails: "ErrImagePull" +### Woodpecker deploy fails: "YAML: did not find expected key" -**Cause**: Pull-through cache or registry credentials issue - -**Fix**: -```bash -# Check pull-through cache is running -curl http://10.0.20.10:5000/v2/_catalog - -# Verify registry-credentials Secret exists in namespace -kubectl get secret registry-credentials -n - -# Manually sync credentials if missing -kubectl get secret registry-credentials -n default -o yaml | \ - sed 's/namespace: default/namespace: /' | kubectl apply -f - -``` - -### Woodpecker Pipeline: "YAML: did not find expected key" - -**Cause**: Unquoted command with `${VAR}:${VAR}` syntax when VAR is empty - -**Fix**: Quote the command: -```yaml -commands: - - "kubectl set image deployment/app app=viktorbarzin/app:${SHORT_SHA}" -``` - -### travel_blog Build Times Out on GHA - -**Cause**: 5.7GB content directory exceeds GHA disk/time limits - -**Fix**: Keep on Woodpecker (no migration). Build uses cluster storage and resources. - -### CI/CD Secrets Out of Sync - -**Cause**: CronJob failed to sync Vault → Woodpecker - -**Fix**: -```bash -# Check CronJob status -kubectl get cronjob -n woodpecker - -# Manually trigger sync -kubectl create job --from=cronjob/sync-secrets manual-sync -n woodpecker -``` +Unquoted command with `${VAR}:${VAR}` syntax when a VAR is empty. Quote the +command (see the deploy.yml example above). 
## Related -- [Databases Architecture](./databases.md) — Database credentials via Vault -- [Multi-Tenancy](./multi-tenancy.md) — Per-user Woodpecker access -- Runbook: `../runbooks/deploy-new-app.md` — How to set up CI/CD for a new app -- Runbook: `../runbooks/troubleshoot-image-pull.md` — Debug image pull issues -- Vault documentation: K8s auth configuration -- Woodpecker documentation: API reference +- ADR: `../adr/0002-all-image-builds-off-infra-gha-ghcr.md` — the decision +- [Databases Architecture](./databases.md) — database credentials via Vault +- [Multi-Tenancy](./multi-tenancy.md) — per-user Woodpecker access +- Runbook: `../runbooks/forgejo-registry-breakglass.md` — using the frozen registry +- Runbook: `../runbooks/registry-vm.md` — pull-through cache VM + image-pull debugging +- Onboarding tool: `../../scripts/offinfra-onboard` + `../../scripts/offinfra-templates/` diff --git a/stacks/android-emulator/variables.tf b/stacks/android-emulator/variables.tf index bcc24a0d..822b7527 100644 --- a/stacks/android-emulator/variables.tf +++ b/stacks/android-emulator/variables.tf @@ -6,5 +6,5 @@ variable "tls_secret_name" { variable "image_tag" { type = string default = "latest" - description = "android-emulator image tag at forgejo.viktorbarzin.me/viktor/android-emulator. Built by GHA (.github/workflows/build-android-emulator.yml) -> ghcr.io/viktorbarzin/android-emulator on changes to stacks/android-emulator/docker/ (ADR-0002). :latest tracks the newest build." + description = "android-emulator image tag at ghcr.io/viktorbarzin/android-emulator. Built by GHA (.github/workflows/build-android-emulator.yml) on changes to stacks/android-emulator/docker/ (ADR-0002). :latest tracks the newest build." 
} diff --git a/stacks/terminal/main.tf b/stacks/terminal/main.tf index c2f3f50b..3737817d 100644 --- a/stacks/terminal/main.tf +++ b/stacks/terminal/main.tf @@ -225,8 +225,11 @@ module "ingress_ro" { # https://forgejo.viktorbarzin.me/viktor/terminal-lobby # # That repo's ./scripts/deploy.sh ships everything to wizard@10.0.10.10 -# and restarts ttyd / ttyd-ro / tmux-api / clipboard-upload. This stack -# only owns the Kubernetes side: Services, Endpoints pointing at +# and restarts ttyd / ttyd-ro / tmux-api / clipboard-upload. Deploy is +# MANUAL via that script — there is no CI pipeline (the lobby's +# .woodpecker.yml was removed under ADR-0002, issue #31; it builds no +# image, so it is not part of the GHA->ghcr fleet). This stack only owns +# the Kubernetes side: Services, Endpoints pointing at # 10.0.10.10:{7681,7682,7683,7684}, the IngressRoutes, and the Traefik # middlewares that gate everything behind Authentik forward-auth. # diff --git a/stacks/tuya-bridge/variables.tf b/stacks/tuya-bridge/variables.tf index 5c2be4d3..58e0a005 100644 --- a/stacks/tuya-bridge/variables.tf +++ b/stacks/tuya-bridge/variables.tf @@ -6,5 +6,5 @@ variable "tls_secret_name" { variable "image_tag" { type = string default = "latest" - description = "tuya_bridge image tag pushed to forgejo.viktorbarzin.me/viktor/tuya_bridge. Each Woodpecker run does `kubectl set image` to the 8-char git SHA; this variable is only used on initial create / TF recreate (image is in lifecycle.ignore_changes)." + description = "tuya_bridge image tag at ghcr.io/viktorbarzin/tuya_bridge (built by GHA, ADR-0002). The GHA deploy job drives a Woodpecker `kubectl set image` to the 8-char git SHA; this variable is only used on initial create / TF recreate (image is in lifecycle.ignore_changes)." 
} From bda1bdcbf340adf30a2111c8905731d38ba5aac1 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sat, 13 Jun 2026 14:02:54 +0000 Subject: [PATCH 03/36] dbaas: widen backup CronJob startingDeadlineSeconds from 10s to 600s The daily full PostgreSQL backup silently skipped its 2026-06-13 00:00 run, leaving the last full dump 37h old and firing the critical PostgreSQLBackupStale alert. Root cause: startingDeadlineSeconds was 10s on all four dbaas backup CronJobs, so when the CronJob controller was more than 10s late to the midnight tick (many IO-heavy backups all fire at 00:00, the known etcd-starvation window) the run was dropped entirely instead of starting late. 600s lets a brief controller lag still launch the job. Applied to all four (mysql + pg, full + per-db) since they share the footgun and the midnight contention. Co-Authored-By: Claude Opus 4.8 --- stacks/dbaas/modules/dbaas/main.tf | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/stacks/dbaas/modules/dbaas/main.tf b/stacks/dbaas/modules/dbaas/main.tf index 3fc44f94..9d450689 100644 --- a/stacks/dbaas/modules/dbaas/main.tf +++ b/stacks/dbaas/modules/dbaas/main.tf @@ -427,7 +427,7 @@ resource "kubernetes_cron_job_v1" "mysql-backup" { failed_jobs_history_limit = 5 schedule = "30 0 * * *" # schedule = "* * * * *" - starting_deadline_seconds = 10 + starting_deadline_seconds = 600 successful_jobs_history_limit = 10 job_template { metadata {} @@ -519,7 +519,7 @@ resource "kubernetes_cron_job_v1" "mysql-backup-per-db" { concurrency_policy = "Replace" failed_jobs_history_limit = 3 schedule = "45 0 * * *" - starting_deadline_seconds = 10 + starting_deadline_seconds = 600 successful_jobs_history_limit = 3 job_template { metadata {} @@ -1607,7 +1607,7 @@ resource "kubernetes_cron_job_v1" "postgresql-backup" { failed_jobs_history_limit = 5 schedule = "0 0 * * *" # schedule = "* * * * *" - starting_deadline_seconds = 10 + starting_deadline_seconds = 600 successful_jobs_history_limit = 10 
job_template { metadata {} @@ -1695,7 +1695,7 @@ resource "kubernetes_cron_job_v1" "postgresql-backup-per-db" { concurrency_policy = "Replace" failed_jobs_history_limit = 3 schedule = "15 0 * * *" - starting_deadline_seconds = 10 + starting_deadline_seconds = 600 successful_jobs_history_limit = 3 job_template { metadata {} From ff3cc44a2964b526dcc3f91adca278c38dfea7f5 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sat, 13 Jun 2026 14:02:55 +0000 Subject: [PATCH 04/36] forgejo: raise memory limit from 3Gi to 6Gi (OOMKilled at 3Gi) Forgejo OOMKilled twice on 2026-06-13 at the 3Gi cap (exit 137), briefly taking the git remote and OCI registry down and spiking ingress TTFB to 4.7s and the 4xx rate to 51%. Steady-state is ~2.2Gi but it spiked into the cap (true demand above 3.2Gi). The 2026-06-09 bump to 3Gi was sized for tripit buildkit registry pushes, but that driver is gone now that the Forgejo registry was frozen and emptied today (ADR-0002, images on ghcr), so the spike is git ops / the integrity-probe catalog walk / a possible leak. 6Gi gives headroom on the critical git backbone while we watch whether working-set keeps climbing (which would indicate a leak). Co-Authored-By: Claude Opus 4.8 --- stacks/forgejo/main.tf | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/stacks/forgejo/main.tf b/stacks/forgejo/main.tf index e1b8c351..26e317a8 100644 --- a/stacks/forgejo/main.tf +++ b/stacks/forgejo/main.tf @@ -168,19 +168,25 @@ resource "kubernetes_deployment" "forgejo" { name = "data" mount_path = "/data" } - # Bumped 1Gi -> 3Gi 2026-06-09: Forgejo was OOMKilled (exit 137) - # under registry-push load from in-cluster CI builds (tripit - # buildkit pushes large layers into the OCI registry). VPA - # upperBound reads ~1.5Gi, but that's suppressed by the 1Gi cap it - # kept OOMing against — size for the push spike, not steady-state. + # Bumped 1Gi -> 3Gi 2026-06-09, then 3Gi -> 6Gi 2026-06-13. 
+ # OOMKilled again (exit 137) at the 3Gi cap on 2026-06-13 (2 + # restarts; degraded the git backbone + spiked ingress TTFB/4xx). + # Steady-state is ~2.2Gi but it spiked into the 3Gi cap (true + # demand > 3.2Gi, ceiling unknown). The original 6/9 driver (tripit + # buildkit registry pushes) is GONE — the Forgejo container registry + # was frozen + emptied 2026-06-13 (ADR-0002, images moved to ghcr) — + # so the remaining spike is git ops / the integrity-probe catalog + # walk / a possible leak. Sized for generous headroom on the + # critical git remote; if working-set creeps toward 6Gi over days + # that's a leak to fix, not more RAM. # requests=limits (Guaranteed QoS) per the repo memory convention. resources { requests = { cpu = "15m" - memory = "3Gi" + memory = "6Gi" } limits = { - memory = "3Gi" + memory = "6Gi" } } port { From ba72621e52c4f296f5d665ee11afbacfe650d1a9 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sat, 13 Jun 2026 14:13:36 +0000 Subject: [PATCH 05/36] forgejo: 6Gi exceeded namespace quota, set to 4Gi (quota ceiling) The 3Gi->6Gi bump in ff3cc44a was rejected by the forgejo namespace tier-quota (requests.memory capped at 4Gi). With Guaranteed QoS the 6Gi request exceeded quota; FailedCreate left forgejo with 0 pods for ~6 min (git remote + OCI registry outage) until I patched the live Deployment back to a schedulable 4Gi. 4Gi is the most the quota allows and is still a headroom bump over the OOM-prone 3Gi. To go higher the tier-quota must be raised in the same change. This reconciles TF to the live 4Gi so the pending/next apply is a no-op rather than reverting to the quota-busting 6Gi. 
Co-Authored-By: Claude Opus 4.8 --- stacks/forgejo/main.tf | 28 ++++++++++++++++------------ 1 file changed, 16 insertions(+), 12 deletions(-) diff --git a/stacks/forgejo/main.tf b/stacks/forgejo/main.tf index 26e317a8..f9adb955 100644 --- a/stacks/forgejo/main.tf +++ b/stacks/forgejo/main.tf @@ -168,25 +168,29 @@ resource "kubernetes_deployment" "forgejo" { name = "data" mount_path = "/data" } - # Bumped 1Gi -> 3Gi 2026-06-09, then 3Gi -> 6Gi 2026-06-13. + # Bumped 1Gi -> 3Gi 2026-06-09, then 3Gi -> 4Gi 2026-06-13. # OOMKilled again (exit 137) at the 3Gi cap on 2026-06-13 (2 - # restarts; degraded the git backbone + spiked ingress TTFB/4xx). - # Steady-state is ~2.2Gi but it spiked into the 3Gi cap (true - # demand > 3.2Gi, ceiling unknown). The original 6/9 driver (tripit - # buildkit registry pushes) is GONE — the Forgejo container registry - # was frozen + emptied 2026-06-13 (ADR-0002, images moved to ghcr) — - # so the remaining spike is git ops / the integrity-probe catalog - # walk / a possible leak. Sized for generous headroom on the - # critical git remote; if working-set creeps toward 6Gi over days - # that's a leak to fix, not more RAM. + # restarts; briefly took the git remote + OCI registry down and + # spiked ingress TTFB/4xx). Steady-state ~2.2Gi but it spiked past + # the 3Gi cap. 4Gi is the CEILING here: the forgejo namespace + # tier-quota caps requests.memory at 4Gi and Guaranteed QoS means + # request == limit, so a pod can request at most 4Gi. A first + # attempt at 6Gi was REJECTED (FailedCreate: exceeded quota) and + # left forgejo with 0 pods until reverted -- do NOT raise memory + # past 4Gi without ALSO raising the tier-quota. The 6/9 OOM driver + # (tripit buildkit registry pushes) is gone now that the Forgejo + # registry was frozen + emptied 2026-06-13 (ADR-0002, ghcr), so the + # remaining spike is git ops / integrity-probe catalog walk / a + # possible leak; 4Gi should suffice. 
If it still OOMs, raise the + # tier-quota and this limit together. # requests=limits (Guaranteed QoS) per the repo memory convention. resources { requests = { cpu = "15m" - memory = "6Gi" + memory = "4Gi" } limits = { - memory = "6Gi" + memory = "4Gi" } } port { From 9501da81a075a69d864a6cdf01485e51d4bb5f9e Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sat, 13 Jun 2026 14:22:24 +0000 Subject: [PATCH 06/36] dbaas: document postgresql-backup startingDeadlineSeconds rationale Inline note on why the four backup CronJobs moved 10s->600s (bda1bdcb): a 10s deadline silently dropped the 2026-06-13 midnight full-backup run, firing PostgreSQLBackupStale. bda1bdcb rode in the same push as a forgejo change that failed CI on a namespace-quota error, so that pipeline failed before the dbaas apply took effect (live deadline was still 10s). This dbaas-only commit re-triggers the dbaas apply at a clean master so the 600s deadline actually goes live. Co-Authored-By: Claude Opus 4.8 --- stacks/dbaas/modules/dbaas/main.tf | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/stacks/dbaas/modules/dbaas/main.tf b/stacks/dbaas/modules/dbaas/main.tf index 9d450689..479263ed 100644 --- a/stacks/dbaas/modules/dbaas/main.tf +++ b/stacks/dbaas/modules/dbaas/main.tf @@ -1607,6 +1607,11 @@ resource "kubernetes_cron_job_v1" "postgresql-backup" { failed_jobs_history_limit = 5 schedule = "0 0 * * *" # schedule = "* * * * *" + # 600s (was 10s): a 10s deadline silently DROPPED the 2026-06-13 00:00 run + # when the CronJob controller was late at the midnight backup/IO-storm tick, + # leaving the last full dump 37h old (fired PostgreSQLBackupStale). 600s lets + # a brief controller lag still launch the job. Same fix on the other three + # dbaas backup crons (they share the midnight window). 
starting_deadline_seconds = 600 successful_jobs_history_limit = 10 job_template { From b906f61ac30f7e53019181aa7985abfa534a0777 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sat, 13 Jun 2026 15:21:35 +0000 Subject: [PATCH 07/36] k8s-portal: build off-infra GHA -> ghcr + Keel; remove Woodpecker build (no-local-builds) The last in-cluster image build. GHA build-k8s-portal.yml builds ghcr.io/viktorbarzin/k8s-portal:latest+sha (path-filtered on the Dockerfile dir); Keel (force/poll/match-tag) rolls the deployment. Stack image repointed to ghcr (ignore_changes); .woodpecker/k8s-portal.yml deleted. Co-Authored-By: Claude Fable 5 --- .github/workflows/build-k8s-portal.yml | 36 ++++++++++++++ .woodpecker/k8s-portal.yml | 49 -------------------- stacks/k8s-portal/modules/k8s-portal/main.tf | 18 +++++-- 3 files changed, 50 insertions(+), 53 deletions(-) create mode 100644 .github/workflows/build-k8s-portal.yml delete mode 100644 .woodpecker/k8s-portal.yml diff --git a/.github/workflows/build-k8s-portal.yml b/.github/workflows/build-k8s-portal.yml new file mode 100644 index 00000000..c2679d43 --- /dev/null +++ b/.github/workflows/build-k8s-portal.yml @@ -0,0 +1,36 @@ +name: Build k8s-portal + +# ADR-0002 / no-local-builds: k8s-portal (infra-owned Go portal) builds off-infra +# on GHA → public ghcr; Keel polls ghcr:latest and rolls the deployment. Replaces +# the in-cluster .woodpecker/k8s-portal.yml build. 
+on: + push: + branches: [master] + paths: + - 'stacks/platform/modules/k8s-portal/files/**' + workflow_dispatch: {} + +permissions: + contents: read + packages: write + +jobs: + build: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: docker/setup-buildx-action@v3 + - uses: docker/login-action@v3 + with: + registry: ghcr.io + username: ${{ github.actor }} + password: ${{ secrets.GITHUB_TOKEN }} + - uses: docker/build-push-action@v6 + with: + context: stacks/platform/modules/k8s-portal/files + platforms: linux/amd64 + provenance: false + push: true + tags: | + ghcr.io/viktorbarzin/k8s-portal:latest + ghcr.io/viktorbarzin/k8s-portal:${{ github.sha }} diff --git a/.woodpecker/k8s-portal.yml b/.woodpecker/k8s-portal.yml deleted file mode 100644 index 39c9ff17..00000000 --- a/.woodpecker/k8s-portal.yml +++ /dev/null @@ -1,49 +0,0 @@ -when: - event: push - branch: master - path: - include: - - "stacks/platform/modules/k8s-portal/files/**" - -clone: - git: - image: woodpeckerci/plugin-git - settings: - attempts: 5 - backoff: 10s - -steps: - - name: build-and-push - image: woodpeckerci/plugin-docker-buildx - settings: - username: "viktorbarzin" - password: - from_secret: dockerhub-pat - repo: viktorbarzin/k8s-portal - dockerfile: stacks/platform/modules/k8s-portal/files/Dockerfile - context: stacks/platform/modules/k8s-portal/files - platforms: - - linux/amd64 - tag: ["${CI_PIPELINE_NUMBER}", "latest"] - cache_from: "viktorbarzin/k8s-portal:latest" - cache_to: "type=inline" - - - name: deploy - image: bitnami/kubectl:latest - commands: - - "kubectl set image deployment/k8s-portal portal=viktorbarzin/k8s-portal:${CI_PIPELINE_NUMBER} -n k8s-portal" - - "kubectl rollout status deployment/k8s-portal -n k8s-portal --timeout=120s" - - "echo 'k8s-portal deployed successfully (build ${CI_PIPELINE_NUMBER})'" - - - name: slack - image: curlimages/curl - commands: - - | - curl -s -X POST -H 'Content-type: application/json' \ - --data "{\"text\":\"K8s Portal: 
build #${CI_PIPELINE_NUMBER} ${CI_PIPELINE_STATUS}\"}" \ - "$SLACK_WEBHOOK" || true - environment: - SLACK_WEBHOOK: - from_secret: slack_webhook - when: - status: [success, failure] diff --git a/stacks/k8s-portal/modules/k8s-portal/main.tf b/stacks/k8s-portal/modules/k8s-portal/main.tf index 60057635..908fca49 100644 --- a/stacks/k8s-portal/modules/k8s-portal/main.tf +++ b/stacks/k8s-portal/modules/k8s-portal/main.tf @@ -9,7 +9,7 @@ resource "kubernetes_namespace" "k8s_portal" { metadata { name = "k8s-portal" labels = { - tier = var.tier + tier = var.tier "keel.sh/enrolled" = "true" } } @@ -40,6 +40,15 @@ resource "kubernetes_deployment" "k8s_portal" { metadata { name = "k8s-portal" namespace = kubernetes_namespace.k8s_portal.metadata[0].name + # ADR-0002 / no-local-builds: image now GHA-built -> ghcr:latest + # (.github/workflows/build-k8s-portal.yml). Keel polls ghcr:latest and rolls + # this deployment (replaces the removed Woodpecker in-cluster build+deploy). + annotations = { + "keel.sh/policy" = "force" + "keel.sh/trigger" = "poll" + "keel.sh/pollSchedule" = "@every 5m" + "keel.sh/match-tag" = "true" + } labels = { app = "k8s-portal" tier = var.tier @@ -68,7 +77,7 @@ resource "kubernetes_deployment" "k8s_portal" { spec { container { name = "portal" - image = "viktorbarzin/k8s-portal:latest" + image = "ghcr.io/viktorbarzin/k8s-portal:latest" port { container_port = 3000 } @@ -121,7 +130,8 @@ resource "kubernetes_deployment" "k8s_portal" { # DRIFT_WORKAROUND: CI pipeline owns image tag (kubectl set image from Woodpecker/GHA); Kyverno mutates dns_config for ndots. Reviewed 2026-04-18. 
ignore_changes = [ spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1 - spec[0].template[0].spec[0].container[0].image, # CI updates image tag + spec[0].template[0].spec[0].container[0].image, # Keel manages ghcr:latest digest + metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1 (Keel stamps on roll) ] } } @@ -172,5 +182,5 @@ module "ingress_setup_script" { ingress_path = ["/setup/script", "/agent"] tls_secret_name = var.tls_secret_name # auth = "none": Setup script + agent endpoint must be curl-able without auth (no cookies preserved in automation). - auth = "none" + auth = "none" } From bdfdf8db725e22314441a89c2cce639bbfb36f5f Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sat, 13 Jun 2026 15:23:46 +0000 Subject: [PATCH 08/36] fix(ci): k8s-portal build context is stacks/k8s-portal/modules/k8s-portal/files (was stale platform/ path) Co-Authored-By: Claude Fable 5 --- .github/workflows/build-k8s-portal.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.github/workflows/build-k8s-portal.yml b/.github/workflows/build-k8s-portal.yml index c2679d43..f81e13af 100644 --- a/.github/workflows/build-k8s-portal.yml +++ b/.github/workflows/build-k8s-portal.yml @@ -7,7 +7,7 @@ on: push: branches: [master] paths: - - 'stacks/platform/modules/k8s-portal/files/**' + - 'stacks/k8s-portal/modules/k8s-portal/files/**' workflow_dispatch: {} permissions: @@ -27,7 +27,7 @@ jobs: password: ${{ secrets.GITHUB_TOKEN }} - uses: docker/build-push-action@v6 with: - context: stacks/platform/modules/k8s-portal/files + context: stacks/k8s-portal/modules/k8s-portal/files platforms: linux/amd64 provenance: false push: true From a9b08c03cf52e8ca31e4ba566700711b1e898fba Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sat, 13 Jun 2026 15:26:42 +0000 Subject: [PATCH 09/36] fix(k8s-portal): npm install (no committed lockfile) so GHA can build MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 
package-lock.json was never committed to either lineage — npm ci needs it, so the build only ever worked from a manual devvm build with a local lock. npm install resolves from package.json, unblocking the GHA build (ADR-0002). Co-Authored-By: Claude Fable 5 --- stacks/k8s-portal/modules/k8s-portal/files/Dockerfile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/stacks/k8s-portal/modules/k8s-portal/files/Dockerfile b/stacks/k8s-portal/modules/k8s-portal/files/Dockerfile index aa694722..9ef4ba0c 100644 --- a/stacks/k8s-portal/modules/k8s-portal/files/Dockerfile +++ b/stacks/k8s-portal/modules/k8s-portal/files/Dockerfile @@ -1,7 +1,7 @@ FROM node:22-alpine AS build WORKDIR /app COPY package*.json ./ -RUN npm ci +RUN npm install --no-audit --no-fund COPY . . RUN npm run build From a7d33abec9e35a69c1e26b326446d72bcf6b1038 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sat, 13 Jun 2026 15:29:27 +0000 Subject: [PATCH 10/36] =?UTF-8?q?k8s-portal:=20commit=20package.json=20+?= =?UTF-8?q?=20lock=20(force;=20was=20gitignored)=20=E2=80=94=20unblocks=20?= =?UTF-8?q?GHA=20build?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Recovered the real manifest + resolved lockfile (lockfileVersion 3, 71 pkgs) from the running pod. A parent .gitignore force-ignored package.json, so the git source tree was incomplete and the image only ever built manually. Now reproducible on GHA (ADR-0002 no-local-builds). 
Co-Authored-By: Claude Fable 5 --- .../k8s-portal/files/package-lock.json | 1068 +++++++++++++++++ .../modules/k8s-portal/files/package.json | 24 + 2 files changed, 1092 insertions(+) create mode 100644 stacks/k8s-portal/modules/k8s-portal/files/package-lock.json create mode 100644 stacks/k8s-portal/modules/k8s-portal/files/package.json diff --git a/stacks/k8s-portal/modules/k8s-portal/files/package-lock.json b/stacks/k8s-portal/modules/k8s-portal/files/package-lock.json new file mode 100644 index 00000000..474c5a3b --- /dev/null +++ b/stacks/k8s-portal/modules/k8s-portal/files/package-lock.json @@ -0,0 +1,1068 @@ +{ + "name": "files", + "version": "0.0.1", + "lockfileVersion": 3, + "requires": true, + "packages": { + "node_modules/@esbuild/linux-x64": { + "version": "0.27.3", + "resolved": "https://registry.npmjs.org/@esbuild/linux-x64/-/linux-x64-0.27.3.tgz", + "integrity": "sha512-Czi8yzXUWIQYAtL/2y6vogER8pvcsOsk5cpwL4Gk5nJqH5UZiVByIY8Eorm5R13gq+DQKYg0+JyQoytLQas4dA==", + "cpu": [ + "x64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "linux" + ], + "engines": { + "node": ">=18" + } + }, + "node_modules/@jridgewell/gen-mapping": { + "version": "0.3.13", + "resolved": "https://registry.npmjs.org/@jridgewell/gen-mapping/-/gen-mapping-0.3.13.tgz", + "integrity": "sha512-2kkt/7niJ6MgEPxF0bYdQ6etZaA+fQvDcLKckhy1yIQOzaoKjBBjSj63/aLVjYE3qhRt5dvM+uUyfCg6UKCBbA==", + "dev": true, + "license": "MIT", + "dependencies": { + "@jridgewell/sourcemap-codec": "^1.5.0", + "@jridgewell/trace-mapping": "^0.3.24" + } + }, + "node_modules/@jridgewell/remapping": { + "version": "2.3.5", + "resolved": "https://registry.npmjs.org/@jridgewell/remapping/-/remapping-2.3.5.tgz", + "integrity": "sha512-LI9u/+laYG4Ds1TDKSJW2YPrIlcVYOwi2fUC6xB43lueCjgxV4lffOCZCtYFiH6TNOX+tQKXx97T4IKHbhyHEQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "@jridgewell/gen-mapping": "^0.3.5", + "@jridgewell/trace-mapping": "^0.3.24" + } + }, + 
"node_modules/@jridgewell/resolve-uri": { + "version": "3.1.2", + "resolved": "https://registry.npmjs.org/@jridgewell/resolve-uri/-/resolve-uri-3.1.2.tgz", + "integrity": "sha512-bRISgCIjP20/tbWSPWMEi54QVPRZExkuD9lJL+UIxUKtwVJA8wW1Trb1jMs1RFXo1CBTNZ/5hpC9QvmKWdopKw==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=6.0.0" + } + }, + "node_modules/@jridgewell/sourcemap-codec": { + "version": "1.5.5", + "resolved": "https://registry.npmjs.org/@jridgewell/sourcemap-codec/-/sourcemap-codec-1.5.5.tgz", + "integrity": "sha512-cYQ9310grqxueWbl+WuIUIaiUaDcj7WOq5fVhEljNVgRfOUhY9fy2zTvfoqWsnebh8Sl70VScFbICvJnLKB0Og==", + "dev": true, + "license": "MIT" + }, + "node_modules/@jridgewell/trace-mapping": { + "version": "0.3.31", + "resolved": "https://registry.npmjs.org/@jridgewell/trace-mapping/-/trace-mapping-0.3.31.tgz", + "integrity": "sha512-zzNR+SdQSDJzc8joaeP8QQoCQr8NuYx2dIIytl1QeBEZHJ9uW6hebsrYgbz8hJwUQao3TWCMtmfV8Nu1twOLAw==", + "dev": true, + "license": "MIT", + "dependencies": { + "@jridgewell/resolve-uri": "^3.1.0", + "@jridgewell/sourcemap-codec": "^1.4.14" + } + }, + "node_modules/@polka/url": { + "version": "1.0.0-next.29", + "resolved": "https://registry.npmjs.org/@polka/url/-/url-1.0.0-next.29.tgz", + "integrity": "sha512-wwQAWhWSuHaag8c4q/KN/vCoeOJYshAIvMQwD4GpSb3OiZklFfvAgmj0VCBBImRpuF/aFgIRzllXlVX93Jevww==", + "dev": true, + "license": "MIT" + }, + "node_modules/@rollup/plugin-commonjs": { + "version": "29.0.0", + "resolved": "https://registry.npmjs.org/@rollup/plugin-commonjs/-/plugin-commonjs-29.0.0.tgz", + "integrity": "sha512-U2YHaxR2cU/yAiwKJtJRhnyLk7cifnQw0zUpISsocBDoHDJn+HTV74ABqnwr5bEgWUwFZC9oFL6wLe21lHu5eQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "@rollup/pluginutils": "^5.0.1", + "commondir": "^1.0.1", + "estree-walker": "^2.0.2", + "fdir": "^6.2.0", + "is-reference": "1.2.1", + "magic-string": "^0.30.3", + "picomatch": "^4.0.2" + }, + "engines": { + "node": ">=16.0.0 || 14 >= 14.17" + }, + "peerDependencies": { + 
"rollup": "^2.68.0||^3.0.0||^4.0.0" + }, + "peerDependenciesMeta": { + "rollup": { + "optional": true + } + } + }, + "node_modules/@rollup/plugin-commonjs/node_modules/is-reference": { + "version": "1.2.1", + "resolved": "https://registry.npmjs.org/is-reference/-/is-reference-1.2.1.tgz", + "integrity": "sha512-U82MsXXiFIrjCK4otLT+o2NA2Cd2g5MLoOVXUZjIOhLurrRxpEXzI8O0KZHr3IjLvlAH1kTPYSuqer5T9ZVBKQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "@types/estree": "*" + } + }, + "node_modules/@rollup/plugin-json": { + "version": "6.1.0", + "resolved": "https://registry.npmjs.org/@rollup/plugin-json/-/plugin-json-6.1.0.tgz", + "integrity": "sha512-EGI2te5ENk1coGeADSIwZ7G2Q8CJS2sF120T7jLw4xFw9n7wIOXHo+kIYRAoVpJAN+kmqZSoO3Fp4JtoNF4ReA==", + "dev": true, + "license": "MIT", + "dependencies": { + "@rollup/pluginutils": "^5.1.0" + }, + "engines": { + "node": ">=14.0.0" + }, + "peerDependencies": { + "rollup": "^1.20.0||^2.0.0||^3.0.0||^4.0.0" + }, + "peerDependenciesMeta": { + "rollup": { + "optional": true + } + } + }, + "node_modules/@rollup/plugin-node-resolve": { + "version": "16.0.3", + "resolved": "https://registry.npmjs.org/@rollup/plugin-node-resolve/-/plugin-node-resolve-16.0.3.tgz", + "integrity": "sha512-lUYM3UBGuM93CnMPG1YocWu7X802BrNF3jW2zny5gQyLQgRFJhV1Sq0Zi74+dh/6NBx1DxFC4b4GXg9wUCG5Qg==", + "dev": true, + "license": "MIT", + "dependencies": { + "@rollup/pluginutils": "^5.0.1", + "@types/resolve": "1.20.2", + "deepmerge": "^4.2.2", + "is-module": "^1.0.0", + "resolve": "^1.22.1" + }, + "engines": { + "node": ">=14.0.0" + }, + "peerDependencies": { + "rollup": "^2.78.0||^3.0.0||^4.0.0" + }, + "peerDependenciesMeta": { + "rollup": { + "optional": true + } + } + }, + "node_modules/@rollup/pluginutils": { + "version": "5.3.0", + "resolved": "https://registry.npmjs.org/@rollup/pluginutils/-/pluginutils-5.3.0.tgz", + "integrity": "sha512-5EdhGZtnu3V88ces7s53hhfK5KSASnJZv8Lulpc04cWO3REESroJXg73DFsOmgbU2BhwV0E20bu2IDZb3VKW4Q==", + "dev": true, + "license": 
"MIT", + "dependencies": { + "@types/estree": "^1.0.0", + "estree-walker": "^2.0.2", + "picomatch": "^4.0.2" + }, + "engines": { + "node": ">=14.0.0" + }, + "peerDependencies": { + "rollup": "^1.20.0||^2.0.0||^3.0.0||^4.0.0" + }, + "peerDependenciesMeta": { + "rollup": { + "optional": true + } + } + }, + "node_modules/@rollup/rollup-linux-x64-gnu": { + "version": "4.57.1", + "resolved": "https://registry.npmjs.org/@rollup/rollup-linux-x64-gnu/-/rollup-linux-x64-gnu-4.57.1.tgz", + "integrity": "sha512-ABca4ceT4N+Tv/GtotnWAeXZUZuM/9AQyCyKYyKnpk4yoA7QIAuBt6Hkgpw8kActYlew2mvckXkvx0FfoInnLg==", + "cpu": [ + "x64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "linux" + ] + }, + "node_modules/@rollup/rollup-linux-x64-musl": { + "version": "4.57.1", + "resolved": "https://registry.npmjs.org/@rollup/rollup-linux-x64-musl/-/rollup-linux-x64-musl-4.57.1.tgz", + "integrity": "sha512-HFps0JeGtuOR2convgRRkHCekD7j+gdAuXM+/i6kGzQtFhlCtQkpwtNzkNj6QhCDp7DRJ7+qC/1Vg2jt5iSOFw==", + "cpu": [ + "x64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "linux" + ] + }, + "node_modules/@standard-schema/spec": { + "version": "1.1.0", + "resolved": "https://registry.npmjs.org/@standard-schema/spec/-/spec-1.1.0.tgz", + "integrity": "sha512-l2aFy5jALhniG5HgqrD6jXLi/rUWrKvqN/qJx6yoJsgKhblVd+iqqU4RCXavm/jPityDo5TCvKMnpjKnOriy0w==", + "dev": true, + "license": "MIT" + }, + "node_modules/@sveltejs/acorn-typescript": { + "version": "1.0.9", + "resolved": "https://registry.npmjs.org/@sveltejs/acorn-typescript/-/acorn-typescript-1.0.9.tgz", + "integrity": "sha512-lVJX6qEgs/4DOcRTpo56tmKzVPtoWAaVbL4hfO7t7NVwl9AAXzQR6cihesW1BmNMPl+bK6dreu2sOKBP2Q9CIA==", + "dev": true, + "license": "MIT", + "peerDependencies": { + "acorn": "^8.9.0" + } + }, + "node_modules/@sveltejs/adapter-auto": { + "version": "7.0.1", + "resolved": "https://registry.npmjs.org/@sveltejs/adapter-auto/-/adapter-auto-7.0.1.tgz", + "integrity": 
"sha512-dvuPm1E7M9NI/+canIQ6KKQDU2AkEefEZ2Dp7cY6uKoPq9Z/PhOXABe526UdW2mN986gjVkuSLkOYIBnS/M2LQ==", + "dev": true, + "license": "MIT", + "peerDependencies": { + "@sveltejs/kit": "^2.0.0" + } + }, + "node_modules/@sveltejs/adapter-node": { + "version": "5.5.3", + "resolved": "https://registry.npmjs.org/@sveltejs/adapter-node/-/adapter-node-5.5.3.tgz", + "integrity": "sha512-yeWbKXBL9vqDb/7R8ebvRHeuBHN4cRYYBSquNJSMQtS6rIYkXxsVSveaMTUaLvHYQsb1zNa+nH2iLTOMawBohA==", + "dev": true, + "license": "MIT", + "dependencies": { + "@rollup/plugin-commonjs": "^29.0.0", + "@rollup/plugin-json": "^6.1.0", + "@rollup/plugin-node-resolve": "^16.0.0", + "rollup": "^4.9.5" + }, + "peerDependencies": { + "@sveltejs/kit": "^2.4.0" + } + }, + "node_modules/@sveltejs/kit": { + "version": "2.52.0", + "resolved": "https://registry.npmjs.org/@sveltejs/kit/-/kit-2.52.0.tgz", + "integrity": "sha512-zG+HmJuSF7eC0e7xt2htlOcEMAdEtlVdb7+gAr+ef08EhtwUsjLxcAwBgUCJY3/5p08OVOxVZti91WfXeuLvsg==", + "dev": true, + "license": "MIT", + "dependencies": { + "@standard-schema/spec": "^1.0.0", + "@sveltejs/acorn-typescript": "^1.0.5", + "@types/cookie": "^0.6.0", + "acorn": "^8.14.1", + "cookie": "^0.6.0", + "devalue": "^5.6.2", + "esm-env": "^1.2.2", + "kleur": "^4.1.5", + "magic-string": "^0.30.5", + "mrmime": "^2.0.0", + "sade": "^1.8.1", + "set-cookie-parser": "^3.0.0", + "sirv": "^3.0.0" + }, + "bin": { + "svelte-kit": "svelte-kit.js" + }, + "engines": { + "node": ">=18.13" + }, + "peerDependencies": { + "@opentelemetry/api": "^1.0.0", + "@sveltejs/vite-plugin-svelte": "^3.0.0 || ^4.0.0-next.1 || ^5.0.0 || ^6.0.0-next.0", + "svelte": "^4.0.0 || ^5.0.0-next.0", + "typescript": "^5.3.3", + "vite": "^5.0.3 || ^6.0.0 || ^7.0.0-beta.0" + }, + "peerDependenciesMeta": { + "@opentelemetry/api": { + "optional": true + }, + "typescript": { + "optional": true + } + } + }, + "node_modules/@sveltejs/vite-plugin-svelte": { + "version": "6.2.4", + "resolved": 
"https://registry.npmjs.org/@sveltejs/vite-plugin-svelte/-/vite-plugin-svelte-6.2.4.tgz", + "integrity": "sha512-ou/d51QSdTyN26D7h6dSpusAKaZkAiGM55/AKYi+9AGZw7q85hElbjK3kEyzXHhLSnRISHOYzVge6x0jRZ7DXA==", + "dev": true, + "license": "MIT", + "dependencies": { + "@sveltejs/vite-plugin-svelte-inspector": "^5.0.0", + "deepmerge": "^4.3.1", + "magic-string": "^0.30.21", + "obug": "^2.1.0", + "vitefu": "^1.1.1" + }, + "engines": { + "node": "^20.19 || ^22.12 || >=24" + }, + "peerDependencies": { + "svelte": "^5.0.0", + "vite": "^6.3.0 || ^7.0.0" + } + }, + "node_modules/@sveltejs/vite-plugin-svelte-inspector": { + "version": "5.0.2", + "resolved": "https://registry.npmjs.org/@sveltejs/vite-plugin-svelte-inspector/-/vite-plugin-svelte-inspector-5.0.2.tgz", + "integrity": "sha512-TZzRTcEtZffICSAoZGkPSl6Etsj2torOVrx6Uw0KpXxrec9Gg6jFWQ60Q3+LmNGfZSxHRCZL7vXVZIWmuV50Ig==", + "dev": true, + "license": "MIT", + "dependencies": { + "obug": "^2.1.0" + }, + "engines": { + "node": "^20.19 || ^22.12 || >=24" + }, + "peerDependencies": { + "@sveltejs/vite-plugin-svelte": "^6.0.0-next.0", + "svelte": "^5.0.0", + "vite": "^6.3.0 || ^7.0.0" + } + }, + "node_modules/@types/cookie": { + "version": "0.6.0", + "resolved": "https://registry.npmjs.org/@types/cookie/-/cookie-0.6.0.tgz", + "integrity": "sha512-4Kh9a6B2bQciAhf7FSuMRRkUWecJgJu9nPnx3yzpsfXX/c50REIqpHY4C82bXP90qrLtXtkDxTZosYO3UpOwlA==", + "dev": true, + "license": "MIT" + }, + "node_modules/@types/estree": { + "version": "1.0.8", + "resolved": "https://registry.npmjs.org/@types/estree/-/estree-1.0.8.tgz", + "integrity": "sha512-dWHzHa2WqEXI/O1E9OjrocMTKJl2mSrEolh1Iomrv6U+JuNwaHXsXx9bLu5gG7BUWFIN0skIQJQ/L1rIex4X6w==", + "dev": true, + "license": "MIT" + }, + "node_modules/@types/resolve": { + "version": "1.20.2", + "resolved": "https://registry.npmjs.org/@types/resolve/-/resolve-1.20.2.tgz", + "integrity": "sha512-60BCwRFOZCQhDncwQdxxeOEEkbc5dIMccYLwbxsS4TUNeVECQ/pBJ0j09mrHOl/JJvpRPGwO9SvE4nR2Nb/a4Q==", + "dev": true, + "license": 
"MIT" + }, + "node_modules/@types/trusted-types": { + "version": "2.0.7", + "resolved": "https://registry.npmjs.org/@types/trusted-types/-/trusted-types-2.0.7.tgz", + "integrity": "sha512-ScaPdn1dQczgbl0QFTeTOmVHFULt394XJgOQNoyVhZ6r2vLnMLJfBPd53SB52T/3G36VI1/g2MZaX0cwDuXsfw==", + "dev": true, + "license": "MIT" + }, + "node_modules/acorn": { + "version": "8.15.0", + "resolved": "https://registry.npmjs.org/acorn/-/acorn-8.15.0.tgz", + "integrity": "sha512-NZyJarBfL7nWwIq+FDL6Zp/yHEhePMNnnJ0y3qfieCrmNvYct8uvtiV41UvlSe6apAfk0fY1FbWx+NwfmpvtTg==", + "dev": true, + "license": "MIT", + "bin": { + "acorn": "bin/acorn" + }, + "engines": { + "node": ">=0.4.0" + } + }, + "node_modules/aria-query": { + "version": "5.3.2", + "resolved": "https://registry.npmjs.org/aria-query/-/aria-query-5.3.2.tgz", + "integrity": "sha512-COROpnaoap1E2F000S62r6A60uHZnmlvomhfyT2DlTcrY1OrBKn2UhH7qn5wTC9zMvD0AY7csdPSNwKP+7WiQw==", + "dev": true, + "license": "Apache-2.0", + "engines": { + "node": ">= 0.4" + } + }, + "node_modules/axobject-query": { + "version": "4.1.0", + "resolved": "https://registry.npmjs.org/axobject-query/-/axobject-query-4.1.0.tgz", + "integrity": "sha512-qIj0G9wZbMGNLjLmg1PT6v2mE9AH2zlnADJD/2tC6E00hgmhUOfEB6greHPAfLRSufHqROIUTkw6E+M3lH0PTQ==", + "dev": true, + "license": "Apache-2.0", + "engines": { + "node": ">= 0.4" + } + }, + "node_modules/chokidar": { + "version": "4.0.3", + "resolved": "https://registry.npmjs.org/chokidar/-/chokidar-4.0.3.tgz", + "integrity": "sha512-Qgzu8kfBvo+cA4962jnP1KkS6Dop5NS6g7R5LFYJr4b8Ub94PPQXUksCw9PvXoeXPRRddRNC5C1JQUR2SMGtnA==", + "dev": true, + "license": "MIT", + "dependencies": { + "readdirp": "^4.0.1" + }, + "engines": { + "node": ">= 14.16.0" + }, + "funding": { + "url": "https://paulmillr.com/funding/" + } + }, + "node_modules/clsx": { + "version": "2.1.1", + "resolved": "https://registry.npmjs.org/clsx/-/clsx-2.1.1.tgz", + "integrity": "sha512-eYm0QWBtUrBWZWG0d386OGAw16Z995PiOVo2B7bjWSbHedGl5e0ZWaq65kOGgUSNesEIDkB9ISbTg/JK9dhCZA==", + 
"dev": true, + "license": "MIT", + "engines": { + "node": ">=6" + } + }, + "node_modules/commondir": { + "version": "1.0.1", + "resolved": "https://registry.npmjs.org/commondir/-/commondir-1.0.1.tgz", + "integrity": "sha512-W9pAhw0ja1Edb5GVdIF1mjZw/ASI0AlShXM83UUGe2DVr5TdAPEA1OA8m/g8zWp9x6On7gqufY+FatDbC3MDQg==", + "dev": true, + "license": "MIT" + }, + "node_modules/cookie": { + "version": "0.6.0", + "resolved": "https://registry.npmjs.org/cookie/-/cookie-0.6.0.tgz", + "integrity": "sha512-U71cyTamuh1CRNCfpGY6to28lxvNwPG4Guz/EVjgf3Jmzv0vlDp1atT9eS5dDjMYHucpHbWns6Lwf3BKz6svdw==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/deepmerge": { + "version": "4.3.1", + "resolved": "https://registry.npmjs.org/deepmerge/-/deepmerge-4.3.1.tgz", + "integrity": "sha512-3sUqbMEc77XqpdNO7FRyRog+eW3ph+GYCbj+rK+uYyRMuwsVy0rMiVtPn+QJlKFvWP/1PYpapqYn0Me2knFn+A==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=0.10.0" + } + }, + "node_modules/devalue": { + "version": "5.6.2", + "resolved": "https://registry.npmjs.org/devalue/-/devalue-5.6.2.tgz", + "integrity": "sha512-nPRkjWzzDQlsejL1WVifk5rvcFi/y1onBRxjaFMjZeR9mFpqu2gmAZ9xUB9/IEanEP/vBtGeGganC/GO1fmufg==", + "dev": true, + "license": "MIT" + }, + "node_modules/esbuild": { + "version": "0.27.3", + "resolved": "https://registry.npmjs.org/esbuild/-/esbuild-0.27.3.tgz", + "integrity": "sha512-8VwMnyGCONIs6cWue2IdpHxHnAjzxnw2Zr7MkVxB2vjmQ2ivqGFb4LEG3SMnv0Gb2F/G/2yA8zUaiL1gywDCCg==", + "dev": true, + "hasInstallScript": true, + "license": "MIT", + "bin": { + "esbuild": "bin/esbuild" + }, + "engines": { + "node": ">=18" + }, + "optionalDependencies": { + "@esbuild/aix-ppc64": "0.27.3", + "@esbuild/android-arm": "0.27.3", + "@esbuild/android-arm64": "0.27.3", + "@esbuild/android-x64": "0.27.3", + "@esbuild/darwin-arm64": "0.27.3", + "@esbuild/darwin-x64": "0.27.3", + "@esbuild/freebsd-arm64": "0.27.3", + "@esbuild/freebsd-x64": "0.27.3", + "@esbuild/linux-arm": "0.27.3", + 
"@esbuild/linux-arm64": "0.27.3", + "@esbuild/linux-ia32": "0.27.3", + "@esbuild/linux-loong64": "0.27.3", + "@esbuild/linux-mips64el": "0.27.3", + "@esbuild/linux-ppc64": "0.27.3", + "@esbuild/linux-riscv64": "0.27.3", + "@esbuild/linux-s390x": "0.27.3", + "@esbuild/linux-x64": "0.27.3", + "@esbuild/netbsd-arm64": "0.27.3", + "@esbuild/netbsd-x64": "0.27.3", + "@esbuild/openbsd-arm64": "0.27.3", + "@esbuild/openbsd-x64": "0.27.3", + "@esbuild/openharmony-arm64": "0.27.3", + "@esbuild/sunos-x64": "0.27.3", + "@esbuild/win32-arm64": "0.27.3", + "@esbuild/win32-ia32": "0.27.3", + "@esbuild/win32-x64": "0.27.3" + } + }, + "node_modules/esm-env": { + "version": "1.2.2", + "resolved": "https://registry.npmjs.org/esm-env/-/esm-env-1.2.2.tgz", + "integrity": "sha512-Epxrv+Nr/CaL4ZcFGPJIYLWFom+YeV1DqMLHJoEd9SYRxNbaFruBwfEX/kkHUJf55j2+TUbmDcmuilbP1TmXHA==", + "dev": true, + "license": "MIT" + }, + "node_modules/esrap": { + "version": "2.2.3", + "resolved": "https://registry.npmjs.org/esrap/-/esrap-2.2.3.tgz", + "integrity": "sha512-8fOS+GIGCQZl/ZIlhl59htOlms6U8NvX6ZYgYHpRU/b6tVSh3uHkOHZikl3D4cMbYM0JlpBe+p/BkZEi8J9XIQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "@jridgewell/sourcemap-codec": "^1.4.15" + } + }, + "node_modules/estree-walker": { + "version": "2.0.2", + "resolved": "https://registry.npmjs.org/estree-walker/-/estree-walker-2.0.2.tgz", + "integrity": "sha512-Rfkk/Mp/DL7JVje3u18FxFujQlTNR2q6QfMSMB7AvCBx91NGj/ba3kCfza0f6dVDbw7YlRf/nDrn7pQrCCyQ/w==", + "dev": true, + "license": "MIT" + }, + "node_modules/fdir": { + "version": "6.5.0", + "resolved": "https://registry.npmjs.org/fdir/-/fdir-6.5.0.tgz", + "integrity": "sha512-tIbYtZbucOs0BRGqPJkshJUYdL+SDH7dVM8gjy+ERp3WAUjLEFJE+02kanyHtwjWOnwrKYBiwAmM0p4kLJAnXg==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=12.0.0" + }, + "peerDependencies": { + "picomatch": "^3 || ^4" + }, + "peerDependenciesMeta": { + "picomatch": { + "optional": true + } + } + }, + "node_modules/function-bind": { 
+ "version": "1.1.2", + "resolved": "https://registry.npmjs.org/function-bind/-/function-bind-1.1.2.tgz", + "integrity": "sha512-7XHNxH7qX9xG5mIwxkhumTox/MIRNcOgDrxWsMt2pAr23WHp6MrRlN7FBSFpCpr+oVO0F744iUgR82nJMfG2SA==", + "dev": true, + "license": "MIT", + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/hasown": { + "version": "2.0.2", + "resolved": "https://registry.npmjs.org/hasown/-/hasown-2.0.2.tgz", + "integrity": "sha512-0hJU9SCPvmMzIBdZFqNPXWa6dqh7WdH0cII9y+CyS8rG3nL48Bclra9HmKhVVUHyPWNH5Y7xDwAB7bfgSjkUMQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "function-bind": "^1.1.2" + }, + "engines": { + "node": ">= 0.4" + } + }, + "node_modules/is-core-module": { + "version": "2.16.1", + "resolved": "https://registry.npmjs.org/is-core-module/-/is-core-module-2.16.1.tgz", + "integrity": "sha512-UfoeMA6fIJ8wTYFEUjelnaGI67v6+N7qXJEvQuIGa99l4xsCruSYOVSQ0uPANn4dAzm8lkYPaKLrrijLq7x23w==", + "dev": true, + "license": "MIT", + "dependencies": { + "hasown": "^2.0.2" + }, + "engines": { + "node": ">= 0.4" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/is-module": { + "version": "1.0.0", + "resolved": "https://registry.npmjs.org/is-module/-/is-module-1.0.0.tgz", + "integrity": "sha512-51ypPSPCoTEIN9dy5Oy+h4pShgJmPCygKfyRCISBI+JoWT/2oJvK8QPxmwv7b/p239jXrm9M1mlQbyKJ5A152g==", + "dev": true, + "license": "MIT" + }, + "node_modules/is-reference": { + "version": "3.0.3", + "resolved": "https://registry.npmjs.org/is-reference/-/is-reference-3.0.3.tgz", + "integrity": "sha512-ixkJoqQvAP88E6wLydLGGqCJsrFUnqoH6HnaczB8XmDH1oaWU+xxdptvikTgaEhtZ53Ky6YXiBuUI2WXLMCwjw==", + "dev": true, + "license": "MIT", + "dependencies": { + "@types/estree": "^1.0.6" + } + }, + "node_modules/kleur": { + "version": "4.1.5", + "resolved": "https://registry.npmjs.org/kleur/-/kleur-4.1.5.tgz", + "integrity": "sha512-o+NO+8WrRiQEE4/7nwRJhN1HWpVmJm511pBHUxPLtp0BUISzlBplORYSmTclCnJvQq2tKu/sgl3xVpkc7ZWuQQ==", + 
"dev": true, + "license": "MIT", + "engines": { + "node": ">=6" + } + }, + "node_modules/locate-character": { + "version": "3.0.0", + "resolved": "https://registry.npmjs.org/locate-character/-/locate-character-3.0.0.tgz", + "integrity": "sha512-SW13ws7BjaeJ6p7Q6CO2nchbYEc3X3J6WrmTTDto7yMPqVSZTUyY5Tjbid+Ab8gLnATtygYtiDIJGQRRn2ZOiA==", + "dev": true, + "license": "MIT" + }, + "node_modules/magic-string": { + "version": "0.30.21", + "resolved": "https://registry.npmjs.org/magic-string/-/magic-string-0.30.21.tgz", + "integrity": "sha512-vd2F4YUyEXKGcLHoq+TEyCjxueSeHnFxyyjNp80yg0XV4vUhnDer/lvvlqM/arB5bXQN5K2/3oinyCRyx8T2CQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "@jridgewell/sourcemap-codec": "^1.5.5" + } + }, + "node_modules/mri": { + "version": "1.2.0", + "resolved": "https://registry.npmjs.org/mri/-/mri-1.2.0.tgz", + "integrity": "sha512-tzzskb3bG8LvYGFF/mDTpq3jpI6Q9wc3LEmBaghu+DdCssd1FakN7Bc0hVNmEyGq1bq3RgfkCb3cmQLpNPOroA==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=4" + } + }, + "node_modules/mrmime": { + "version": "2.0.1", + "resolved": "https://registry.npmjs.org/mrmime/-/mrmime-2.0.1.tgz", + "integrity": "sha512-Y3wQdFg2Va6etvQ5I82yUhGdsKrcYox6p7FfL1LbK2J4V01F9TGlepTIhnK24t7koZibmg82KGglhA1XK5IsLQ==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=10" + } + }, + "node_modules/nanoid": { + "version": "3.3.11", + "resolved": "https://registry.npmjs.org/nanoid/-/nanoid-3.3.11.tgz", + "integrity": "sha512-N8SpfPUnUp1bK+PMYW8qSWdl9U+wwNWI4QKxOYDy9JAro3WMX7p2OeVRF9v+347pnakNevPmiHhNmZ2HbFA76w==", + "dev": true, + "funding": [ + { + "type": "github", + "url": "https://github.com/sponsors/ai" + } + ], + "license": "MIT", + "bin": { + "nanoid": "bin/nanoid.cjs" + }, + "engines": { + "node": "^10 || ^12 || ^13.7 || ^14 || >=15.0.1" + } + }, + "node_modules/obug": { + "version": "2.1.1", + "resolved": "https://registry.npmjs.org/obug/-/obug-2.1.1.tgz", + "integrity": 
"sha512-uTqF9MuPraAQ+IsnPf366RG4cP9RtUi7MLO1N3KEc+wb0a6yKpeL0lmk2IB1jY5KHPAlTc6T/JRdC/YqxHNwkQ==", + "dev": true, + "funding": [ + "https://github.com/sponsors/sxzz", + "https://opencollective.com/debug" + ], + "license": "MIT" + }, + "node_modules/path-parse": { + "version": "1.0.7", + "resolved": "https://registry.npmjs.org/path-parse/-/path-parse-1.0.7.tgz", + "integrity": "sha512-LDJzPVEEEPR+y48z93A0Ed0yXb8pAByGWo/k5YYdYgpY2/2EsOsksJrq7lOHxryrVOn1ejG6oAp8ahvOIQD8sw==", + "dev": true, + "license": "MIT" + }, + "node_modules/picocolors": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/picocolors/-/picocolors-1.1.1.tgz", + "integrity": "sha512-xceH2snhtb5M9liqDsmEw56le376mTZkEX/jEb/RxNFyegNul7eNslCXP9FDj/Lcu0X8KEyMceP2ntpaHrDEVA==", + "dev": true, + "license": "ISC" + }, + "node_modules/picomatch": { + "version": "4.0.3", + "resolved": "https://registry.npmjs.org/picomatch/-/picomatch-4.0.3.tgz", + "integrity": "sha512-5gTmgEY/sqK6gFXLIsQNH19lWb4ebPDLA4SdLP7dsWkIXHWlG66oPuVvXSGFPppYZz8ZDZq0dYYrbHfBCVUb1Q==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=12" + }, + "funding": { + "url": "https://github.com/sponsors/jonschlinkert" + } + }, + "node_modules/postcss": { + "version": "8.5.6", + "resolved": "https://registry.npmjs.org/postcss/-/postcss-8.5.6.tgz", + "integrity": "sha512-3Ybi1tAuwAP9s0r1UQ2J4n5Y0G05bJkpUIO0/bI9MhwmD70S5aTWbXGBwxHrelT+XM1k6dM0pk+SwNkpTRN7Pg==", + "dev": true, + "funding": [ + { + "type": "opencollective", + "url": "https://opencollective.com/postcss/" + }, + { + "type": "tidelift", + "url": "https://tidelift.com/funding/github/npm/postcss" + }, + { + "type": "github", + "url": "https://github.com/sponsors/ai" + } + ], + "license": "MIT", + "dependencies": { + "nanoid": "^3.3.11", + "picocolors": "^1.1.1", + "source-map-js": "^1.2.1" + }, + "engines": { + "node": "^10 || ^12 || >=14" + } + }, + "node_modules/readdirp": { + "version": "4.1.2", + "resolved": 
"https://registry.npmjs.org/readdirp/-/readdirp-4.1.2.tgz", + "integrity": "sha512-GDhwkLfywWL2s6vEjyhri+eXmfH6j1L7JE27WhqLeYzoh/A3DBaYGEj2H/HFZCn/kMfim73FXxEJTw06WtxQwg==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 14.18.0" + }, + "funding": { + "type": "individual", + "url": "https://paulmillr.com/funding/" + } + }, + "node_modules/resolve": { + "version": "1.22.11", + "resolved": "https://registry.npmjs.org/resolve/-/resolve-1.22.11.tgz", + "integrity": "sha512-RfqAvLnMl313r7c9oclB1HhUEAezcpLjz95wFH4LVuhk9JF/r22qmVP9AMmOU4vMX7Q8pN8jwNg/CSpdFnMjTQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "is-core-module": "^2.16.1", + "path-parse": "^1.0.7", + "supports-preserve-symlinks-flag": "^1.0.0" + }, + "bin": { + "resolve": "bin/resolve" + }, + "engines": { + "node": ">= 0.4" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/rollup": { + "version": "4.57.1", + "resolved": "https://registry.npmjs.org/rollup/-/rollup-4.57.1.tgz", + "integrity": "sha512-oQL6lgK3e2QZeQ7gcgIkS2YZPg5slw37hYufJ3edKlfQSGGm8ICoxswK15ntSzF/a8+h7ekRy7k7oWc3BQ7y8A==", + "dev": true, + "license": "MIT", + "dependencies": { + "@types/estree": "1.0.8" + }, + "bin": { + "rollup": "dist/bin/rollup" + }, + "engines": { + "node": ">=18.0.0", + "npm": ">=8.0.0" + }, + "optionalDependencies": { + "@rollup/rollup-android-arm-eabi": "4.57.1", + "@rollup/rollup-android-arm64": "4.57.1", + "@rollup/rollup-darwin-arm64": "4.57.1", + "@rollup/rollup-darwin-x64": "4.57.1", + "@rollup/rollup-freebsd-arm64": "4.57.1", + "@rollup/rollup-freebsd-x64": "4.57.1", + "@rollup/rollup-linux-arm-gnueabihf": "4.57.1", + "@rollup/rollup-linux-arm-musleabihf": "4.57.1", + "@rollup/rollup-linux-arm64-gnu": "4.57.1", + "@rollup/rollup-linux-arm64-musl": "4.57.1", + "@rollup/rollup-linux-loong64-gnu": "4.57.1", + "@rollup/rollup-linux-loong64-musl": "4.57.1", + "@rollup/rollup-linux-ppc64-gnu": "4.57.1", + "@rollup/rollup-linux-ppc64-musl": "4.57.1", 
+ "@rollup/rollup-linux-riscv64-gnu": "4.57.1", + "@rollup/rollup-linux-riscv64-musl": "4.57.1", + "@rollup/rollup-linux-s390x-gnu": "4.57.1", + "@rollup/rollup-linux-x64-gnu": "4.57.1", + "@rollup/rollup-linux-x64-musl": "4.57.1", + "@rollup/rollup-openbsd-x64": "4.57.1", + "@rollup/rollup-openharmony-arm64": "4.57.1", + "@rollup/rollup-win32-arm64-msvc": "4.57.1", + "@rollup/rollup-win32-ia32-msvc": "4.57.1", + "@rollup/rollup-win32-x64-gnu": "4.57.1", + "@rollup/rollup-win32-x64-msvc": "4.57.1", + "fsevents": "~2.3.2" + } + }, + "node_modules/sade": { + "version": "1.8.1", + "resolved": "https://registry.npmjs.org/sade/-/sade-1.8.1.tgz", + "integrity": "sha512-xal3CZX1Xlo/k4ApwCFrHVACi9fBqJ7V+mwhBsuf/1IOKbBy098Fex+Wa/5QMubw09pSZ/u8EY8PWgevJsXp1A==", + "dev": true, + "license": "MIT", + "dependencies": { + "mri": "^1.1.0" + }, + "engines": { + "node": ">=6" + } + }, + "node_modules/set-cookie-parser": { + "version": "3.0.1", + "resolved": "https://registry.npmjs.org/set-cookie-parser/-/set-cookie-parser-3.0.1.tgz", + "integrity": "sha512-n7Z7dXZhJbwuAHhNzkTti6Aw9QDDjZtm3JTpTGATIdNzdQz5GuFs22w90BcvF4INfnrL5xrX3oGsuqO5Dx3A1Q==", + "dev": true, + "license": "MIT" + }, + "node_modules/sirv": { + "version": "3.0.2", + "resolved": "https://registry.npmjs.org/sirv/-/sirv-3.0.2.tgz", + "integrity": "sha512-2wcC/oGxHis/BoHkkPwldgiPSYcpZK3JU28WoMVv55yHJgcZ8rlXvuG9iZggz+sU1d4bRgIGASwyWqjxu3FM0g==", + "dev": true, + "license": "MIT", + "dependencies": { + "@polka/url": "^1.0.0-next.24", + "mrmime": "^2.0.0", + "totalist": "^3.0.0" + }, + "engines": { + "node": ">=18" + } + }, + "node_modules/source-map-js": { + "version": "1.2.1", + "resolved": "https://registry.npmjs.org/source-map-js/-/source-map-js-1.2.1.tgz", + "integrity": "sha512-UXWMKhLOwVKb728IUtQPXxfYU+usdybtUrK/8uGE8CQMvrhOpwvzDBwj0QhSL7MQc7vIsISBG8VQ8+IDQxpfQA==", + "dev": true, + "license": "BSD-3-Clause", + "engines": { + "node": ">=0.10.0" + } + }, + "node_modules/supports-preserve-symlinks-flag": { + 
"version": "1.0.0", + "resolved": "https://registry.npmjs.org/supports-preserve-symlinks-flag/-/supports-preserve-symlinks-flag-1.0.0.tgz", + "integrity": "sha512-ot0WnXS9fgdkgIcePe6RHNk1WA8+muPa6cSjeR3V8K27q9BB1rTE3R1p7Hv0z1ZyAc8s6Vvv8DIyWf681MAt0w==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 0.4" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/svelte": { + "version": "5.51.3", + "resolved": "https://registry.npmjs.org/svelte/-/svelte-5.51.3.tgz", + "integrity": "sha512-3+ni7BMjiEQeMCa1fDQzHy2ESAebgQDVOTuE4jlj2/QOAB2grRta8ew80p95miWE+ZmimpL7B3t9SSO4rv0aqQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "@jridgewell/remapping": "^2.3.4", + "@jridgewell/sourcemap-codec": "^1.5.0", + "@sveltejs/acorn-typescript": "^1.0.5", + "@types/estree": "^1.0.5", + "@types/trusted-types": "^2.0.7", + "acorn": "^8.12.1", + "aria-query": "^5.3.1", + "axobject-query": "^4.1.0", + "clsx": "^2.1.1", + "devalue": "^5.6.2", + "esm-env": "^1.2.1", + "esrap": "^2.2.2", + "is-reference": "^3.0.3", + "locate-character": "^3.0.0", + "magic-string": "^0.30.11", + "zimmerframe": "^1.1.2" + }, + "engines": { + "node": ">=18" + } + }, + "node_modules/svelte-check": { + "version": "4.4.0", + "resolved": "https://registry.npmjs.org/svelte-check/-/svelte-check-4.4.0.tgz", + "integrity": "sha512-gB3FdEPb8tPO3Y7Dzc6d/Pm/KrXAhK+0Fk+LkcysVtupvAh6Y/IrBCEZNupq57oh0hcwlxCUamu/rq7GtvfSEg==", + "dev": true, + "license": "MIT", + "dependencies": { + "@jridgewell/trace-mapping": "^0.3.25", + "chokidar": "^4.0.1", + "fdir": "^6.2.0", + "picocolors": "^1.0.0", + "sade": "^1.7.4" + }, + "bin": { + "svelte-check": "bin/svelte-check" + }, + "engines": { + "node": ">= 18.0.0" + }, + "peerDependencies": { + "svelte": "^4.0.0 || ^5.0.0-next.0", + "typescript": ">=5.0.0" + } + }, + "node_modules/tinyglobby": { + "version": "0.2.15", + "resolved": "https://registry.npmjs.org/tinyglobby/-/tinyglobby-0.2.15.tgz", + "integrity": 
"sha512-j2Zq4NyQYG5XMST4cbs02Ak8iJUdxRM0XI5QyxXuZOzKOINmWurp3smXu3y5wDcJrptwpSjgXHzIQxR0omXljQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "fdir": "^6.5.0", + "picomatch": "^4.0.3" + }, + "engines": { + "node": ">=12.0.0" + }, + "funding": { + "url": "https://github.com/sponsors/SuperchupuDev" + } + }, + "node_modules/totalist": { + "version": "3.0.1", + "resolved": "https://registry.npmjs.org/totalist/-/totalist-3.0.1.tgz", + "integrity": "sha512-sf4i37nQ2LBx4m3wB74y+ubopq6W/dIzXg0FDGjsYnZHVa1Da8FH853wlL2gtUhg+xJXjfk3kUZS3BRoQeoQBQ==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=6" + } + }, + "node_modules/typescript": { + "version": "5.9.3", + "resolved": "https://registry.npmjs.org/typescript/-/typescript-5.9.3.tgz", + "integrity": "sha512-jl1vZzPDinLr9eUt3J/t7V6FgNEw9QjvBPdysz9KfQDD41fQrC2Y4vKQdiaUpFT4bXlb1RHhLpp8wtm6M5TgSw==", + "dev": true, + "license": "Apache-2.0", + "bin": { + "tsc": "bin/tsc", + "tsserver": "bin/tsserver" + }, + "engines": { + "node": ">=14.17" + } + }, + "node_modules/vite": { + "version": "7.3.1", + "resolved": "https://registry.npmjs.org/vite/-/vite-7.3.1.tgz", + "integrity": "sha512-w+N7Hifpc3gRjZ63vYBXA56dvvRlNWRczTdmCBBa+CotUzAPf5b7YMdMR/8CQoeYE5LX3W4wj6RYTgonm1b9DA==", + "dev": true, + "license": "MIT", + "dependencies": { + "esbuild": "^0.27.0", + "fdir": "^6.5.0", + "picomatch": "^4.0.3", + "postcss": "^8.5.6", + "rollup": "^4.43.0", + "tinyglobby": "^0.2.15" + }, + "bin": { + "vite": "bin/vite.js" + }, + "engines": { + "node": "^20.19.0 || >=22.12.0" + }, + "funding": { + "url": "https://github.com/vitejs/vite?sponsor=1" + }, + "optionalDependencies": { + "fsevents": "~2.3.3" + }, + "peerDependencies": { + "@types/node": "^20.19.0 || >=22.12.0", + "jiti": ">=1.21.0", + "less": "^4.0.0", + "lightningcss": "^1.21.0", + "sass": "^1.70.0", + "sass-embedded": "^1.70.0", + "stylus": ">=0.54.8", + "sugarss": "^5.0.0", + "terser": "^5.16.0", + "tsx": "^4.8.1", + "yaml": "^2.4.2" + }, + 
"peerDependenciesMeta": { + "@types/node": { + "optional": true + }, + "jiti": { + "optional": true + }, + "less": { + "optional": true + }, + "lightningcss": { + "optional": true + }, + "sass": { + "optional": true + }, + "sass-embedded": { + "optional": true + }, + "stylus": { + "optional": true + }, + "sugarss": { + "optional": true + }, + "terser": { + "optional": true + }, + "tsx": { + "optional": true + }, + "yaml": { + "optional": true + } + } + }, + "node_modules/vitefu": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/vitefu/-/vitefu-1.1.1.tgz", + "integrity": "sha512-B/Fegf3i8zh0yFbpzZ21amWzHmuNlLlmJT6n7bu5e+pCHUKQIfXSYokrqOBGEMMe9UG2sostKQF9mml/vYaWJQ==", + "dev": true, + "license": "MIT", + "workspaces": [ + "tests/deps/*", + "tests/projects/*", + "tests/projects/workspace/packages/*" + ], + "peerDependencies": { + "vite": "^3.0.0 || ^4.0.0 || ^5.0.0 || ^6.0.0 || ^7.0.0-beta.0" + }, + "peerDependenciesMeta": { + "vite": { + "optional": true + } + } + }, + "node_modules/zimmerframe": { + "version": "1.1.4", + "resolved": "https://registry.npmjs.org/zimmerframe/-/zimmerframe-1.1.4.tgz", + "integrity": "sha512-B58NGBEoc8Y9MWWCQGl/gq9xBCe4IiKM0a2x7GZdQKOW5Exr8S1W24J6OgM1njK8xCRGvAJIL/MxXHf6SkmQKQ==", + "dev": true, + "license": "MIT" + } + } +} diff --git a/stacks/k8s-portal/modules/k8s-portal/files/package.json b/stacks/k8s-portal/modules/k8s-portal/files/package.json new file mode 100644 index 00000000..6018d9b1 --- /dev/null +++ b/stacks/k8s-portal/modules/k8s-portal/files/package.json @@ -0,0 +1,24 @@ +{ + "name": "k8s-portal", + "private": true, + "version": "0.0.1", + "type": "module", + "scripts": { + "dev": "vite dev", + "build": "vite build", + "preview": "vite preview", + "prepare": "svelte-kit sync || echo ''", + "check": "svelte-kit sync && svelte-check --tsconfig ./tsconfig.json", + "check:watch": "svelte-kit sync && svelte-check --tsconfig ./tsconfig.json --watch" + }, + "devDependencies": { + "@sveltejs/adapter-auto": 
"^7.0.0", + "@sveltejs/adapter-node": "^5.5.3", + "@sveltejs/kit": "^2.50.2", + "@sveltejs/vite-plugin-svelte": "^6.2.4", + "svelte": "^5.49.2", + "svelte-check": "^4.3.6", + "typescript": "^5.9.3", + "vite": "^7.3.1" + } +} From 25a39fd54e3da34ced7b8ae9c0b9114dd5cb7aa6 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sat, 13 Jun 2026 15:38:42 +0000 Subject: [PATCH 11/36] k8s-portal: wire private-ghcr pull (allowlist + imagePullSecrets) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit k8s-portal was the last in-cluster image build; it now builds on GHA and pushes ghcr.io/viktorbarzin/k8s-portal:latest, which is PRIVATE (infra repo default). To pull it: add k8s-portal to the sync-ghcr-credentials Kyverno allowlist (clones the ghcr-credentials Secret into the namespace) and reference that secret via imagePullSecrets on the deployment — same wiring as tripit/recruiter-responder. Completes the no-local-builds migration so nothing builds container images on the cluster anymore (ADR-0002). Co-Authored-By: Claude Fable 5 --- stacks/k8s-portal/modules/k8s-portal/main.tf | 7 +++++++ stacks/kyverno/modules/kyverno/ghcr-credentials.tf | 4 ++++ 2 files changed, 11 insertions(+) diff --git a/stacks/k8s-portal/modules/k8s-portal/main.tf b/stacks/k8s-portal/modules/k8s-portal/main.tf index 908fca49..e32fd519 100644 --- a/stacks/k8s-portal/modules/k8s-portal/main.tf +++ b/stacks/k8s-portal/modules/k8s-portal/main.tf @@ -75,6 +75,13 @@ resource "kubernetes_deployment" "k8s_portal" { } spec { + # GHCR pull secret: the ghcr-credentials Secret in this namespace is + # cloned in by the kyverno stack's sync-ghcr-credentials ClusterPolicy + # (allowlisted private-ghcr namespaces only — ADR-0002). Source of + # truth: stacks/kyverno/modules/kyverno/ghcr-credentials.tf. 
+ image_pull_secrets { + name = "ghcr-credentials" + } container { name = "portal" image = "ghcr.io/viktorbarzin/k8s-portal:latest" diff --git a/stacks/kyverno/modules/kyverno/ghcr-credentials.tf b/stacks/kyverno/modules/kyverno/ghcr-credentials.tf index 6af4220f..07a1df85 100644 --- a/stacks/kyverno/modules/kyverno/ghcr-credentials.tf +++ b/stacks/kyverno/modules/kyverno/ghcr-credentials.tf @@ -27,6 +27,10 @@ locals { # openclaw's install-recruiter-plugin init container pulls the PRIVATE # ghcr.io/viktorbarzin/recruiter-responder:latest image (infra#27). "openclaw", + # k8s-portal: last in-cluster image build, migrated to GHA→ghcr (ADR-0002, + # "no local builds"). ghcr.io/viktorbarzin/k8s-portal:latest is PRIVATE + # (infra repo default); the deployment references the cloned secret. + "k8s-portal", ] } From 72982683bc76f6e9ff81331010eb5a2f1537eeff Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sat, 13 Jun 2026 16:10:56 +0000 Subject: [PATCH 12/36] docs(CLAUDE.md): k8s-portal now GHA->ghcr, not a Woodpecker build k8s-portal was the last in-cluster image builder. Its .woodpecker/k8s-portal.yml was deleted; it now builds on GHA (build-k8s-portal.yml) -> PRIVATE ghcr, pulled via the Kyverno ghcr-credentials allowlist and deployed by Keel. Fix the CI/CD section: drop k8s-portal from the Woodpecker-pipelines list (stale), move it from 'already on GHA' to the infra-owned private-ghcr images, and add it to the PRIVATE ghcr allowlist roster. Completes the no-local-builds migration. 
Co-Authored-By: Claude Fable 5 --- .claude/CLAUDE.md | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index 37ab99f3..1a81118b 100755 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -129,14 +129,14 @@ beadboard, nextcloud-todos, claude-agent-service, **claude-memory-mcp** (GHA → ghcr, NOT DockerHub), kms-website, Freedify, instagram-poster, payslip-ingest, broker-sync (image `wealthfolio-sync`), fire-planner, recruiter-responder, x402-gateway — plus tripit. Earlier public-repo apps already on GHA (Website, -k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify, +apple-health-data, audiblez-web, plotting-book, insta2spotify, audiobook-search, council-complaints) now also land on ghcr. - **PUBLIC ghcr packages:** beadboard, nextcloud-todos, claude-agent-service, claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, chrome-service-novnc, android-emulator. - **PRIVATE ghcr:** f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, - infra-ci. Pulled via the Kyverno-synced `ghcr-credentials` allowlist + infra-ci, k8s-portal. Pulled via the Kyverno-synced `ghcr-credentials` allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`; NOT cluster-wide; cred = Vault `secret/viktor/ghcr_pull_token`, an alias of the admin `github_pat` — GitHub has no token-mint API, swap the alias value if a scoped token is ever @@ -147,9 +147,11 @@ repo's own `.github/workflows/` (added to the GitHub lineage via PR; the github↔forgejo divergence was deliberately NOT reconciled): `build-chrome-service-novnc.yml` + `build-android-emulator.yml` → public ghcr; `build-cli.yml` → DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli`; -`build-infra-ci.yml` → `ghcr.io/viktorbarzin/infra-ci`. 
**infra-ci** is the image -the `.woodpecker/default.yml` apply step + `drift-detection.yml` run in (proven -by pipelines 165/166). chatterbox-tts is already built by tripit's GHA → ghcr. +`build-infra-ci.yml` → `ghcr.io/viktorbarzin/infra-ci`; `build-k8s-portal.yml` → +PRIVATE `ghcr.io/viktorbarzin/k8s-portal` (Keel-deployed; the LAST in-cluster +Woodpecker build, migrated 2026-06-13 — completes "no local builds"). **infra-ci** +is the image the `.woodpecker/default.yml` apply step + `drift-detection.yml` run +in (proven by pipelines 165/166). chatterbox-tts is already built by tripit's GHA → ghcr. The Woodpecker `build-ci-image.yml` + `build-cli.yml` pipelines were REMOVED; infra-ci break-glass is a manual `.woodpecker/breakglass-infra-ci.yml` (ghcr pull-and-save to the registry VM). @@ -162,9 +164,11 @@ stays DRY_RUN. Pull-through caches on `10.0.20.10` are unchanged. Runbook: **Woodpecker now runs only:** per-app `deploy.yml` (manual, `kubectl set image`), `default.yml` (terragrunt apply), `renew-tls.yml` (certbot), maintenance crons (drift-detection, provision-user, registry-config-sync, -pve-nfs-exports-sync, issue-automation, postmortem-todos, k8s-portal), and the +pve-nfs-exports-sync, issue-automation, postmortem-todos), and the manual `breakglass-infra-ci.yml`. **No build/test pipeline on any repo — do not -(re)introduce one.** +(re)introduce one.** (`.woodpecker/k8s-portal.yml`, the last in-cluster image +build, was removed 2026-06-13 — k8s-portal now builds on GHA → ghcr, see +Infra-owned images above.) 
**Decommissioned (issue #31):** travel_blog (stack destroyed + dir removed), 6 dead builders' pipelines (terminal-lobby, webhook-handler, hmrc-sync, From a6381b8cf85e43161cb700e4b88325c23772337f Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sat, 13 Jun 2026 17:16:47 +0000 Subject: [PATCH 13/36] forgejo: custom 8Gi ResourceQuota (was pegged at the 4Gi tier cap) Yesterday's Forgejo 3Gi->4Gi OOM fix pushed its tier-3-edge namespace quota (requests.memory=4Gi) to 100%, firing KubeQuotaAlmostFull + the healthcheck resourcequota check. Forgejo is the git + OCI-registry backbone and legitimately needs ~4Gi, so the edge tier's 4Gi ceiling is too tight. Opt the namespace out of the auto tier quota (resource-governance/custom-quota=true) and define a forgejo-specific ResourceQuota at requests.memory=8Gi, so the 4Gi pod sits at ~50% with headroom. Same opt-out pattern dbaas uses. Re-tiering was rejected: tier 1-cluster is also 4Gi, and 0-core (8Gi) would over-classify Forgejo's priority/eviction. Co-Authored-By: Claude Opus 4.8 --- stacks/forgejo/main.tf | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/stacks/forgejo/main.tf b/stacks/forgejo/main.tf index f9adb955..d271ffa0 100644 --- a/stacks/forgejo/main.tf +++ b/stacks/forgejo/main.tf @@ -11,6 +11,12 @@ resource "kubernetes_namespace" "forgejo" { "istio-injection" : "disabled" tier = local.tiers.edge "keel.sh/enrolled" = "true" + # Opt out of the auto-generated tier-3-edge ResourceQuota (caps + # requests.memory at 4Gi). Forgejo's own pod requests 4Gi (the + # git + OCI-registry backbone, Guaranteed QoS), which pegged that + # tier quota at 100% and fired KubeQuotaAlmostFull. The + # forgejo-specific quota below gives headroom. Same pattern as dbaas. 
+ "resource-governance/custom-quota" = "true" } } lifecycle { @@ -19,6 +25,26 @@ resource "kubernetes_namespace" "forgejo" { } } +# Custom ResourceQuota — replaces the tier-3-edge auto quota (opted out via the +# resource-governance/custom-quota label above). requests.memory is 8Gi so the +# 4Gi Forgejo pod sits at ~50% (clears KubeQuotaAlmostFull + the healthcheck +# resourcequota check) with room for a transient migration/sidecar pod. To +# raise Forgejo's memory limit past 4Gi later, bump requests.memory here too. +resource "kubernetes_resource_quota" "forgejo" { + metadata { + name = "forgejo-quota" + namespace = kubernetes_namespace.forgejo.metadata[0].name + } + spec { + hard = { + "requests.cpu" = "4" + "requests.memory" = "8Gi" + "limits.memory" = "32Gi" + pods = "30" + } + } +} + module "tls_secret" { source = "../../modules/kubernetes/setup_tls_secret" namespace = kubernetes_namespace.forgejo.metadata[0].name From e6699ed20bf9407ca9472f1d26a7d00ba943b953 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sat, 13 Jun 2026 20:54:14 +0000 Subject: [PATCH 14/36] uptime-kuma: retry Kuma login in monitor-sync jobs (intermittent socket.io timeout) The internal + external monitor-sync CronJobs intermittently failed with socketio.exceptions.TimeoutError on api.login(), firing JobFailed -> Slack noise (and leaving monitor sync stale). Kuma 2.3.2 itself is healthy (1/1, 30m CPU); its single Node event loop just briefly stalls under ~300 monitors so the socket.io login handshake occasionally exceeds the client timeout. Wrap connect+login in a 5-attempt / 15s-backoff retry (disconnecting the half-open client between tries) so a transient stall no longer fails the whole job. Applied to both sync scripts. 
Co-Authored-By: Claude Opus 4.8 --- .../uptime-kuma/modules/uptime-kuma/main.tf | 46 +++++++++++++++++-- 1 file changed, 42 insertions(+), 4 deletions(-) diff --git a/stacks/uptime-kuma/modules/uptime-kuma/main.tf b/stacks/uptime-kuma/modules/uptime-kuma/main.tf index faa7d2d3..0921bc24 100644 --- a/stacks/uptime-kuma/modules/uptime-kuma/main.tf +++ b/stacks/uptime-kuma/modules/uptime-kuma/main.tf @@ -503,8 +503,27 @@ except (urllib.error.URLError, OSError, KeyError, ValueError) as e: print(f"Loaded {len(targets)} external monitor targets (source={source})") -api = UptimeKumaApi(UPTIME_KUMA_URL, timeout=120, wait_events=0.2) -api.login("admin", UPTIME_KUMA_PASS) +api = None +for _login_try in range(1, 6): + try: + api = UptimeKumaApi(UPTIME_KUMA_URL, timeout=120, wait_events=0.2) + api.login("admin", UPTIME_KUMA_PASS) + break + except Exception as _login_err: + # kuma 2.x's single Node event loop intermittently stalls under its + # ~300 monitors, so the socket.io login handshake times out. Retry a + # few times across a ~60s window to ride out the stall instead of + # failing the whole sync job (which fired JobFailed -> Slack noise). + print(f"WARN: Kuma login attempt {_login_try}/5 failed: {_login_err!r}") + if api is not None: + try: + api.disconnect() + except Exception: + pass + api = None + if _login_try == 5: + raise + time.sleep(15) monitors = api.get_monitors() existing_external = {} @@ -818,8 +837,27 @@ UPTIME_KUMA_PASS = os.environ["UPTIME_KUMA_PASSWORD"] with open("/config/targets.json") as f: targets = json.load(f) -api = UptimeKumaApi(UPTIME_KUMA_URL, timeout=120, wait_events=0.2) -api.login("admin", UPTIME_KUMA_PASS) +api = None +for _login_try in range(1, 6): + try: + api = UptimeKumaApi(UPTIME_KUMA_URL, timeout=120, wait_events=0.2) + api.login("admin", UPTIME_KUMA_PASS) + break + except Exception as _login_err: + # kuma 2.x's single Node event loop intermittently stalls under its + # ~300 monitors, so the socket.io login handshake times out. 
Retry a + # few times across a ~60s window to ride out the stall instead of + # failing the whole sync job (which fired JobFailed -> Slack noise). + print(f"WARN: Kuma login attempt {_login_try}/5 failed: {_login_err!r}") + if api is not None: + try: + api.disconnect() + except Exception: + pass + api = None + if _login_try == 5: + raise + time.sleep(15) existing = {m["name"]: m for m in api.get_monitors()} From 05bec26d09afe017fbc448e3e0c22f7e0ed7562f Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sun, 14 Jun 2026 04:01:00 +0000 Subject: [PATCH 15/36] health: internal test-access ingress + DEV_AUTH_EMAIL (ADR-0008) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add health-test.viktorbarzin.lan (auth=none, allow_local_access_only, anti-AI off) pointing at the same health deployment, plus a DEV_AUTH_EMAIL=vbarzin@gmail.com env on the container. Lets automated E2E / Playwright / manual screenshots reach the live app without the Authentik SSO redirect, for testing — while the public health.viktorbarzin.me ingress stays auth=required (forward-auth fails closed, so the public path always carries the real X-authentik-email header and never hits the DEV_AUTH_EMAIL fallback). LAN-only, no public exposure. Decision recorded in health repo ADR-0008. Co-Authored-By: Claude Opus 4.8 --- stacks/health/main.tf | 35 ++++++++++++++++++++++++++++++++++- 1 file changed, 34 insertions(+), 1 deletion(-) diff --git a/stacks/health/main.tf b/stacks/health/main.tf index 979b2dd0..5b9ae090 100644 --- a/stacks/health/main.tf +++ b/stacks/health/main.tf @@ -9,7 +9,7 @@ resource "kubernetes_namespace" "health" { metadata { name = "health" labels = { - tier = local.tiers.aux + tier = local.tiers.aux "keel.sh/enrolled" = "true" } } @@ -128,6 +128,15 @@ resource "kubernetes_deployment" "health" { name = "COOKIE_SECURE" value = "true" } + env { + # ADR-0008 (health repo): identity for the internal LAN test host. 
+ # Only reached when no X-authentik-email header is present — i.e. via + # the auth="none" test ingress below. The public host's forward-auth + # fails closed, so requests arriving there always carry the real + # header and never fall back to this value. + name = "DEV_AUTH_EMAIL" + value = "vbarzin@gmail.com" + } volume_mount { name = "uploads" @@ -207,6 +216,30 @@ module "ingress" { } } +# https://health-test.viktorbarzin.lan — internal LAN-only test host for +# automated/E2E testing + manual screenshots without the Authentik SSO dance +# (ADR-0008). Same `health` deployment; acts as DEV_AUTH_EMAIL=vbarzin@gmail.com. +module "ingress_test" { + source = "../../modules/kubernetes/ingress_factory" + # auth = "none": LAN-only (allow_local_access_only) test host — no public + # exposure; the public health.viktorbarzin.me ingress above stays + # auth="required". No user data gate here by design — it serves the real app + # as DEV_AUTH_EMAIL since no X-authentik-email is injected (ADR-0008). 
+ auth = "none" + namespace = kubernetes_namespace.health.metadata[0].name + name = "health-test" + root_domain = "viktorbarzin.lan" + service_name = kubernetes_service.health.metadata[0].name + tls_secret_name = var.tls_secret_name + allow_local_access_only = true + ssl_redirect = false + max_body_size = "100m" + anti_ai_scraping = false + extra_annotations = { + "gethomepage.dev/enabled" = "false" + } +} + resource "kubernetes_manifest" "external_secret_db" { manifest = { apiVersion = "external-secrets.io/v1beta1" From 6dc77f46128474fe141a178c2f0348e2d00318cc Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sun, 14 Jun 2026 09:11:22 +0000 Subject: [PATCH 16/36] uptime-kuma: add CONTEXT.md + ADR-0001 (intentionally lean; sizing/placement review) Documents the 2026-06-13 right-sizing review: Kuma is already lean (~1 check/s, 227 monitors mostly at 300s, 77MB on shared MySQL, 30d retention); the 'scraping too much' concern traced to a fixed socket.io login-timeout incident, not load. Records the deliberate decisions (keep per-service [External] monitors over canaries; keep datastore on shared mysql.dbaas) with rejected alternatives + rationale, plus the known internal-sync no-prune gap (stale Goldilocks monitor cleaned up by hand). Co-Authored-By: Claude Opus 4.8 --- stacks/uptime-kuma/CONTEXT.md | 29 ++++++++++++ .../0001-uptime-kuma-sizing-and-placement.md | 45 +++++++++++++++++++ 2 files changed, 74 insertions(+) create mode 100644 stacks/uptime-kuma/CONTEXT.md create mode 100644 stacks/uptime-kuma/docs/adr/0001-uptime-kuma-sizing-and-placement.md diff --git a/stacks/uptime-kuma/CONTEXT.md b/stacks/uptime-kuma/CONTEXT.md new file mode 100644 index 00000000..e8d2c981 --- /dev/null +++ b/stacks/uptime-kuma/CONTEXT.md @@ -0,0 +1,29 @@ +# Uptime Kuma — Context + +Glossary for the uptime-kuma monitoring context. Terms only — no implementation +detail. Decisions live in `docs/adr/`. 
+ +## Glossary + +**Active check (poll)** — Uptime Kuma actively probes a target on an interval +(HTTP / TCP / ping / DB). This is *polling*, not "scraping." Prometheus *scrapes* +exporters; Kuma *polls* targets. (Note: Prometheus does **not** scrape Kuma — a +separate monitoring lane.) + +**Monitor** — one configured target plus its check definition. + +**Internal monitor** — probes a service on its in-cluster address +(`*.svc.cluster.local`). Answers "is the service itself healthy?" + +**`[External]` monitor** — probes a service via its full public path +(DNS → Cloudflare → cloudflared tunnel → Traefik). Answers "is the service +reachable the way users reach it?" Maintained one-per-externally-reachable-service +by deliberate choice (see ADR-0001). + +**Heartbeat** — one recorded check result (up/down + latency), persisted to the +datastore. + +**External-access divergence** — the condition where a service is healthy +*internally* but its `[External]` path is down — i.e. the shared +Cloudflare/tunnel/Traefik path is broken while the service itself is fine. +Surfaced by the `ExternalAccessDivergence` alert. diff --git a/stacks/uptime-kuma/docs/adr/0001-uptime-kuma-sizing-and-placement.md b/stacks/uptime-kuma/docs/adr/0001-uptime-kuma-sizing-and-placement.md new file mode 100644 index 00000000..80db84ac --- /dev/null +++ b/stacks/uptime-kuma/docs/adr/0001-uptime-kuma-sizing-and-placement.md @@ -0,0 +1,45 @@ +# ADR-0001: Uptime Kuma is intentionally lean — sizing & placement + +## Status +Accepted (2026-06-13) + +## Context +A review was prompted by a suspicion that Kuma was "scraping too much / causing +unnecessary traffic," itself triggered by a socket.io login-timeout incident on +the monitor-sync CronJobs. Measured state at review time: + +- **227 active monitors**; 209 of them at 300s intervals; **~1 check/sec** aggregate. +- Datastore: the **shared `mysql.dbaas`** (MariaDB), **~77 MB**, ~1 heartbeat + write/sec, 30-day retention. 
+- **122 `[External]` monitors** (full public path) + ~105 internal. + +The data did **not** support a load problem — Kuma is already lean. The +login-timeout incident was a Kuma 2.x socket.io quirk (kuma's single Node event +loop briefly stalling), fixed separately by wrapping login in a retry — not a +load issue. + +## Decisions +1. **Keep Kuma as-is; do not reflexively cut monitors or intervals.** Poll rate + (~1/s) and DB footprint (77 MB) are modest. +2. **`[External]` monitors stay per-service** (one per externally-reachable + service), **not** a small canary set. Rejected cutting to ~6-10 canaries: + although the Cloudflare → tunnel → Traefik path is shared infra that fails as a + unit, per-service external probes also catch *single-service* external + misconfig (one service's DNS / auth carve-out / route), which canaries miss. + The ~35k Cloudflare requests/day this generates is accepted for that coverage. +3. **Datastore stays on the shared `mysql.dbaas`.** Rejected moving to + self-contained SQLite or a dedicated DB. The coupling — Kuma depends on the + single-instance MySQL it also helps monitor, including during that MySQL's + 8.4.9 wipe-maintenance (bead code-963q) — is acknowledged but accepted as + low-impact for now. + +## Consequences +- All three decisions are **cheap to reverse**; revisit if measured load on + `mysql.dbaas` or Cloudflare ever becomes a real (not gut-feel) problem. This + ADR exists mainly so that review isn't re-run from scratch. +- **Known gap:** the *internal* monitor-sync creates/updates monitors but does + **not** prune orphans (the external sync does). Internal monitors for deleted + services linger and need periodic manual cleanup — e.g. the stale + "Goldilocks (VPA)" monitor (target removed with VPA on 2026-06-12) was deleted + by hand on 2026-06-13. A *scoped* internal-prune (only deleting monitors the + sync owns, never hand-made ones) is a possible future improvement. 
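Reviewer note on the ADR-0001 figures above: the "~35k Cloudflare requests/day" estimate can be sanity-checked with a small sketch. It assumes every one of the 122 `[External]` monitors polls at 300s, which the ADR does not state outright (it only says 209 of the 227 monitors use 300s intervals). The function names are illustrative and are not part of the stack.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_day(monitors: int, interval_s: int) -> int:
    """Polls per day for `monitors` targets, each checked every `interval_s` seconds."""
    return monitors * SECONDS_PER_DAY // interval_s


def checks_per_second(monitors: int, interval_s: int) -> float:
    """Aggregate poll rate for `monitors` targets sharing one interval."""
    return monitors / interval_s


# Assumption: all 122 [External] monitors use the 300s interval.
external_daily = requests_per_day(122, 300)

# The 209 monitors the ADR says run at 300s; the other 18 are not itemised.
rate_300s = checks_per_second(209, 300)

print(external_daily, round(rate_300s, 2))
```

Under that assumption the external probes come to roughly 35k requests per day, in line with the ADR. The 300s monitors alone account for about 0.7 checks/s, so the remaining 18 monitors (whose intervals the ADR does not list) would have to supply the rest of the stated ~1 check/s aggregate.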
From 086ff859114b3ca6716b40a894deb0c56e1b579a Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sun, 14 Jun 2026 13:01:14 +0000 Subject: [PATCH 17/36] health: dedicated 100/1000 rate limit for the redesigned SPA MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Viktor hit 429s browsing the redesigned health app. The default shared limiter is 10 req/s / burst 50, but each page load is the shell (JS chunks + two self-hosted Geist woff2) plus a 5-8 call API burst, so fast tab-to-tab navigation from one client IP overruns burst 50 — Traefik 429s the tail and the affected cards/pages render empty. Give health its own limiter (average 100, burst 1000) and skip the default, exactly as tripit/immich/actualbudget/ha-sofia already do for the same parallel-burst pattern. Attached via the ingress_factory escape hatch (skip_default_rate_limit + extra_middlewares). Co-Authored-By: Claude Opus 4.8 --- stacks/health/main.tf | 5 ++++ stacks/traefik/modules/traefik/middleware.tf | 25 ++++++++++++++++++++ 2 files changed, 30 insertions(+) diff --git a/stacks/health/main.tf b/stacks/health/main.tf index 5b9ae090..df3a68fe 100644 --- a/stacks/health/main.tf +++ b/stacks/health/main.tf @@ -206,6 +206,11 @@ module "ingress" { name = "health" tls_secret_name = var.tls_secret_name max_body_size = "100m" + # The redesigned SPA bursts well past the default 10/50 limiter on each page + # load (shell + fonts + a 5-8 call API burst). Swap the shared limiter for a + # health-specific one (100/1000), mirroring tripit/immich/actualbudget. 
+ skip_default_rate_limit = true + extra_middlewares = ["health-rate-limit@kubernetescrd"] extra_annotations = { "gethomepage.dev/enabled" = "true" "gethomepage.dev/name" = "Health" diff --git a/stacks/traefik/modules/traefik/middleware.tf b/stacks/traefik/modules/traefik/middleware.tf index d2749ce0..3d26ecd2 100644 --- a/stacks/traefik/modules/traefik/middleware.tf +++ b/stacks/traefik/modules/traefik/middleware.tf @@ -344,6 +344,31 @@ resource "kubernetes_manifest" "middleware_tripit_rate_limit" { depends_on = [helm_release.traefik] } +# Health-specific rate limit. The redesigned, data-dense SPA loads the shell +# (JS chunks + two self-hosted Geist woff2) plus a 5-8 call API burst per page, +# and fast tab-to-tab navigation from one client IP blows past the default +# 10/50 limiter — 429ing the tail so cards/pages render empty (fifth instance +# of the burst pattern, after ha-sofia, ActualBudget, noVNC and tripit). Burst +# absorbs a couple of full page loads back-to-back. +resource "kubernetes_manifest" "middleware_health_rate_limit" { + manifest = { + apiVersion = "traefik.io/v1alpha1" + kind = "Middleware" + metadata = { + name = "health-rate-limit" + namespace = kubernetes_namespace.traefik.metadata[0].name + } + spec = { + rateLimit = { + average = 100 + burst = 1000 + } + } + } + + depends_on = [helm_release.traefik] +} + # Compress responses to clients at the entrypoint level (outermost). # Applied at websecure entrypoint so all responses get compressed. # Uses includedContentTypes (whitelist) instead of excludedContentTypes: From 2df6ebf305a3ce3601054e734725ae2e8fb40ee1 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sun, 14 Jun 2026 17:43:08 +0000 Subject: [PATCH 18/36] health: fix middleware ref namespace prefix (restore site from 404) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit My previous commit referenced the new limiter as `health-rate-limit@kubernetescrd`, omitting the namespace prefix. 
Traefik CRD middleware refs are `<namespace>-<name>@kubernetescrd`, and the Middleware lives in the `traefik` ns, so the router couldn't resolve it — Traefik failed the whole health.viktorbarzin.me router and returned 404 on every path (the app + pod were healthy throughout; verified via port-forward). Correct it to `traefik-health-rate-limit@kubernetescrd`, matching the working traefik-tripit-rate-limit / traefik-actualbudget-rate-limit references. Co-Authored-By: Claude Opus 4.8 --- stacks/health/main.tf | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/stacks/health/main.tf b/stacks/health/main.tf index df3a68fe..8d21d33b 100644 --- a/stacks/health/main.tf +++ b/stacks/health/main.tf @@ -209,8 +209,12 @@ module "ingress" { # The redesigned SPA bursts well past the default 10/50 limiter on each page # load (shell + fonts + a 5-8 call API burst). Swap the shared limiter for a # health-specific one (100/1000), mirroring tripit/immich/actualbudget. + # The ref MUST carry the middleware's namespace prefix: the CRD lives in the + # `traefik` ns, so it's `traefik-health-rate-limit@kubernetescrd` (same form as + # traefik-tripit-rate-limit). Without the prefix Traefik can't resolve it and + # 404s the whole router. 
skip_default_rate_limit = true - extra_middlewares = ["health-rate-limit@kubernetescrd"] + extra_middlewares = ["traefik-health-rate-limit@kubernetescrd"] extra_annotations = { "gethomepage.dev/enabled" = "true" "gethomepage.dev/name" = "Health" From fe1f8d62e74cec9d0ec8257e3c2cd17238c5df02 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sun, 14 Jun 2026 17:44:10 +0000 Subject: [PATCH 19/36] tripit: re-apply tripit stack to land CITY_IMAGE_PROVIDER=wikipedia MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The commit that enabled real city cover photos (a69847a0, CITY_IMAGE_PROVIDER=wikipedia, #47) was committed to master but its CI run skipped the tripit stack apply (changed-stack diff race — same class as the prior "re-apply after pipeline race" fixes). The env never landed in-cluster, so the provider stayed on its fake 1x1-PNG default and every trip/stay cover rendered blank/placeholder in prod. This comment touch forces CI to re-apply the tripit stack; terraform then reconciles the drift (desired HCL already has the env) so the deployment picks up CITY_IMAGE_PROVIDER=wikipedia. Co-Authored-By: Claude Opus 4.8 --- stacks/tripit/main.tf | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/stacks/tripit/main.tf b/stacks/tripit/main.tf index cd15012b..b26beb94 100644 --- a/stacks/tripit/main.tf +++ b/stacks/tripit/main.tf @@ -125,6 +125,11 @@ locals { # (older images crash-loop on the unknown enum) — landed after that # image rolled out, same hold-order as FARE/CALENDAR/RESEARCH above. CITY_IMAGE_PROVIDER = "wikipedia" + # Re-applied 2026-06-14: a69847a0 (the commit that added this) was never + # terraform-applied — its CI run skipped the tripit stack (changed-stack + # diff race), so the env never landed in-cluster and the provider fell back + # to the fake 1x1-PNG, leaving every trip/stay cover blank. This touch forces + # the tripit stack to re-apply and reconcile the drift. 
# Tour-guide content pipeline (tripit#24/#25): these three default to `fake` # in tripit's config, which is what shipped dark on 2026-06-08 — prod only # ever showed the placeholder "Sight 1". Real providers: Wikipedia GeoSearch From 0bfa6f0774ec3d975cfef833658f8e25a386a1d0 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sun, 14 Jun 2026 19:28:25 +0000 Subject: [PATCH 20/36] feat(anisette): self-hosted Apple anisette server for SideStore (infra #40) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Deploy a small stateless anisette-data server so the TripIt iOS Shell can be sideloaded with SideStore using a free Apple ID, without brokering the Apple-ID auth dance through a public third-party anisette server (which would see every login). SideStore points at a stable internal endpoint we control. - Image: Dadoum/anisette-v3-server, the de-facto standard anisette-v3 server for SideStore/AltStore. Upstream ships only a mutable :latest (no GitHub releases / semver / sha tags), so pinned by manifest digest instead of a tag per the "never :latest" rule. Pulled from DockerHub via the registry-VM pull-through cache like echo/cyberchef. Diun watches :latest (notify-only) so a new upstream build prompts a digest re-pin. - Stateless: emptyDir backs the provisioning-library cache dir (regenerable download; upstream issue #23 means it doesn't preserve client auth across restarts anyway) — no PVC, no Vault secret. - Internal-only endpoint http://anisette.viktorbarzin.lan (auth=none, allow_local_access_only, ssl_redirect off) — SideStore is a native client that can't do the Authentik cookie dance, same reasoning as android-emulator's adb. The .lan CNAME is auto-created by technitium-ingress-dns-sync; never publicly exposed. Mirrors the echo/networking-toolbox/android-emulator stack pattern. Service catalog updated. 
Co-Authored-By: Claude Opus 4.8 --- .claude/reference/service-catalog.md | 1 + stacks/anisette/main.tf | 171 +++++++++++++++++++++++++++ stacks/anisette/secrets | 1 + stacks/anisette/terragrunt.hcl | 8 ++ 4 files changed, 181 insertions(+) create mode 100644 stacks/anisette/main.tf create mode 120000 stacks/anisette/secrets create mode 100644 stacks/anisette/terragrunt.hcl diff --git a/.claude/reference/service-catalog.md b/.claude/reference/service-catalog.md index ec78beac..242d1189 100644 --- a/.claude/reference/service-catalog.md +++ b/.claude/reference/service-catalog.md @@ -42,6 +42,7 @@ | webhook_handler | Webhook processing | webhook_handler | | tuya-bridge | Smart home bridge | tuya-bridge | | android-emulator | Shared Android 16 test emulator (adb 10.0.20.200:5555, noVNC android-emulator.viktorbarzin.lan) | android-emulator | +| anisette | Self-hosted Apple anisette-data server (Dadoum/anisette-v3-server, digest-pinned) for sideloading the TripIt iOS Shell via SideStore; internal-only http://anisette.viktorbarzin.lan, auth=none, LAN-only, stateless | anisette | | dawarich | Location history | dawarich | | owntracks | Location tracking | owntracks | | nextcloud | File sync/share | nextcloud | diff --git a/stacks/anisette/main.tf b/stacks/anisette/main.tf new file mode 100644 index 00000000..a8fbb8ec --- /dev/null +++ b/stacks/anisette/main.tf @@ -0,0 +1,171 @@ +# anisette — self-hosted Apple anisette-data server for SideStore/AltStore. +# +# Purpose (infra issue #40): the TripIt iOS Shell is sideloaded with SideStore +# using a free Apple ID. SideStore needs an "anisette" server to broker the +# Apple-ID auth dance; the public community anisette servers see every login, +# so we run our own. Stateless HTTP service on a stable INTERNAL endpoint +# (anisette.viktorbarzin.lan) that SideStore points at. +# +# Image: Dadoum/anisette-v3-server — the de-facto standard anisette-v3 server +# for SideStore/AltStore (the same project SideStore's own docs point at). 
+# Upstream publishes ONLY a mutable :latest tag (no GitHub releases, no semver, +# no date/sha tags — verified 2026-06-14), so we pin by MANIFEST DIGEST instead +# (immutable, honours the "never :latest" rule). DockerHub is pulled +# transparently via the registry-VM pull-through cache, same as echo/cyberchef. +# To bump: `docker buildx imagetools inspect dadoum/anisette-v3-server:latest`, +# then replace the digest below. +# +# Stateless: the container caches Apple provisioning libraries under +# /home/Alcoholic/.config/anisette-v3/lib (a regenerable download — re-fetched +# if absent — and per upstream issue #23 it does NOT preserve client auth across +# restarts anyway). So an emptyDir is the honest fit: keeps that path writable +# without taking on a backup-pipeline obligation. No PVC, no Vault secret. + +variable "tls_secret_name" { + type = string + sensitive = true +} + +resource "kubernetes_namespace" "anisette" { + metadata { + name = "anisette" + labels = { + "istio-injection" : "disabled" + tier = local.tiers.aux + } + } + lifecycle { + # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace + ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]] + } +} + +module "tls_secret" { + source = "../../modules/kubernetes/setup_tls_secret" + namespace = kubernetes_namespace.anisette.metadata[0].name + tls_secret_name = var.tls_secret_name +} + +resource "kubernetes_deployment" "anisette" { + metadata { + name = "anisette" + namespace = kubernetes_namespace.anisette.metadata[0].name + labels = { + app = "anisette" + tier = local.tiers.aux + } + } + spec { + replicas = 1 + selector { + match_labels = { + app = "anisette" + } + } + template { + metadata { + labels = { + app = "anisette" + } + annotations = { + # Diun notify-only watch. Upstream tags only :latest, so watch the + # digest of :latest rather than a semver pattern. 
+ "diun.enable" = "true" + "diun.watch_repo" = "false" + "diun.include_tags" = "^latest$" + } + } + spec { + container { + # Pinned by digest — upstream ships only a mutable :latest (no tags). + image = "dadoum/anisette-v3-server@sha256:1e20384985d3c49965f444bef39d627768dacc39ea0dca91f2a535edb7591ba3" + name = "anisette" + port { + name = "http" + container_port = 6969 + } + # The image runs as the non-root user "Alcoholic" and writes its + # provisioning-library cache here; back it with an emptyDir so the + # path is writable (stateless — wiped on restart, re-downloaded). + volume_mount { + name = "provisioning-cache" + mount_path = "/home/Alcoholic/.config/anisette-v3/lib" + } + resources { + requests = { + cpu = "10m" + memory = "128Mi" + } + limits = { + memory = "128Mi" + } + } + readiness_probe { + http_get { + path = "/" + port = 6969 + } + period_seconds = 15 + initial_delay_seconds = 5 + } + liveness_probe { + http_get { + path = "/" + port = 6969 + } + period_seconds = 30 + failure_threshold = 6 + } + } + volume { + name = "provisioning-cache" + empty_dir {} + } + } + } + } + lifecycle { + ignore_changes = [ + spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1 + ] + } +} + +resource "kubernetes_service" "anisette" { + metadata { + name = "anisette" + namespace = kubernetes_namespace.anisette.metadata[0].name + labels = { + "app" = "anisette" + } + } + spec { + selector = { + app = "anisette" + } + port { + name = "http" + port = "80" + target_port = "6969" + } + } +} + +module "ingress" { + source = "../../modules/kubernetes/ingress_factory" + # auth = "none": SideStore is a native iOS client — it can't replay the + # Authentik forward-auth cookie dance, so Authentik would break it (same + # reasoning as android-emulator's adb). Internal-only: anisette.viktorbarzin.lan, + # allow_local_access_only locks it to the LAN, and it brokers no user data of + # ours (it just relays Apple-ID anisette data). Never publicly exposed. 
+ auth = "none" + namespace = kubernetes_namespace.anisette.metadata[0].name + name = "anisette" + root_domain = "viktorbarzin.lan" + tls_secret_name = var.tls_secret_name + allow_local_access_only = true + ssl_redirect = false + extra_annotations = { + "gethomepage.dev/enabled" = "false" + } +} diff --git a/stacks/anisette/secrets b/stacks/anisette/secrets new file mode 120000 index 00000000..ca54a7cf --- /dev/null +++ b/stacks/anisette/secrets @@ -0,0 +1 @@ +../../secrets \ No newline at end of file diff --git a/stacks/anisette/terragrunt.hcl b/stacks/anisette/terragrunt.hcl new file mode 100644 index 00000000..0d1c8e53 --- /dev/null +++ b/stacks/anisette/terragrunt.hcl @@ -0,0 +1,8 @@ +include "root" { + path = find_in_parent_folders() +} + +dependency "platform" { + config_path = "../platform" + skip_outputs = true +} From 96addf65b40174715416ed18e41e44a2bdb97894 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sun, 14 Jun 2026 19:47:05 +0000 Subject: [PATCH 21/36] fix(anisette): docker.io/ image prefix to pass Kyverno require-trusted-registries MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit First apply was denied at admission — a bare dadoum/anisette-v3-server@sha256 ref isn't in the trusted-registries allowlist (only enumerated DockerHub user-repo prefixes are). docker.io/* IS allowlisted, so use the explicit registry prefix; still pulls via the 10.0.20.10 pull-through cache. Co-Authored-By: Claude Opus 4.8 --- stacks/anisette/main.tf | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/stacks/anisette/main.tf b/stacks/anisette/main.tf index a8fbb8ec..44c0f3a5 100644 --- a/stacks/anisette/main.tf +++ b/stacks/anisette/main.tf @@ -78,7 +78,15 @@ resource "kubernetes_deployment" "anisette" { spec { container { # Pinned by digest — upstream ships only a mutable :latest (no tags). 
- image = "dadoum/anisette-v3-server@sha256:1e20384985d3c49965f444bef39d627768dacc39ea0dca91f2a535edb7591ba3" + # The `docker.io/` prefix is REQUIRED, not cosmetic: the Kyverno + # require-trusted-registries policy allowlists `docker.io/*` but NOT a + # bare `dadoum/*` prefix (only enumerated DockerHub user repos like + # mendhak/*, mpepping/* are listed in + # stacks/kyverno/modules/kyverno/security-policies.tf). A bare + # `dadoum/anisette-v3-server@...` is denied at admission; the explicit + # docker.io/ registry matches the allowlist and still pulls via the + # 10.0.20.10 pull-through cache. + image = "docker.io/dadoum/anisette-v3-server@sha256:1e20384985d3c49965f444bef39d627768dacc39ea0dca91f2a535edb7591ba3" name = "anisette" port { name = "http" From bc7b28244f93cd3a0c4fcea6abd8e609aeaabc5f Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sun, 14 Jun 2026 19:54:13 +0000 Subject: [PATCH 22/36] =?UTF-8?q?fix(anisette):=20raise=20memory=20limit?= =?UTF-8?q?=20to=20512Mi=20=E2=80=94=20128Mi=20OOMKilled=20at=20startup?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The pod CrashLooped with OOMKilled (exit 137): anisette downloads and initializes Apple's CoreADI provisioning library on startup, spiking past the 128Mi limit before it can bind :6969 (empty logs, liveness 'connection refused'). Bump request 256Mi / limit 512Mi; steady state is much lower. 
Co-Authored-By: Claude Opus 4.8 --- stacks/anisette/main.tf | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/stacks/anisette/main.tf b/stacks/anisette/main.tf index 44c0f3a5..260ebf3b 100644 --- a/stacks/anisette/main.tf +++ b/stacks/anisette/main.tf @@ -102,10 +102,14 @@ resource "kubernetes_deployment" "anisette" { resources { requests = { cpu = "10m" - memory = "128Mi" + memory = "256Mi" } limits = { - memory = "128Mi" + # anisette downloads + initializes Apple's CoreADI provisioning + # library at startup, which spikes past 128Mi → OOMKilled (exit + # 137) before it can bind :6969. 512Mi gives headroom; steady + # state is much lower. + memory = "512Mi" } } readiness_probe { From d8c60d7ab8ed4eeb2d58379d4b5a52b2199bb635 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sun, 14 Jun 2026 20:06:33 +0000 Subject: [PATCH 23/36] t3-afk: dedicated in-cluster T3 Code instance (AFK executor + cockpit) Slice #2 of claude-agent-service PRD #1 (AFK implementation pipeline). Dedicated in-cluster T3 Code instance the control plane dispatches issues into; runs the issue-implementer agent in a git worktree with a live cockpit. Applied + live 2026-06-14 (9 resources). Pilot-fast: stock docker.io/library/node:24 + install pinned t3@0.0.27 + Claude CLI at startup onto an SSD-NFS PVC. Authentik-gated ingress. issue-implementer behaviour ships as a user-level ~/.claude/CLAUDE.md (T3 hardcodes the system prompt; settingSources loads it) and forbids plan-mode/clarifying-questions so unattended threads don't stall. Keel-excluded (ADR 0003). wait_for_rollout=false (slow first start). Image fully-qualified for the Kyverno trusted-registries allowlist; container mem limit 4Gi (tier-aux LimitRange cap). 
Co-Authored-By: Claude Opus 4.8 --- .../t3-afk/files/issue-implementer-CLAUDE.md | 59 +++ stacks/t3-afk/main.tf | 348 ++++++++++++++++++ stacks/t3-afk/terragrunt.hcl | 18 + 3 files changed, 425 insertions(+) create mode 100644 stacks/t3-afk/files/issue-implementer-CLAUDE.md create mode 100644 stacks/t3-afk/main.tf create mode 100644 stacks/t3-afk/terragrunt.hcl diff --git a/stacks/t3-afk/files/issue-implementer-CLAUDE.md b/stacks/t3-afk/files/issue-implementer-CLAUDE.md new file mode 100644 index 00000000..995c701f --- /dev/null +++ b/stacks/t3-afk/files/issue-implementer-CLAUDE.md @@ -0,0 +1,59 @@ +# issue-implementer — autonomous AFK coding agent + +You are **issue-implementer**, an autonomous agent that implements ONE GitHub +issue end-to-end and lands it, with no human at the keyboard. This file is your +standing behaviour; the specific task arrives as your prompt. You run inside a +T3 Code thread in `full-access` mode (skip-permissions) — there is no one to +answer questions mid-run. + +## Autonomy — non-negotiable (you will hang otherwise) + +- **Never enter plan mode and never call `ExitPlanMode`.** It is intercepted and + will stall this thread forever. +- **Never ask clarifying questions / never call `AskUserQuestion`.** No human is + watching. Make the most reasonable assumption, state it in a commit/your final + message, and proceed. +- If you hit something you genuinely cannot resolve safely, **stop and write a + precise blocker report as your final message** (what you tried, what's + unresolved, what you'd need). Do not thrash. The orchestrator escalates it to a + human — that is the only "ask for help" channel you have. + +## What to do + +1. **Understand the task.** Your prompt contains the issue (number, what to + build, acceptance criteria). Read the issue's AGENT-BRIEF if present. +2. **Work in the prepared worktree.** You are already in a git worktree on a + branch off `master`. 
Read the repo's own `CLAUDE.md`, `CONTEXT.md`, and any + `docs/adr/` in the area you touch — use its domain vocabulary and respect its + decisions. +3. **Test-first (TDD).** Write a failing test that captures the desired + behaviour, make it pass, then refactor. Prefer property/parameterized tests. + Run the repo's actual test suite and get it green before you commit. Do not + test implementation details — test external behaviour. +4. **Commit.** Subject = what changed; body = why, paraphrasing the issue in + plain words. Include `Closes #` and the trailer + `Implemented-by: issue-implementer (AFK)`. Stage files by name — never + `git add -A`/`.`. Never skip hooks. +5. **Land it.** Push your branch to `master` (`git push origin HEAD:master`). If + the push is rejected non-fast-forward, fetch, merge `origin/master`, re-run + the tests, and push again. Pushing to `master` is the intended behaviour — + CI builds and deploys from there. +6. **Report.** Your final message is a concise summary: what you built, the + commit, and anything a reviewer should know. (CI/deploy watching and any + fix-forward/freeze handling are done by the control plane, not by you — once + you've pushed green code, your job is done.) + +## Guardrails (hard limits) + +- **Never force-push** to `master`. +- **Never delete PVCs/PVs**, drop database tables, or run destructive data ops. +- **Never edit Vault directly**, and never commit secrets. +- **Infrastructure changes go through Terraform/Terragrunt only** — never + `kubectl apply/edit/patch` as the final state. +- **Never use `[ci skip]`** — it hides the change from the audit feed. +- Stay within the issue's scope. Don't refactor adjacent code beyond what the + task needs. + +## Done means + +Tests green **and** pushed to `master`. Not "code written" — landed. 
diff --git a/stacks/t3-afk/main.tf b/stacks/t3-afk/main.tf new file mode 100644 index 00000000..22aedf0b --- /dev/null +++ b/stacks/t3-afk/main.tf @@ -0,0 +1,348 @@ +# ============================================================================= +# t3-afk — dedicated, in-cluster T3 Code instance: the EXECUTOR + COCKPIT for the +# AFK implementation pipeline (slice #2 of claude-agent-service PRD #1). +# +# claude-agent-service (control plane) dispatches issues INTO this T3 instance +# over its orchestration HTTP API; T3 runs the issue-implementer agent in a git +# worktree and shows every worker in its cockpit. See: +# claude-agent-service/docs/2026-06-14-afk-implementation-pipeline-design.md +# claude-agent-service/docs/adr/0003-t3-thin-executor-and-cockpit.md +# +# PILOT SHORTCUT (chosen 2026-06-14): no custom-built image. We run stock +# `node:24` (the full image ships git + python3/make/g++ for node-pty) and an +# init container installs PINNED npm packages (t3@0.0.27 + the Claude CLI) onto +# the SSD PVC, cached across restarts. Formalize a digest-pinned built image +# post-GO. T3 is version-pinned (npm) and NOT Keel-enrolled. +# ============================================================================= + +# No plan-time Vault reads — every secret flows through the ExternalSecret below +# (CLAUDE_CODE_OAUTH_TOKEN / GITHUB_TOKEN / FORGEJO_TOKEN), injected as env at +# runtime. Nothing here needs a secret value at plan time. + +# Wildcard TLS secret name — value comes from config.tfvars; consumed by the +# ingress factory (every stack that uses the factory declares this). +variable "tls_secret_name" {} + +locals { + namespace = "t3-afk" + # Stock node base — the FULL node:24 (not -slim) is buildpack-deps-based, so it + # ships git + build-essential (python3/make/g++) that node-pty + the agent need. + # Fully-qualified (docker.io/library/...) 
to satisfy the Kyverno + # require-trusted-registries allowlist via `docker.io/*` — bare `node*` is NOT + # on the bare-DockerHub-library list (alpine*/busybox*/python* are). + image = "docker.io/library/node:24" + # Pinned npm versions installed at startup (the reproducibility anchor for the + # pilot until a digest-pinned image exists). + t3_version = "0.0.27" + claude_cli_version = "latest" # @anthropic-ai/claude-code + labels = { + app = "t3-afk" + } +} + +# --- Namespace --- + +resource "kubernetes_namespace" "t3_afk" { + metadata { + name = local.namespace + labels = { + tier = local.tiers.aux + } + } +} + +# --- Secrets --- +# The Claude provider authenticates with CLAUDE_CODE_OAUTH_TOKEN (T3 passes the +# environment straight through to the embedded claude-agent-sdk + claude CLI). +# GITHUB_TOKEN / FORGEJO_TOKEN authenticate the agent's `git push` from worktrees +# (wired into ~/.gitconfig insteadOf rewrites in the container command). + +resource "kubernetes_manifest" "external_secret" { + manifest = { + apiVersion = "external-secrets.io/v1beta1" + kind = "ExternalSecret" + metadata = { + name = "t3-afk-secrets" + namespace = local.namespace + } + spec = { + refreshInterval = "15m" + secretStoreRef = { + name = "vault-kv" + kind = "ClusterSecretStore" + } + target = { name = "t3-afk-secrets" } + data = [ + { + secretKey = "CLAUDE_CODE_OAUTH_TOKEN" + remoteRef = { key = "claude-agent-service", property = "claude_oauth_token" } + }, + { + secretKey = "GITHUB_TOKEN" + remoteRef = { key = "viktor", property = "github_pat" } + }, + { + # Shared viktor-scoped admin PAT (also used by Woodpecker + the + # claude-agent pod). Lets the agent git push / open PRs on Forgejo. + secretKey = "FORGEJO_TOKEN" + remoteRef = { key = "ci/global", property = "forgejo_push_token" } + }, + ] + } + } + depends_on = [kubernetes_namespace.t3_afk] +} + +# issue-implementer behaviour. 
T3 hardcodes the claude_code system-prompt preset +# (no API override), but loads settingSources [user,project,local] — so the +# agent's standing instructions ride in the USER-level ~/.claude/CLAUDE.md, while +# each target repo's own CLAUDE.md provides project context. ADR 0003. +resource "kubernetes_config_map" "agent_claudemd" { + metadata { + name = "issue-implementer-claudemd" + namespace = kubernetes_namespace.t3_afk.metadata[0].name + } + data = { + "CLAUDE.md" = file("${path.module}/files/issue-implementer-CLAUDE.md") + } +} + +# --- Storage --- +# SSD-NFS (small-file friendly) for the T3 base dir: state.sqlite + the +# server-signing-key (losing it invalidates every issued bearer), per-thread git +# worktrees, the npm global install, and caches. ADR 0004. +module "data" { + source = "../../modules/kubernetes/nfs_volume" + name = "t3-afk-data" + namespace = kubernetes_namespace.t3_afk.metadata[0].name + nfs_server = "192.168.1.127" + nfs_path = "/srv/nfs-ssd/t3-afk-data" + storage = "30Gi" +} + +# --- Deployment --- + +resource "kubernetes_deployment" "t3_afk" { + # Slow first start (image pull + npm install init + ESO secret sync) can + # exceed the default rollout-wait timeout; verify pod readiness out-of-band. + wait_for_rollout = false + + metadata { + name = "t3-afk" + namespace = kubernetes_namespace.t3_afk.metadata[0].name + labels = local.labels + } + + spec { + replicas = 1 + # Single-writer state.sqlite — never run two pods against the same base dir. + strategy { + type = "Recreate" + } + + selector { + match_labels = local.labels + } + + template { + metadata { + labels = merge(local.labels, { + # Belt-and-braces: this namespace isn't Keel-enrolled, but pin the + # churny pre-1.0 T3 explicitly out of any auto-upgrade. ADR 0003. + "keel.sh/policy" = "never" + }) + } + + spec { + security_context { + run_as_user = 1000 # node + run_as_group = 1000 + fs_group = 1000 + } + + # NFS mounts land root-owned; make /data writable by uid 1000. 
+ init_container { + name = "fix-perms" + image = "busybox:1.37" + command = ["sh", "-c", "mkdir -p /data && chown -R 1000:1000 /data && chmod 0775 /data"] + security_context { + run_as_user = 0 + } + volume_mount { + name = "data" + mount_path = "/data" + } + resources { + requests = { memory = "32Mi" } + limits = { memory = "64Mi" } + } + } + + # Install pinned t3 + Claude CLI onto the PVC (cached; skipped if already + # present). Runs as uid 1000 so the install is owned by the runtime user. + init_container { + name = "install-t3" + image = local.image + command = ["bash", "-c", <<-EOF + set -e + export npm_config_cache=/data/npm-cache + export npm_config_prefix=/data/npm-global + mkdir -p /data/npm-global /data/npm-cache + if [ ! -x /data/npm-global/bin/t3 ]; then + echo "installing t3@${local.t3_version} + claude CLI ..." + npm install -g "t3@${local.t3_version}" "@anthropic-ai/claude-code@${local.claude_cli_version}" + else + echo "t3 already installed: $(/data/npm-global/bin/t3 --version 2>/dev/null || echo unknown)" + fi + EOF + ] + volume_mount { + name = "data" + mount_path = "/data" + } + resources { + requests = { cpu = "200m", memory = "512Mi" } + limits = { memory = "1Gi" } + } + } + + container { + name = "t3" + image = local.image + + # Configure git auth for the agent's pushes, then run T3 headless. + # $${VAR} escapes Terraform interpolation so the shell expands it; a bare $VAR needs no escape. + command = ["bash", "-c", <<-EOF + set -e + export PATH=/data/npm-global/bin:$PATH + export npm_config_cache=/data/npm-cache + + # git identity + token rewrites so the agent can push from worktrees. 
+ git config --global user.name "issue-implementer (AFK)" + git config --global user.email "afk-agent@viktorbarzin.me" + git config --global url."https://$${GITHUB_TOKEN}@github.com/".insteadOf "https://github.com/" + git config --global url."https://$${GITHUB_TOKEN}@github.com/".insteadOf "git@github.com:" + if [ -n "$${FORGEJO_TOKEN}" ]; then + git config --global url."https://$${FORGEJO_TOKEN}@forgejo.viktorbarzin.me/".insteadOf "https://forgejo.viktorbarzin.me/" + fi + + exec t3 serve --mode web --host 0.0.0.0 --port 3773 --base-dir /data/t3 + EOF + ] + + port { + container_port = 3773 + } + + env_from { + secret_ref { + name = "t3-afk-secrets" + } + } + + env { + name = "HOME" + value = "/home/node" + } + env { + name = "T3CODE_HOME" + value = "/data/t3" + } + + # T3's API needs auth even for liveness; use a TCP probe on the port. + liveness_probe { + tcp_socket { + port = 3773 + } + initial_delay_seconds = 30 + period_seconds = 30 + } + readiness_probe { + tcp_socket { + port = 3773 + } + initial_delay_seconds = 15 + period_seconds = 10 + } + + volume_mount { + name = "data" + mount_path = "/data" + } + # User-level agent instructions (settingSources: user). + volume_mount { + name = "agent-claudemd" + mount_path = "/home/node/.claude/CLAUDE.md" + sub_path = "CLAUDE.md" + } + + # Burstable (tier-aux). A live agent thread (node + claude) is memory + # heavy; size for a small number of concurrent threads on this pilot + # instance. No CPU limit per cluster policy. + resources { + requests = { + cpu = "1" + memory = "2Gi" + } + # Capped at the tier-aux LimitRange max (4Gi/container). If real + # workloads OOM, opt the namespace out via the + # resource-governance/custom-limitrange label (as claude-agent-service + # does) and raise this. 
+ limits = { + memory = "4Gi" + } + } + } + + volume { + name = "data" + persistent_volume_claim { + claim_name = module.data.claim_name + } + } + + volume { + name = "agent-claudemd" + config_map { + name = kubernetes_config_map.agent_claudemd.metadata[0].name + } + } + } + } + } + + lifecycle { + ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1 + } +} + +# --- Service --- + +resource "kubernetes_service" "t3_afk" { + metadata { + name = "t3-afk" + namespace = kubernetes_namespace.t3_afk.metadata[0].name + labels = local.labels + } + spec { + selector = local.labels + port { + port = 3773 + target_port = 3773 + } + type = "ClusterIP" + } +} + +# --- Ingress --- +# The cockpit has no built-in user auth, so Authentik forward-auth is the gate. +module "ingress" { + source = "../../modules/kubernetes/ingress_factory" + auth = "required" + dns_type = "proxied" + namespace = kubernetes_namespace.t3_afk.metadata[0].name + name = "t3-afk" + service_name = kubernetes_service.t3_afk.metadata[0].name + port = 3773 + tls_secret_name = var.tls_secret_name +} diff --git a/stacks/t3-afk/terragrunt.hcl b/stacks/t3-afk/terragrunt.hcl new file mode 100644 index 00000000..6b746c65 --- /dev/null +++ b/stacks/t3-afk/terragrunt.hcl @@ -0,0 +1,18 @@ +include "root" { + path = find_in_parent_folders() +} + +dependency "platform" { + config_path = "../platform" + skip_outputs = true +} + +dependency "vault" { + config_path = "../vault" + skip_outputs = true +} + +dependency "external-secrets" { + config_path = "../external-secrets" + skip_outputs = true +} From 214638216bb84a8c8420960fe58c0f27fca005f7 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Sun, 14 Jun 2026 20:56:12 +0000 Subject: [PATCH 24/36] fix(anisette): wait_for_rollout=false so a slow first start can't strand the deploy out of state MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The docker.io fix created the deployment, but wait_for_rollout 
(default true) then hung on the OOMing pod and the apply failed — leaving the deployment in the cluster but NOT in terraform state, so every later apply hit 'deployments.apps "anisette" already exists'. Deleted that orphan and set wait_for_rollout=false (mirrors tts/llama-cpp slow-start services); readiness probe still gates Service traffic. Co-Authored-By: Claude Opus 4.8 --- stacks/anisette/main.tf | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/stacks/anisette/main.tf b/stacks/anisette/main.tf index 260ebf3b..c7dad0ba 100644 --- a/stacks/anisette/main.tf +++ b/stacks/anisette/main.tf @@ -55,6 +55,12 @@ resource "kubernetes_deployment" "anisette" { tier = local.tiers.aux } } + # anisette downloads + initializes Apple's CoreADI provisioning library on + # first start (slow, memory-spiky). wait_for_rollout=false so the apply never + # blocks on — and never strands out of terraform state — a pod that is still + # warming up (mirrors tts/llama-cpp). Pod health is still gated by the + # readiness probe below, so the Service only routes once it's actually up. + wait_for_rollout = false spec { replicas = 1 selector { From 82a0c5aedf1528f5fd4a567609a539472c0f1fba Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Mon, 15 Jun 2026 10:32:38 +0000 Subject: [PATCH 25/36] =?UTF-8?q?t3-afk:=20fix=20crashloop=20=E2=80=94=20e?= =?UTF-8?q?xclude=20from=20Keel=20at=20the=20deployment=20level?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Keel "patch"-downgraded the image docker.io/library/node:24 -> library/node:24.0.2, which is below t3@0.0.27's required node >=24.10, so `t3 serve` exited silently and the pod crash-looped (~160 restarts / 13h). Root cause: keel.sh/policy=never was on the POD-TEMPLATE labels, but Keel reads the policy at the DEPLOYMENT level. The cluster's Kyverno inject-keel-annotations is opt-out, so it stamped policy=patch and Keel acted on it. 
Fix: set keel.sh/policy=never as a deployment-level annotation; ignore_changes the Kyverno-injected keel.sh/pollSchedule + keel.sh/trigger annotations; the image stays TF-owned (apply reverted Keel's downgrade). Pod now 1/1, t3 serve 200. Co-Authored-By: Claude Opus 4.8 --- stacks/t3-afk/main.tf | 25 +++++++++++++++++++------ 1 file changed, 19 insertions(+), 6 deletions(-) diff --git a/stacks/t3-afk/main.tf b/stacks/t3-afk/main.tf index 22aedf0b..a56cffde 100644 --- a/stacks/t3-afk/main.tf +++ b/stacks/t3-afk/main.tf @@ -131,6 +131,16 @@ resource "kubernetes_deployment" "t3_afk" { name = "t3-afk" namespace = kubernetes_namespace.t3_afk.metadata[0].name labels = local.labels + # keel.sh/policy=never must be a DEPLOYMENT-level annotation — that's where + # Keel reads it. (A pod-template label is ignored by Keel, which is why the + # earlier attempt failed.) The cluster's Kyverno inject-keel-annotations + # policy is opt-OUT: it stamps policy=patch on any workload that doesn't + # carry its own keel.sh/policy — and Keel then "patch"-downgraded + # node:24 -> node:24.0.2 (below t3@0.0.27's required node >=24.10), which + # crash-looped `t3 serve`. ADR 0003 (Keel-excluded). + annotations = { + "keel.sh/policy" = "never" + } } spec { @@ -146,11 +156,7 @@ resource "kubernetes_deployment" "t3_afk" { template { metadata { - labels = merge(local.labels, { - # Belt-and-braces: this namespace isn't Keel-enrolled, but pin the - # churny pre-1.0 T3 explicitly out of any auto-upgrade. ADR 0003. 
- "keel.sh/policy" = "never" - }) + labels = local.labels } spec { @@ -312,7 +318,14 @@ resource "kubernetes_deployment" "t3_afk" { } lifecycle { - ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1 + ignore_changes = [ + spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1 + # Kyverno's inject-keel-annotations stamps pollSchedule/trigger alongside + # the policy; we own keel.sh/policy=never above, but ignore these two so + # they don't perpetually drift the plan. + metadata[0].annotations["keel.sh/pollSchedule"], + metadata[0].annotations["keel.sh/trigger"], + ] } } From bb3f5f23299264a745df676e3d70b659fa7b2a99 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Mon, 15 Jun 2026 14:37:59 +0000 Subject: [PATCH 26/36] workstation: stop the Claude Code onboarding wizard reappearing for terminal users MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit emo reported being "logged out" on terminal.viktorbarzin.me: every new shell dropped him at the first-run "Choose the text style" wizard, even though he'd used many sessions and is in fact fully authenticated. Root cause is NOT a logout — ~/.claude.json is a single file that all of a user's concurrent claude processes (the ttyd terminal + their t3-serve instance + agent sessions) read-modify-write, and a stale writer periodically drops top-level keys, including hasCompletedOnboarding. That bounces the next interactive session back to onboarding; credentials are safe in the separate ~/.claude/.credentials.json (which is why T3 kept working). wizard's own ~/.claude.json showed the same key loss, so this hits any heavy multi-session user. Fix: - skel/start-claude.sh: ensure_onboarding() idempotently re-asserts hasCompletedOnboarding (+ lastOnboardingVersion) in ~/.claude.json right before launching claude. Merge-only (never clobbers other keys), runs as the user, and no-ops if jq is missing or the file is empty/corrupt. 
So even if the race drops the flag, the next launch restores it before claude reads it. - t3-provision-users.sh: deploy_user_launcher() re-copies skel/start-claude.sh into every non-admin home (copy-if-changed) on the hourly reconcile. /etc/skel only seeds the launcher at account creation, so without this the fix (and any future launcher edit) would never reach existing users. .tmux.conf is deliberately not re-copied — terminal-lobby appends a managed section to it. Co-Authored-By: Claude Opus 4.8 --- docs/architecture/multi-tenancy.md | 2 ++ scripts/t3-provision-users.sh | 19 +++++++++++++++ scripts/workstation/skel/start-claude.sh | 31 ++++++++++++++++++++++++ 3 files changed, 52 insertions(+) diff --git a/docs/architecture/multi-tenancy.md b/docs/architecture/multi-tenancy.md index 2a5eebbf..baaf8007 100644 --- a/docs/architecture/multi-tenancy.md +++ b/docs/architecture/multi-tenancy.md @@ -543,6 +543,8 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1 **Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). 
Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour. +**Onboarding state self-heals (2026-06-15):** `~/.claude.json` is a single file that ALL of a user's concurrent `claude` processes (the ttyd terminal + their `t3-serve` instance + agent/SDK sessions) read-modify-write, so a stale writer periodically drops top-level keys — including `hasCompletedOnboarding` — which bounces the next *interactive* session back to the first-run "Choose the text style" wizard even though the user is fully logged in (credentials live in the SEPARATE `~/.claude/.credentials.json`, untouched by the race; first observed for emo 2026-06-15). The launcher (`skel/start-claude.sh`) now idempotently re-asserts `hasCompletedOnboarding` (+ `lastOnboardingVersion`) in `~/.claude.json` right before it runs `claude` — merge-only, never clobbers other keys, no-op if jq is missing or the file is empty/corrupt. And since the launcher is a per-user copy that `/etc/skel` only seeds at account creation, the reconcile's new `deploy_user_launcher` step re-copies `skel/start-claude.sh` into every non-admin home (copy-if-changed) so launcher edits now reach EXISTING users within the hour — `.tmux.conf` is deliberately NOT re-copied (terminal-lobby appends its own managed section to it). + **Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. Its location depends on the per-user `code_layout` in `roster.yaml`: `single` (default) puts the clone AT `~/code`; `workspace` makes `~/code` a plain directory of per-project clones — the infra clone at `~/code/infra` plus each roster `repos` entry cloned from Forgejo `viktor/` **as the user** (their PAT authenticates, so private repos work; clone failures WARN and retry next hour). 
Flipping a user to `workspace` auto-migrates their existing `~/code` clone to `~/code/infra` (local branches/dirty state survive; running processes follow the moved inode). ancamilea = workspace + `tripit` since 2026-06-10. The provisioner clones infra anonymously from the public GitHub mirror; **contribute access is wired per-user on top** (see below). The apply boundary still holds (`scripts/tg apply` needs an admin Vault token + cluster RBAC), but **pushing `master` is NOT inert** — the Forgejo→Woodpecker webhook fires `.woodpecker/default.yml` (`event: push, branch: master`, `require_approval: forks` only), which terragrunt-applies changed stacks. `master` is **branch-protected on Forgejo** (force-push disabled for everyone — history is append-only; push + merge whitelists = `viktor` + explicitly granted users, deploy keys allowed). **Allow-then-audit (Viktor, 2026-06-10):** `ebarzin` (emo) is on the whitelist and pushes straight to `master` — no PR gate. The tracking burden moves to: (a) **commit messages that record what + why** (the agent instructions in AGENTS.md and the managed claudeMd require the body to paraphrase the user's request), (b) the **`notify-nonadmin-push` Slack audit step** in `.woodpecker/default.yml` — every master push by a non-admin author is posted to Slack (admin pushes are not), and (c) non-admins **never use `[ci skip]`** so every change fires the pipeline (and thus the audit feed). Users NOT on the whitelist fall back to `/` branches + PRs. 
**Clones stay fresh automatically** (2026-06-10): the hourly `t3-provision-users` reconcile runs `refresh_user_clone` over every managed clone — the infra clone and any workspace repos (fetch all remotes + fast-forward `master`, ONLY when on master with a clean tree and an upstream — dirty trees and local commits are left alone with a WARN) — and also `wire_forgejo_remote`, which idempotently adds the documented `forgejo` remote + `forgejo/master` upstream to infra clones that predate that contract. `start-claude.sh` does the same freshen at session launch (10s fetch cap per repo so an offline remote never stalls the session; workspace layouts freshen each repo under `~/code`). **Contribute access (per non-admin, manual — the anca/tripit PAT precedent):** diff --git a/scripts/t3-provision-users.sh b/scripts/t3-provision-users.sh index 31bc6f08..593de0f9 100644 --- a/scripts/t3-provision-users.sh +++ b/scripts/t3-provision-users.sh @@ -270,6 +270,24 @@ install_user_claude_token() { log "shared Claude token -> $user (t3-serve env; restart needed to take effect)" } +# Re-deploy the managed per-user Claude launcher to ~/start-claude.sh. /etc/skel only +# seeds it at account creation (setup-devvm.sh), so without this a launcher edit never +# reaches EXISTING users — they keep running a stale copy. Copy-if-changed from the repo's +# skel/, owned by the user, 0755. (We deliberately do NOT re-copy .tmux.conf: terminal-lobby +# appends a managed persistence section to each user's ~/.tmux.conf that a re-copy would clobber.) 
+deploy_user_launcher() { + local user="$1" home src dst + src="$WORKSTATION_DIR/skel/start-claude.sh" + home="$(getent passwd "$user" | cut -d: -f6)" + [[ -n "$home" && -d "$home" && -f "$src" ]] || return 0 + dst="$home/start-claude.sh" + cmp -s "$src" "$dst" 2>/dev/null && return 0 # already current -> no churn + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] deploy launcher -> $dst"; return 0; fi + install -m 0755 "$src" "$dst" + chown "$user:$user" "$dst" + log "deployed start-claude.sh -> $user" +} + [[ $EUID -eq 0 ]] || { echo "t3-provision-users: must run as root" >&2; exit 1; } for bin in python3 jq; do command -v "$bin" >/dev/null || { echo "missing $bin" >&2; exit 1; }; done [[ -f "$ROSTER" && -f "$ENGINE" ]] || { echo "roster/engine not under $WORKSTATION_DIR" >&2; exit 1; } @@ -346,6 +364,7 @@ while IFS=$'\t' read -r os_user tier shell groups_csv code_layout repos_csv; do fi install_user_kubeconfig "$os_user" install_user_claude_token "$os_user" + deploy_user_launcher "$os_user" # keep ~/start-claude.sh current (skel only seeds new accounts) fi refresh_codex_mirror "$os_user" # all tiers — mirror of the managed claudeMd done < <(jq -r '.accounts[] | [.os_user, .tier, .shell, (if (.groups|length)==0 then "-" else (.groups|join(",")) end), .code_layout, (if (.repos|length)==0 then "-" else (.repos|join(",")) end)] | @tsv' "$desired_file") diff --git a/scripts/workstation/skel/start-claude.sh b/scripts/workstation/skel/start-claude.sh index 4feb44d7..dcd716fb 100755 --- a/scripts/workstation/skel/start-claude.sh +++ b/scripts/workstation/skel/start-claude.sh @@ -51,6 +51,37 @@ launch() { fi } +# Re-assert Claude Code's first-run onboarding flag before launch. 
~/.claude.json is a +# SINGLE file that ALL of a user's concurrent claude processes (this terminal, their +# t3-serve instance, agent/SDK sessions) read-modify-write; a stale writer periodically +# drops top-level keys — including hasCompletedOnboarding — which throws the next +# interactive session back to the "Choose the text style" wizard even though the user is +# fully logged in (credentials live in the SEPARATE ~/.claude/.credentials.json, which is +# never affected). Idempotent, runs as the user right before launch, never clobbers other +# keys. Best-effort: no-op if jq is missing or the file is empty/corrupt (claude self-heals). +ensure_onboarding() { + command -v jq >/dev/null 2>&1 || return 0 + local cfg="$HOME/.claude.json" ver tmp + ver="$(claude --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)" + if [ -s "$cfg" ]; then + jq -e . "$cfg" >/dev/null 2>&1 || return 0 # corrupt -> leave for claude + [ "$(jq -r '.hasCompletedOnboarding // false' "$cfg")" = "true" ] && return 0 # already set -> no write + elif [ -e "$cfg" ]; then + return 0 # empty (mid-write?) -> leave it + fi + tmp="$(mktemp "${cfg}.XXXXXX")" || return 0 + if [ -f "$cfg" ]; then + jq --arg v "$ver" '.hasCompletedOnboarding = true + | (if $v != "" then .lastOnboardingVersion = $v else . end)' "$cfg" > "$tmp" 2>/dev/null \ + && chmod 600 "$tmp" && mv "$tmp" "$cfg" || rm -f "$tmp" + else + jq -n --arg v "$ver" '{hasCompletedOnboarding: true} + + (if $v != "" then {lastOnboardingVersion: $v} else {} end)' > "$tmp" 2>/dev/null \ + && chmod 600 "$tmp" && mv "$tmp" "$cfg" || rm -f "$tmp" + fi +} +ensure_onboarding + # Deliberately not `exec` so we can branch on the exit code: clean quit ends the # pane (ttyd closes the terminal); a crash drops to a shell so the tmux session # isn't destroyed-and-recreated in a ttyd auto-reconnect loop. 
From 4a48f065e961445aff938c4ecab98d0ffcfab0d9 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Mon, 15 Jun 2026 17:03:37 +0000 Subject: [PATCH 27/36] mcp: drop project-scoped paperless from .mcp.json (paperless is now wizard-only) Paperless is a personal tool for wizard, not shared. It was project-scoped in the infra repo's .mcp.json (the in-cluster paperless-mcp proxy), so every user whose ~/code IS an infra clone (emo, ancamilea) auto-loaded it. Per request, paperless should be wizard-only: wizard now runs his own direct, token-based paperless MCP in his user-scope config (a local barryw/paperlessmcp container -> paperless-ngx). Removing the shared entry so emo and other infra-clone users no longer get it; the `ha` MCP stays project-scoped. emo's clone drops it on next freshen. Co-Authored-By: Claude Opus 4.8 --- .mcp.json | 4 ---- 1 file changed, 4 deletions(-) diff --git a/.mcp.json b/.mcp.json index 9f39ff76..18bb4d81 100644 --- a/.mcp.json +++ b/.mcp.json @@ -3,10 +3,6 @@ "ha": { "type": "http", "url": "${HA_MCP_URL}" - }, - "paperless": { - "type": "http", - "url": "http://paperless-mcp.paperless-mcp.svc.cluster.local/mcp" } } } From eecd78233bc2bafaea6537d2dece0c490721e8ce Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Mon, 15 Jun 2026 17:12:05 +0000 Subject: [PATCH 28/36] workstation: standardize on the native claude install (drop npm-global + npx) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Question from Viktor: should claude run via the binary or npx? Answer: the native install is the recommended runtime (self-contained, self-updating ~/.local/bin/claude; installMethod=native) — and every existing user had already auto-migrated to it, leaving the npm-global copy empty and the npx fallback dead. 
"Leave only the recommended setup": - setup-devvm.sh: node is now installed ONLY for the t3 CLI; dropped the machine-wide `npm install -g @anthropic-ai/claude-code` (npm/npx is not the recommended runtime and just shadowed the per-user native installs). - t3-provision-users.sh: new per-user `install_user_claude_native` (runs the official https://claude.ai/install.sh AS the user, idempotent/skip-if-present) — provisions native claude for BOTH the terminal launcher and each t3-serve instance, replacing the npm bootstrap. - skel/start-claude.sh: launcher runs the native `claude` only; if missing it bootstraps via the native installer (was an `npx @anthropic-ai/claude-code` fallback). - docs/architecture/multi-tenancy.md: documented the native-only runtime model. node stays (the pinned t3 CLI is npm-global). Verified: native installer reachable + produces ~/.local/bin/claude 2.1.177; all three scripts pass bash -n + shellcheck. Co-Authored-By: Claude Opus 4.8 --- docs/architecture/multi-tenancy.md | 2 ++ scripts/t3-provision-users.sh | 20 ++++++++++++++++++++ scripts/workstation/setup-devvm.sh | 16 +++++++--------- scripts/workstation/skel/start-claude.sh | 13 ++++++++----- 4 files changed, 37 insertions(+), 14 deletions(-) diff --git a/docs/architecture/multi-tenancy.md b/docs/architecture/multi-tenancy.md index baaf8007..17163820 100644 --- a/docs/architecture/multi-tenancy.md +++ b/docs/architecture/multi-tenancy.md @@ -545,6 +545,8 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1 **Onboarding state self-heals (2026-06-15):** `~/.claude.json` is a single file that ALL of a user's concurrent `claude` processes (the ttyd terminal + their `t3-serve` instance + agent/SDK sessions) read-modify-write, so a stale writer periodically drops top-level keys — including `hasCompletedOnboarding` — which bounces the next *interactive* session back to the first-run "Choose the text style" wizard even though the user is fully logged in 
(credentials live in the SEPARATE `~/.claude/.credentials.json`, untouched by the race; first observed for emo 2026-06-15). The launcher (`skel/start-claude.sh`) now idempotently re-asserts `hasCompletedOnboarding` (+ `lastOnboardingVersion`) in `~/.claude.json` right before it runs `claude` — merge-only, never clobbers other keys, no-op if jq is missing or the file is empty/corrupt. And since the launcher is a per-user copy that `/etc/skel` only seeds at account creation, the reconcile's new `deploy_user_launcher` step re-copies `skel/start-claude.sh` into every non-admin home (copy-if-changed) so launcher edits now reach EXISTING users within the hour — `.tmux.conf` is deliberately NOT re-copied (terminal-lobby appends its own managed section to it). +**Claude Code runtime — native, per-user (2026-06-15):** `claude` is the **native** install (`~/.local/bin/claude` → `~/.local/share/claude/versions/`, self-updating; `installMethod: native`) — NOT npm-global or npx. It is the runtime for both the ttyd launcher and each `t3-serve` instance. `setup-devvm.sh` installs node ONLY for the `t3` CLI (not claude); per-user native claude is provisioned by the reconcile's `install_user_claude_native` (covers terminal + t3, idempotent, skip-if-present) and self-bootstrapped by `start-claude.sh` on first launch — both via the official `https://claude.ai/install.sh`. The legacy machine-wide `npm install -g @anthropic-ai/claude-code` bootstrap and the launcher's `npx` fallback were removed; existing users had already auto-migrated to native, and the npm-global dir was empty. + **Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. 
Its location depends on the per-user `code_layout` in `roster.yaml`: `single` (default) puts the clone AT `~/code`; `workspace` makes `~/code` a plain directory of per-project clones — the infra clone at `~/code/infra` plus each roster `repos` entry cloned from Forgejo `viktor/` **as the user** (their PAT authenticates, so private repos work; clone failures WARN and retry next hour). Flipping a user to `workspace` auto-migrates their existing `~/code` clone to `~/code/infra` (local branches/dirty state survive; running processes follow the moved inode). ancamilea = workspace + `tripit` since 2026-06-10. The provisioner clones infra anonymously from the public GitHub mirror; **contribute access is wired per-user on top** (see below). The apply boundary still holds (`scripts/tg apply` needs an admin Vault token + cluster RBAC), but **pushing `master` is NOT inert** — the Forgejo→Woodpecker webhook fires `.woodpecker/default.yml` (`event: push, branch: master`, `require_approval: forks` only), which terragrunt-applies changed stacks. `master` is **branch-protected on Forgejo** (force-push disabled for everyone — history is append-only; push + merge whitelists = `viktor` + explicitly granted users, deploy keys allowed). **Allow-then-audit (Viktor, 2026-06-10):** `ebarzin` (emo) is on the whitelist and pushes straight to `master` — no PR gate. The tracking burden moves to: (a) **commit messages that record what + why** (the agent instructions in AGENTS.md and the managed claudeMd require the body to paraphrase the user's request), (b) the **`notify-nonadmin-push` Slack audit step** in `.woodpecker/default.yml` — every master push by a non-admin author is posted to Slack (admin pushes are not), and (c) non-admins **never use `[ci skip]`** so every change fires the pipeline (and thus the audit feed). Users NOT on the whitelist fall back to `/` branches + PRs. 
**Clones stay fresh automatically** (2026-06-10): the hourly `t3-provision-users` reconcile runs `refresh_user_clone` over every managed clone — the infra clone and any workspace repos (fetch all remotes + fast-forward `master`, ONLY when on master with a clean tree and an upstream — dirty trees and local commits are left alone with a WARN) — and also `wire_forgejo_remote`, which idempotently adds the documented `forgejo` remote + `forgejo/master` upstream to infra clones that predate that contract. `start-claude.sh` does the same freshen at session launch (10s fetch cap per repo so an offline remote never stalls the session; workspace layouts freshen each repo under `~/code`). **Contribute access (per non-admin, manual — the anca/tripit PAT precedent):** diff --git a/scripts/t3-provision-users.sh b/scripts/t3-provision-users.sh index 593de0f9..c5bbe4a9 100644 --- a/scripts/t3-provision-users.sh +++ b/scripts/t3-provision-users.sh @@ -288,6 +288,25 @@ deploy_user_launcher() { log "deployed start-claude.sh -> $user" } +# Ensure the per-user NATIVE claude install (the recommended runtime: ~user/.local/bin/claude, +# self-updating) — used by BOTH the terminal launcher AND the user's t3-serve instance. We do +# NOT npm-install claude system-wide (npm/npx isn't the recommended runtime); each user gets +# their own native install. Idempotent: skip if already present. Runs the official native +# installer AS the user (into their ~/.local). Best-effort: a failure WARNs and retries next +# reconcile (start-claude.sh also self-bootstraps the terminal path). 
+install_user_claude_native() { + local user="$1" home + home="$(getent passwd "$user" | cut -d: -f6)" + [[ -n "$home" && -d "$home" ]] || return 0 + [[ -x "$home/.local/bin/claude" ]] && return 0 # already native -> done + if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] native claude install -> $user"; return 0; fi + if runuser -u "$user" -- bash -lc 'set -o pipefail; curl -fsSL https://claude.ai/install.sh | bash' >/dev/null 2>&1; then + log "installed native claude -> $user" + else + log "WARN: native claude install failed for $user (retries next reconcile)" + fi +} + [[ $EUID -eq 0 ]] || { echo "t3-provision-users: must run as root" >&2; exit 1; } for bin in python3 jq; do command -v "$bin" >/dev/null || { echo "missing $bin" >&2; exit 1; }; done [[ -f "$ROSTER" && -f "$ENGINE" ]] || { echo "roster/engine not under $WORKSTATION_DIR" >&2; exit 1; } @@ -367,6 +386,7 @@ while IFS=$'\t' read -r os_user tier shell groups_csv code_layout repos_csv; do deploy_user_launcher "$os_user" # keep ~/start-claude.sh current (skel only seeds new accounts) fi refresh_codex_mirror "$os_user" # all tiers — mirror of the managed claudeMd + install_user_claude_native "$os_user" # all tiers — per-user native claude (terminal + t3); no npm/npx done < <(jq -r '.accounts[] | [.os_user, .tier, .shell, (if (.groups|length)==0 then "-" else (.groups|join(",")) end), .code_layout, (if (.repos|length)==0 then "-" else (.repos|join(",")) end)] | @tsv' "$desired_file") # 5) per-user .env (sticky port) + enable t3-serve@ diff --git a/scripts/workstation/setup-devvm.sh b/scripts/workstation/setup-devvm.sh index 4bf6908b..be6e0e12 100755 --- a/scripts/workstation/setup-devvm.sh +++ b/scripts/workstation/setup-devvm.sh @@ -21,7 +21,13 @@ export DEBIAN_FRONTEND=noninteractive apt-get update -qq apt-get install -y "${PKGS[@]}" >/dev/null -# 2) node >= 18 + claude-code (claude-code requires node >= 18) +# 2) node >= 18 — needed for the t3 CLI (npm-global, below). 
NOT for claude-code: +# claude-code is the per-user NATIVE install (the recommended, self-updating +# ~/.local/bin/claude), provisioned per user by t3-provision-users +# (install_user_claude_native) and self-bootstrapped by start-claude.sh on first launch. +# We deliberately do NOT `npm install -g @anthropic-ai/claude-code` — npm/npx is not the +# recommended runtime, and a system-wide npm copy just shadows/duplicates the per-user +# native installs everyone auto-migrates to anyway. need_node=1 if command -v node >/dev/null; then [[ "$(node -v | sed 's/^v\([0-9]*\).*/\1/')" -ge 18 ]] && need_node=0 @@ -31,14 +37,6 @@ if [[ $need_node -eq 1 ]]; then curl -fsSL https://deb.nodesource.com/setup_22.x | bash - >/dev/null apt-get install -y nodejs >/dev/null fi -# Detect the GLOBAL npm package, NOT whatever `claude` resolves to on PATH: the admin's -# personal ~/.local/bin/claude shadows it, so `command -v claude` silently skipped the -# system-wide install — leaving /usr/lib/node_modules/@anthropic-ai empty and fresh -# non-admins with no claude (they only worked because the admin's install was on PATH). -if ! npm ls -g --depth=0 @anthropic-ai/claude-code >/dev/null 2>&1; then - log "npm: installing @anthropic-ai/claude-code (system-wide)" - npm install -g @anthropic-ai/claude-code >/dev/null -fi # 2b) t3 (the per-user coding surface) — PINNED, never nightly/latest. t3 is pre-1.0 and # ships breaking auth-schema + bootstrap-API changes our t3-dispatch can't follow blind diff --git a/scripts/workstation/skel/start-claude.sh b/scripts/workstation/skel/start-claude.sh index dcd716fb..2353eace 100755 --- a/scripts/workstation/skel/start-claude.sh +++ b/scripts/workstation/skel/start-claude.sh @@ -42,13 +42,16 @@ else done fi -# Prefer the system-wide `claude` (installed by setup-devvm.sh); fall back to npx. +# Run the NATIVE `claude` (the recommended install: ~/.local/bin/claude, self-updating). +# No npm/npx. 
If the native binary is missing (a fresh account before the hourly reconcile +# has provisioned it), bootstrap it with the official native installer, then run it. launch() { - if command -v claude >/dev/null 2>&1; then - claude "$@" - else - npx @anthropic-ai/claude-code "$@" + if ! command -v claude >/dev/null 2>&1; then + echo " Installing Claude Code (native) for $(id -un) …" + curl -fsSL https://claude.ai/install.sh | bash || return 127 + export PATH="$HOME/.local/bin:$PATH" fi + claude "$@" } # Re-assert Claude Code's first-run onboarding flag before launch. ~/.claude.json is a From ef555c7e02fb2fb9fcedc441c7e5ec48619159cb Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Mon, 15 Jun 2026 17:20:03 +0000 Subject: [PATCH 29/36] workstation: put ~/.local/bin on PATH so the launcher finds native claude MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Viktor hit "~/.local/bin is not part of the PATH". Root cause: the native claude binary lives in ~/.local/bin, but the terminal launcher (start-claude.sh) runs in tmux's NON-login bash env, which doesn't source the user's shell rc where the native installer put ~/.local/bin on PATH. So `command -v claude` failed there → the launcher's bootstrap re-ran the native installer → the installer printed the PATH warning. (Interactive zsh already had ~/.local/bin via the per-user installer rc edit, and t3-serve sets PATH in its unit — so only the terminal launcher was affected.) - skel/start-claude.sh: prepend ~/.local/bin to PATH near the top (guarded/idempotent), before the launch logic — so `claude` is found, no reinstall, no warning. - setup-devvm.sh: install /etc/profile.d/10-local-bin.sh — adds ~/.local/bin to PATH for all LOGIN shells machine-wide (SSH etc.), independent of the per-user installer rc edit (fresh-user-safe). zsh login picks it up via /etc/zsh/zprofile -> /etc/profile. - docs/architecture/multi-tenancy.md: documented the three PATH-injection points. 
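For reference, the guard that both scripts rely on can be sketched as below. This is a minimal illustration, not the shipped code: the function name `ensure_local_bin_on_path` is hypothetical (the scripts inline the `case` statement rather than wrapping it), while the body mirrors the guard added in this patch.

```shell
# Hypothetical wrapper name, for illustration only; the scripts inline this case statement.
ensure_local_bin_on_path() {
  # Wrapping PATH in colons lets one pattern match the entry at the start,
  # in the middle, or at the end of the list.
  case ":$PATH:" in
    *":$HOME/.local/bin:"*) ;;                    # already present: leave PATH untouched
    *) export PATH="$HOME/.local/bin:$PATH" ;;    # missing: prepend so the native claude is found
  esac
}

ensure_local_bin_on_path
ensure_local_bin_on_path   # a repeated call is a no-op, so no duplicate entry is added
```

Prepending rather than appending means the per-user native binary is found ahead of any other `claude` that may be on PATH.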
Verified: guard adds-when-missing / no-dup-when-present; all scripts pass bash -n. Co-Authored-By: Claude Opus 4.8 --- docs/architecture/multi-tenancy.md | 2 +- scripts/workstation/setup-devvm.sh | 17 +++++++++++++++++ scripts/workstation/skel/start-claude.sh | 8 ++++++++ 3 files changed, 26 insertions(+), 1 deletion(-) diff --git a/docs/architecture/multi-tenancy.md b/docs/architecture/multi-tenancy.md index 17163820..7764ebb1 100644 --- a/docs/architecture/multi-tenancy.md +++ b/docs/architecture/multi-tenancy.md @@ -545,7 +545,7 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1 **Onboarding state self-heals (2026-06-15):** `~/.claude.json` is a single file that ALL of a user's concurrent `claude` processes (the ttyd terminal + their `t3-serve` instance + agent/SDK sessions) read-modify-write, so a stale writer periodically drops top-level keys — including `hasCompletedOnboarding` — which bounces the next *interactive* session back to the first-run "Choose the text style" wizard even though the user is fully logged in (credentials live in the SEPARATE `~/.claude/.credentials.json`, untouched by the race; first observed for emo 2026-06-15). The launcher (`skel/start-claude.sh`) now idempotently re-asserts `hasCompletedOnboarding` (+ `lastOnboardingVersion`) in `~/.claude.json` right before it runs `claude` — merge-only, never clobbers other keys, no-op if jq is missing or the file is empty/corrupt. And since the launcher is a per-user copy that `/etc/skel` only seeds at account creation, the reconcile's new `deploy_user_launcher` step re-copies `skel/start-claude.sh` into every non-admin home (copy-if-changed) so launcher edits now reach EXISTING users within the hour — `.tmux.conf` is deliberately NOT re-copied (terminal-lobby appends its own managed section to it). 
-**Claude Code runtime — native, per-user (2026-06-15):** `claude` is the **native** install (`~/.local/bin/claude` → `~/.local/share/claude/versions/`, self-updating; `installMethod: native`) — NOT npm-global or npx. It is the runtime for both the ttyd launcher and each `t3-serve` instance. `setup-devvm.sh` installs node ONLY for the `t3` CLI (not claude); per-user native claude is provisioned by the reconcile's `install_user_claude_native` (covers terminal + t3, idempotent, skip-if-present) and self-bootstrapped by `start-claude.sh` on first launch — both via the official `https://claude.ai/install.sh`. The legacy machine-wide `npm install -g @anthropic-ai/claude-code` bootstrap and the launcher's `npx` fallback were removed; existing users had already auto-migrated to native, and the npm-global dir was empty. +**Claude Code runtime — native, per-user (2026-06-15):** `claude` is the **native** install (`~/.local/bin/claude` → `~/.local/share/claude/versions/`, self-updating; `installMethod: native`) — NOT npm-global or npx. It is the runtime for both the ttyd launcher and each `t3-serve` instance. `setup-devvm.sh` installs node ONLY for the `t3` CLI (not claude); per-user native claude is provisioned by the reconcile's `install_user_claude_native` (covers terminal + t3, idempotent, skip-if-present) and self-bootstrapped by `start-claude.sh` on first launch — both via the official `https://claude.ai/install.sh`. The legacy machine-wide `npm install -g @anthropic-ai/claude-code` bootstrap and the launcher's `npx` fallback were removed; existing users had already auto-migrated to native, and the npm-global dir was empty. **PATH (`~/.local/bin`, where the native binary lives):** ensured three ways — `/etc/profile.d/10-local-bin.sh` for login shells (machine-wide, fresh-user-safe), `start-claude.sh` itself (the launcher runs in tmux's non-login env that skips the user's shell rc), and `t3-serve@.service` (`Environment=PATH=…:/home/%i/.local/bin`). 
**Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. Its location depends on the per-user `code_layout` in `roster.yaml`: `single` (default) puts the clone AT `~/code`; `workspace` makes `~/code` a plain directory of per-project clones — the infra clone at `~/code/infra` plus each roster `repos` entry cloned from Forgejo `viktor/` **as the user** (their PAT authenticates, so private repos work; clone failures WARN and retry next hour). Flipping a user to `workspace` auto-migrates their existing `~/code` clone to `~/code/infra` (local branches/dirty state survive; running processes follow the moved inode). ancamilea = workspace + `tripit` since 2026-06-10. The provisioner clones infra anonymously from the public GitHub mirror; **contribute access is wired per-user on top** (see below). The apply boundary still holds (`scripts/tg apply` needs an admin Vault token + cluster RBAC), but **pushing `master` is NOT inert** — the Forgejo→Woodpecker webhook fires `.woodpecker/default.yml` (`event: push, branch: master`, `require_approval: forks` only), which terragrunt-applies changed stacks. `master` is **branch-protected on Forgejo** (force-push disabled for everyone — history is append-only; push + merge whitelists = `viktor` + explicitly granted users, deploy keys allowed). **Allow-then-audit (Viktor, 2026-06-10):** `ebarzin` (emo) is on the whitelist and pushes straight to `master` — no PR gate. 
The tracking burden moves to: (a) **commit messages that record what + why** (the agent instructions in AGENTS.md and the managed claudeMd require the body to paraphrase the user's request), (b) the **`notify-nonadmin-push` Slack audit step** in `.woodpecker/default.yml` — every master push by a non-admin author is posted to Slack (admin pushes are not), and (c) non-admins **never use `[ci skip]`** so every change fires the pipeline (and thus the audit feed). Users NOT on the whitelist fall back to `/` branches + PRs. **Clones stay fresh automatically** (2026-06-10): the hourly `t3-provision-users` reconcile runs `refresh_user_clone` over every managed clone — the infra clone and any workspace repos (fetch all remotes + fast-forward `master`, ONLY when on master with a clean tree and an upstream — dirty trees and local commits are left alone with a WARN) — and also `wire_forgejo_remote`, which idempotently adds the documented `forgejo` remote + `forgejo/master` upstream to infra clones that predate that contract. `start-claude.sh` does the same freshen at session launch (10s fetch cap per repo so an offline remote never stalls the session; workspace layouts freshen each repo under `~/code`). diff --git a/scripts/workstation/setup-devvm.sh b/scripts/workstation/setup-devvm.sh index be6e0e12..b0275bbf 100755 --- a/scripts/workstation/setup-devvm.sh +++ b/scripts/workstation/setup-devvm.sh @@ -38,6 +38,23 @@ if [[ $need_node -eq 1 ]]; then apt-get install -y nodejs >/dev/null fi +# 2a) ~/.local/bin on PATH for all LOGIN shells (machine-wide). The native claude install +# lives at ~/.local/bin; this guarantees login shells (SSH, etc.) find it regardless of +# whether the per-user native-installer rc edit ran. (The terminal launcher sets PATH +# itself, and t3-serve@.service hard-sets PATH in the unit.) +install -d -m 0755 /etc/profile.d +cat > /etc/profile.d/10-local-bin.sh <<'PROFILE_EOF' +# Native per-user installs (e.g. 
claude-code) live in ~/.local/bin — put it on PATH. +# Guarded so it never duplicates. Sourced by login shells (bash via /etc/profile; zsh +# login via /etc/zsh/zprofile -> /etc/profile). +case ":$PATH:" in + *":$HOME/.local/bin:"*) ;; + *) export PATH="$HOME/.local/bin:$PATH" ;; +esac +PROFILE_EOF +chmod 0644 /etc/profile.d/10-local-bin.sh +log "/etc/profile.d/10-local-bin.sh (~/.local/bin on PATH for login shells)" + # 2b) t3 (the per-user coding surface) — PINNED, never nightly/latest. t3 is pre-1.0 and # ships breaking auth-schema + bootstrap-API changes our t3-dispatch can't follow blind # (2026-06-09 outage: a nightly auto-update broke pairing for ALL users). The daily diff --git a/scripts/workstation/skel/start-claude.sh b/scripts/workstation/skel/start-claude.sh index 2353eace..b3e25744 100755 --- a/scripts/workstation/skel/start-claude.sh +++ b/scripts/workstation/skel/start-claude.sh @@ -11,6 +11,14 @@ echo " Starting Claude Code in $HOME/code ..." echo " (Right-click for tmux menu, or Ctrl+B then | or - to split)" echo "" +# The native claude install lives in ~/.local/bin. This launcher runs in tmux's non-login +# env, which does NOT source the user's shell rc (where the native installer added it to +# PATH) — so `claude` would appear missing here. Put it on PATH ourselves; guarded/idempotent. +case ":$PATH:" in + *":$HOME/.local/bin:"*) ;; + *) export PATH="$HOME/.local/bin:$PATH" ;; +esac + name_args=() if [ -n "${TMUX:-}" ]; then sess="$(tmux display-message -p '#{session_name}' 2>/dev/null)" From 92c5b2497545be0b030f6c1b004b5f341906020d Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Mon, 15 Jun 2026 20:19:17 +0000 Subject: [PATCH 30/36] docs: ghcr_pull_token is now a scoped read:packages PAT, not the admin alias Minted a dedicated classic GitHub PAT scoped to read:packages and stored it in Vault secret/viktor/ghcr_pull_token (2026-06-15), replacing the previous alias of the broad admin github_pat. 
Propagated via targeted apply of module.kyverno.kubernetes_secret.ghcr_credentials (Kyverno re-syncs the allowlisted namespaces). Document the new cred + the manual rotation recipe. Closes: code-h2il Co-Authored-By: Claude Fable 5 --- .claude/CLAUDE.md | 9 ++++++--- docs/architecture/ci-cd.md | 10 +++++++--- 2 files changed, 13 insertions(+), 6 deletions(-) diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index 1a81118b..d2e581f4 100755 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -138,9 +138,12 @@ audiobook-search, council-complaints) now also land on ghcr. wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci, k8s-portal. Pulled via the Kyverno-synced `ghcr-credentials` allowlist (`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`; NOT cluster-wide; cred - = Vault `secret/viktor/ghcr_pull_token`, an alias of the admin `github_pat` — - GitHub has no token-mint API, swap the alias value if a scoped token is ever - UI-minted). + = Vault `secret/viktor/ghcr_pull_token`, a dedicated classic PAT scoped to + `read:packages` (UI-minted 2026-06-15; no longer the admin `github_pat` + alias). GitHub has no token-mint API, so rotation is manual: re-mint → + `vault kv patch secret/viktor ghcr_pull_token=…` → targeted apply + `module.kyverno.kubernetes_secret.ghcr_credentials` (reads Vault, dodges the + git-crypt tls-secret-sync landmine), Kyverno re-syncs the allowlist). 
**Infra-owned images (issues #29/#30)** build on GHA workflows IN the infra repo's own `.github/workflows/` (added to the GitHub lineage via PR; the diff --git a/docs/architecture/ci-cd.md b/docs/architecture/ci-cd.md index c4493f86..1c78950f 100644 --- a/docs/architecture/ci-cd.md +++ b/docs/architecture/ci-cd.md @@ -100,9 +100,13 @@ Private-image pulls use the `ghcr-credentials` dockerconfigjson, cloned by the kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit **ALLOWLIST** of private-ghcr namespaces only (NOT cluster-wide; source `stacks/kyverno/modules/kyverno/ghcr-credentials.tf`). Cred = Vault -`secret/viktor/ghcr_pull_token` (an alias of the admin `github_pat` — GitHub -has no token-mint API; swap the alias value if a scoped token is ever -UI-minted). +`secret/viktor/ghcr_pull_token` (a dedicated classic PAT scoped to +`read:packages`, UI-minted 2026-06-15 — no longer the admin `github_pat` alias. +GitHub has no token-mint API, so rotation is manual: re-mint the classic +`read:packages` PAT → `vault kv patch secret/viktor ghcr_pull_token=…` → +targeted apply `module.kyverno.kubernetes_secret.ghcr_credentials` (reads Vault; +avoids the git-crypt `tls-secret-sync` landmine on a locked clone), which +Kyverno then re-syncs to the allowlisted namespaces). ### Migrated apps (issues #13–#27) From 34c30ac2bf416f0320713f9b025e22e47af37f4b Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Mon, 15 Jun 2026 20:19:39 +0000 Subject: [PATCH 31/36] =?UTF-8?q?t3-afk:=20auto-pair=20dispatcher=20sideca?= =?UTF-8?q?r=20=E2=80=94=20no=20manual=20pairing?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The bare `t3 serve` behind Authentik showed the manual /pair#token screen, which didn't connect. 
Mirror the devvm t3-dispatch: a small stdlib-Node sidecar fronts t3 serve, and on a cookieless (already Authentik-gated) document load it mints a pairing credential (`t3 auth pairing create`) and exchanges it at /api/auth/browser-session for the t3_session cookie, then 302s back. Everything else — including WebSocket upgrades for the live cockpit — reverse-proxies to :3773. The Service now targets the sidecar (:8080). Verified: cookieless GET -> 302 + Set-Cookie t3_session; cookied GET -> 200 SPA. Matches the t3.viktorbarzin.me experience (Authentik login -> straight into the cockpit). Co-Authored-By: Claude Opus 4.8 --- stacks/t3-afk/files/dispatcher.js | 136 ++++++++++++++++++++++++++++++ stacks/t3-afk/main.tf | 63 +++++++++++++- 2 files changed, 198 insertions(+), 1 deletion(-) create mode 100644 stacks/t3-afk/files/dispatcher.js diff --git a/stacks/t3-afk/files/dispatcher.js b/stacks/t3-afk/files/dispatcher.js new file mode 100644 index 00000000..6cc9a800 --- /dev/null +++ b/stacks/t3-afk/files/dispatcher.js @@ -0,0 +1,136 @@ +// t3-afk auto-pair dispatcher +// ---------------------------------------------------------------------------- +// Replicates the devvm t3-dispatch experience for the single in-cluster T3 +// instance. The ingress is Authentik-gated (auth=required), so every request +// that reaches here is already authenticated. On a cookieless *document* +// navigation we mint a one-time pairing credential (`t3 auth pairing create`) +// and exchange it at the t3 server's /api/auth/browser-session endpoint for the +// `t3_session` cookie, then 302 back — so the user never sees the manual +// /pair#token screen. Everything else (incl. WebSocket upgrades for the cockpit +// live stream + terminals) is reverse-proxied straight through to t3 serve. +// +// Single upstream, same pod (localhost) — kept dependency-free (Node stdlib). 
+'use strict'; +const http = require('http'); +const net = require('net'); +const { execFile } = require('child_process'); + +const UPSTREAM_HOST = '127.0.0.1'; +const UPSTREAM_PORT = Number(process.env.T3_UPSTREAM_PORT || 3773); +const LISTEN_PORT = Number(process.env.DISPATCHER_PORT || 8080); +const T3_BIN = process.env.T3_BIN || '/data/npm-global/bin/t3'; +const BASE_DIR = process.env.T3CODE_HOME || '/data/t3'; +const COOKIE = 't3_session'; +const childEnv = { ...process.env, PATH: '/data/npm-global/bin:' + (process.env.PATH || ''), HOME: '/home/node' }; + +const hasSession = (req) => + (req.headers.cookie || '').split(/;\s*/).some((c) => c.startsWith(COOKIE + '=')); + +const isDocNav = (req) => { + if (req.method !== 'GET') return false; + const dest = req.headers['sec-fetch-dest']; + if (dest) return dest === 'document'; + return (req.headers['accept'] || '').includes('text/html'); +}; + +const mintCredential = () => + new Promise((resolve, reject) => { + execFile( + T3_BIN, + ['auth', 'pairing', 'create', '--base-dir', BASE_DIR, '--ttl', '5m', '--json'], + { env: childEnv, timeout: 15000 }, + (err, stdout) => { + if (err) return reject(err); + try { + const cred = JSON.parse(stdout).credential; + cred ? resolve(cred) : reject(new Error('no credential in pairing output')); + } catch (e) { + reject(e); + } + }, + ); + }); + +const exchange = (credential) => + new Promise((resolve, reject) => { + const body = JSON.stringify({ credential }); + const r = http.request( + { + host: UPSTREAM_HOST, + port: UPSTREAM_PORT, + path: '/api/auth/browser-session', + method: 'POST', + headers: { 'content-type': 'application/json', 'content-length': Buffer.byteLength(body) }, + }, + (resp) => { + const setCookie = resp.headers['set-cookie'] || []; + resp.resume(); + resp.on('end', () => + resp.statusCode === 200 && setCookie.length + ? 
resolve(setCookie) + : reject(new Error('browser-session exchange returned ' + resp.statusCode)), + ); + }, + ); + r.on('error', reject); + r.write(body); + r.end(); + }); + +const proxyHttp = (req, res) => { + const up = http.request( + { host: UPSTREAM_HOST, port: UPSTREAM_PORT, path: req.url, method: req.method, headers: req.headers }, + (r) => { + res.writeHead(r.statusCode, r.headers); + r.pipe(res); + }, + ); + up.on('error', () => { + if (!res.headersSent) res.writeHead(502); + res.end('bad gateway'); + }); + req.pipe(up); +}; + +const server = http.createServer(async (req, res) => { + if (req.url === '/healthz') { + res.writeHead(200); + return res.end('ok'); + } + if (!hasSession(req) && isDocNav(req)) { + try { + const cred = await mintCredential(); + const setCookie = await exchange(cred); + res.writeHead(302, { location: req.url || '/', 'set-cookie': setCookie, 'cache-control': 'no-store' }); + return res.end(); + } catch (err) { + // Fall through to a plain proxy; the cockpit's own /pair screen is the + // fallback if auto-pair ever fails, so we never hard-fail the request. + console.error('auto-pair failed, proxying through:', err.message); + } + } + proxyHttp(req, res); +}); + +// WebSocket / Upgrade passthrough — the cockpit's live orchestration stream and +// terminals need this. Reconstruct the upgrade request and splice the sockets. 
+server.on('upgrade', (req, socket, head) => { + const up = net.connect(UPSTREAM_PORT, UPSTREAM_HOST, () => { + up.write( + `${req.method} ${req.url} HTTP/1.1\r\n` + + Object.entries(req.headers) + .map(([k, v]) => `${k}: ${v}`) + .join('\r\n') + + '\r\n\r\n', + ); + if (head && head.length) up.write(head); + socket.pipe(up); + up.pipe(socket); + }); + up.on('error', () => socket.destroy()); + socket.on('error', () => up.destroy()); +}); + +server.listen(LISTEN_PORT, '0.0.0.0', () => + console.log(`t3-afk dispatcher listening on :${LISTEN_PORT} -> ${UPSTREAM_HOST}:${UPSTREAM_PORT}`), +); diff --git a/stacks/t3-afk/main.tf b/stacks/t3-afk/main.tf index a56cffde..063e42ad 100644 --- a/stacks/t3-afk/main.tf +++ b/stacks/t3-afk/main.tf @@ -107,6 +107,21 @@ resource "kubernetes_config_map" "agent_claudemd" { } } +# Auto-pair dispatcher script (run by the sidecar container below). Mirrors the +# devvm t3-dispatch: on a cookieless, Authentik-gated page load it mints a +# pairing credential and exchanges it for the t3_session cookie, so the user +# never sees the manual /pair screen. Reverse-proxies everything else (incl. +# WebSockets) to t3 serve. +resource "kubernetes_config_map" "dispatcher" { + metadata { + name = "t3-afk-dispatcher" + namespace = kubernetes_namespace.t3_afk.metadata[0].name + } + data = { + "dispatcher.js" = file("${path.module}/files/dispatcher.js") + } +} + # --- Storage --- # SSD-NFS (small-file friendly) for the T3 base dir: state.sqlite + the # server-signing-key (losing it invalidates every issued bearer), per-thread git @@ -300,6 +315,43 @@ resource "kubernetes_deployment" "t3_afk" { } } + # Auto-pair dispatcher (sidecar). The Service points at this (:8080); it + # reverse-proxies to t3 serve (:3773) and injects the session cookie so + # the browser experience matches t3.viktorbarzin.me. Shares /data so it + # can exec the t3 CLI to mint pairing credentials. 
+ container { + name = "dispatcher" + image = local.image + command = ["node", "/scripts/dispatcher.js"] + port { + container_port = 8080 + } + env { + name = "HOME" + value = "/home/node" + } + readiness_probe { + http_get { + path = "/healthz" + port = 8080 + } + initial_delay_seconds = 10 + period_seconds = 10 + } + volume_mount { + name = "data" + mount_path = "/data" + } + volume_mount { + name = "dispatcher" + mount_path = "/scripts" + } + resources { + requests = { cpu = "50m", memory = "64Mi" } + limits = { memory = "256Mi" } + } + } + volume { name = "data" persistent_volume_claim { @@ -313,6 +365,13 @@ resource "kubernetes_deployment" "t3_afk" { name = kubernetes_config_map.agent_claudemd.metadata[0].name } } + + volume { + name = "dispatcher" + config_map { + name = kubernetes_config_map.dispatcher.metadata[0].name + } + } } } } @@ -339,9 +398,11 @@ resource "kubernetes_service" "t3_afk" { } spec { selector = local.labels + # Route to the auto-pair dispatcher sidecar (:8080), which reverse-proxies + # to t3 serve (:3773) after injecting the t3_session cookie. port { port = 3773 - target_port = 3773 + target_port = 8080 } type = "ClusterIP" } From 5d3a166b9403fbe6b3af7279d8c37c45979b7779 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Mon, 15 Jun 2026 20:49:34 +0000 Subject: [PATCH 32/36] =?UTF-8?q?t3-afk:=20fix=20agent=20Bash=20=E2=80=94?= =?UTF-8?q?=20stop=20mounting=20into=20~/.claude?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Root cause of "the agent never commits": the issue-implementer CLAUDE.md was subPath-mounted at /home/node/.claude/CLAUDE.md, which made /home/node/.claude root-owned. The agent (uid 1000) then couldn't create its Bash session-env there, so EVERY Bash/git call failed (Write/Edit worked, so it silently edited but never committed). Found by reading the agent transcripts from state.sqlite -> projection_thread_messages. 
Fix: don't mount anything into ~/.claude (it's not honored by T3's SDK anyway). Behaviour is injected via the dispatch message preamble by the control plane; files/issue-implementer-CLAUDE.md kept as the canonical source text. Verified post-fix: a preamble-dispatched task edited README and COMMITTED (073ab28) unattended. Co-Authored-By: Claude Opus 4.8 --- stacks/t3-afk/main.tf | 38 ++++++++++++-------------------------- 1 file changed, 12 insertions(+), 26 deletions(-) diff --git a/stacks/t3-afk/main.tf b/stacks/t3-afk/main.tf index 063e42ad..f545271c 100644 --- a/stacks/t3-afk/main.tf +++ b/stacks/t3-afk/main.tf @@ -93,19 +93,11 @@ resource "kubernetes_manifest" "external_secret" { depends_on = [kubernetes_namespace.t3_afk] } -# issue-implementer behaviour. T3 hardcodes the claude_code system-prompt preset -# (no API override), but loads settingSources [user,project,local] — so the -# agent's standing instructions ride in the USER-level ~/.claude/CLAUDE.md, while -# each target repo's own CLAUDE.md provides project context. ADR 0003. -resource "kubernetes_config_map" "agent_claudemd" { - metadata { - name = "issue-implementer-claudemd" - namespace = kubernetes_namespace.t3_afk.metadata[0].name - } - data = { - "CLAUDE.md" = file("${path.module}/files/issue-implementer-CLAUDE.md") - } -} +# issue-implementer behaviour is intentionally NOT mounted as ~/.claude/CLAUDE.md: +# T3's SDK invocation doesn't honor it, and mounting a subPath into ~/.claude +# makes that dir root-owned and breaks the agent's Bash session-env. The control +# plane injects the behaviour as a dispatch message preamble instead; +# files/issue-implementer-CLAUDE.md is kept as the canonical source for that text. # Auto-pair dispatcher script (run by the sidecar container below). 
Mirrors the # devvm t3-dispatch: on a cookieless, Authentik-gated page load it mints a @@ -290,12 +282,13 @@ resource "kubernetes_deployment" "t3_afk" { name = "data" mount_path = "/data" } - # User-level agent instructions (settingSources: user). - volume_mount { - name = "agent-claudemd" - mount_path = "/home/node/.claude/CLAUDE.md" - sub_path = "CLAUDE.md" - } + # NOTE: do NOT mount anything into /home/node/.claude — a subPath + # mount makes that dir root-owned, which blocks the agent (uid 1000) + # from creating its Bash session-env there and breaks ALL Bash/git for + # the agent (root cause of the 2026-06-15 "agent never commits"). T3's + # SDK invocation doesn't honor ~/.claude/CLAUDE.md anyway, so the + # issue-implementer behaviour is injected via the dispatch message + # preamble by the control plane instead. # Burstable (tier-aux). A live agent thread (node + claude) is memory # heavy; size for a small number of concurrent threads on this pilot @@ -359,13 +352,6 @@ resource "kubernetes_deployment" "t3_afk" { } } - volume { - name = "agent-claudemd" - config_map { - name = kubernetes_config_map.agent_claudemd.metadata[0].name - } - } - volume { name = "dispatcher" config_map { From cf51cb45de3c5d0df6c528bdf55d352c4e0e24c6 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Mon, 15 Jun 2026 21:32:28 +0000 Subject: [PATCH 33/36] docs(adr-0003): keep Forgejo canonical, complete the GitHub mirror (reject swap) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Grilled the 'swap Forgejo for GitHub' idea. Root cause of the divergence pain is an incomplete push-mirror rollout (14 repos dual-pushed, push_mirrors=0), not Forgejo itself — and CONTEXT.md already documents Forgejo-canonical + one-way GitHub mirror. Decision: don't swap; finish the mirror, name the GitHub-first exceptions, reconcile infra, enforce one-remote-per-clone. 
Adds ADR-0003 + the GitHub-first repo glossary term + dual-push/force-overwrite warnings on Canonical repo / GitHub mirror. Co-Authored-By: Claude Fable 5 --- CONTEXT.md | 10 +++++-- ...-keep-forgejo-canonical-complete-mirror.md | 30 +++++++++++++++++++ 2 files changed, 37 insertions(+), 3 deletions(-) create mode 100644 docs/adr/0003-keep-forgejo-canonical-complete-mirror.md diff --git a/CONTEXT.md b/CONTEXT.md index 84c897c2..d700f9ab 100644 --- a/CONTEXT.md +++ b/CONTEXT.md @@ -173,13 +173,17 @@ The split where every owned image is built+pushed by GitHub Actions and Woodpeck _Avoid_: bare "Woodpecker pipeline" — say "build" or "deploy"; "fallback build" (the in-cluster fallback path was removed by ADR-0002). **Canonical repo**: -The Forgejo `viktor/` repo — the only place commits land, workflow files included. -_Avoid_: "upstream" (ambiguous); committing anywhere else. +The Forgejo `viktor/` repo — the only place commits land, workflow files included. Every first-party repo is Forgejo-canonical *except* an explicit set of **GitHub-first repos**. A clone keeps **only** the canonical remote (ADR-0003): the **GitHub mirror** is not a second push target. +_Avoid_: "upstream" (ambiguous); committing anywhere else; keeping both remotes on a clone and hand-pushing to each (the dual-push habit that caused the 2026-06 divergence — ADR-0003). **GitHub mirror**: -The GitHub repo a **Canonical repo** push-mirrors to, one-way, so GitHub Actions can build from it; anything committed on the mirror is silently overwritten by the next sync. +The GitHub repo a **Canonical repo** push-mirrors to, one-way (Forgejo's `push_mirrors`, `sync_on_commit`), so GitHub Actions can build from it; anything committed on the mirror is silently overwritten by the next sync — and enabling the mirror **force-overwrites** the GitHub side, so a diverged GitHub-only commit must be merged back into Forgejo *before* the mirror is turned on or it is lost. 
_Avoid_: treating it as a second writable remote; bare "the GitHub repo" without saying mirror. +**GitHub-first repo**: +The deliberate exception to the **Canonical repo** rule — a repo whose canonical home is GitHub, so it sits outside the mirror policy. Two kinds: third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`), and a first-party repo intentionally kept public on GitHub (`health`). Single GitHub remote, never dual-pushed. +_Avoid_: adding a Forgejo remote "for consistency"; treating one as a **Canonical repo**. + **Forgejo registry**: Forgejo's built-in container registry — since ADR-0002 a frozen archive holding one last-known-good tag per **Service**, not a build target; owned images live on ghcr.io. _Avoid_: "private registry" (collides with the registry VM's pull-through caches); pushing new images to it. diff --git a/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md b/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md new file mode 100644 index 00000000..9e0e2192 --- /dev/null +++ b/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md @@ -0,0 +1,30 @@ +# Keep Forgejo as the canonical forge; complete the one-way GitHub mirror instead of swapping to GitHub + +Status: accepted (extends ADR-0002) + +## Context + +Repo trees kept diverging between the Forgejo **Canonical repo** (`viktor/`) and its **GitHub mirror**. A 2026-06-15 audit found the cause: an *incomplete rollout* of the Forgejo→GitHub push-mirror, not anything inherent to Forgejo. 14 repos carry **both** remotes and are hand-pushed to each (`push_mirrors = 0` on Forgejo — e.g. `infra`, `finance`, `Website`), so a human forgets one side and the trees drift; the ADR-0002-onboarded repos have a working one-way mirror (`push_mirrors = 1` — e.g. `tripit`, `recruiter-responder`) and never diverge. 
`infra/CONTEXT.md` already says Forgejo is the only place commits land and the GitHub mirror must never be a second writable remote — practice had simply drifted from the documented model. + +The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling reframed it: the pain (divergence) is a "two writable remotes" problem, and the stated preference is self-hosted-primary with the remote as backup. + +## Decision + +Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`: + +- Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.** +- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`. +- `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge. +- Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror." + +## Considered options + +- **Swap to GitHub (retire Forgejo).** Rejected: takes on a hard WAN dependency for *all* git ops — including `infra`, the repo you use to *recover* from outages — plus git-crypt secrets on GitHub as primary, a Woodpecker forge migration (WP authenticates against and watches Forgejo), and GitHub private-repo CI-minute/size limits. 
All to fix a problem that is actually an incomplete mirror, not Forgejo's existence. Contradicts the self-hosted-primary preference. +- **GitHub canonical, Forgejo demoted to a DR pull-mirror.** Rejected for the same WAN-dependency and forge-migration cost; unnecessary once the real cause is understood. + +## Consequences + +- Divergence becomes structurally impossible — one push target per repo. +- Forgejo stays load-bearing (canonical git + the Woodpecker forge), so every cost of the swap is avoided. +- The GitHub-limits worry is neutralized: private code lives on Forgejo (unlimited, self-hosted); GitHub holds mirrors for CI + backup. (GitHub Free has unlimited private repos anyway; the real limits are GHA minutes and ~1 GB repo size — `travel_blog` at 1.4 GB is why it never went to GHA.) +- One-time remediation is required and carries a data-loss footgun: the Forgejo→GitHub mirror **force-overwrites** GitHub, so for each currently-diverged repo, any GitHub-only commits must be merged into Forgejo **before** the mirror is enabled, or they are lost. Scope: the 14 dual-push repos + the `infra` reconciliation; all other repos are already single-remote and non-diverging. From cbca281aaa0856199c1cc74c252696b60e674feb Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Mon, 15 Jun 2026 21:34:37 +0000 Subject: [PATCH 34/36] feat(authentik): TripIt external self-signup group + forward-auth fence (ADR-0020) Viktor wants people outside the homelab to self-register to TripIt with email + a passkey (no password), kept separate from the rest of the homelab. Adds the empty, parentless 'TripIt External' Authentik group and a first-position branch in the catch-all policy that admits those users to tripit.viktorbarzin.me only and denies every other forward-auth host. Inert on apply (group empty => matches no existing user => no lockout). 
An adversarial review found the fence is forward-auth-only, so the runbook records the OIDC-app containment audit (every sensitive app already requires a trusted group External users won't hold), the Vault->Allow Login Users binding that closes the one open OIDC app, the SMTP prerequisite for email verification, and the before/after access-matrix verification. Flows/SMTP/Vault binding are UI steps per the runbook; the push that applies the catch-all edit must be human-watched (CI auto-applies the authentik stack). Co-Authored-By: Claude Opus 4.8 --- docs/architecture/authentication.md | 25 ++ docs/runbooks/tripit-external-signup.md | 226 ++++++++++++++++++ .../authentik/admin-services-restriction.tf | 15 ++ stacks/authentik/tripit-external.tf | 22 ++ 4 files changed, 288 insertions(+) create mode 100644 docs/runbooks/tripit-external-signup.md create mode 100644 stacks/authentik/tripit-external.tf diff --git a/docs/architecture/authentication.md b/docs/architecture/authentication.md index 9decc8dc..8de844de 100644 --- a/docs/architecture/authentication.md +++ b/docs/architecture/authentication.md @@ -108,6 +108,31 @@ All new users must use an invitation link to register. The invitation-enrollment Group membership is auto-assigned from the invitation's `fixed_data` field. This prevents open registration while maintaining SSO convenience. +### TripIt External self-signup (open enrollment, fenced) + +Unlike every other app, **TripIt allows open public self-signup** for people +outside the homelab (ADR-0020 in the tripit repo; runbook +`docs/runbooks/tripit-external-signup.md`). A dedicated public `tripit-enrollment` +flow (email + passkey, no password) creates the account and stamps it into the +parentless **`TripIt External`** group. Containment is two-layered: + +- **Forward-auth apps**: a branch prepended to the `admin-services-restriction` + catch-all policy admits `TripIt External` to `tripit.viktorbarzin.me` only and + denies every other `auth="required"` host. 
+- **OIDC apps**: that branch does NOT cover OIDC (OIDC bypasses forward-auth). + External users are contained because every sensitive OIDC app already requires a + trusted group they do not hold — audited 2026-06-15: + Immich/Grafana/Linkwarden/Cloudflare Access → `Home Server Admins`, Forgejo → + `Task Submitters`/`Forgejo Users`, Headscale → `Headscale Users`, wrongmove → + `Wrongmove Users`. **Vault** was OPEN (any OIDC identity got a powerless + `default`-policy token) and is bound to **`Allow Login Users`** as part of this + change. The Kubernetes OIDC clients are OPEN but idle (apiserver rejects OIDC). + +**Invariants**: keep `TripIt External` parentless (never under `Allow Login +Users`); keep the catch-all branch first; never co-assign `TripIt External` to a +trusted/internal user; the `tripit-enrollment` user_write "Create users group" +setting is the keystone that tags every signup. + ### OIDC Applications Authentik provides OIDC for 10 applications: diff --git a/docs/runbooks/tripit-external-signup.md b/docs/runbooks/tripit-external-signup.md new file mode 100644 index 00000000..0172c9b1 --- /dev/null +++ b/docs/runbooks/tripit-external-signup.md @@ -0,0 +1,226 @@ +# Runbook — TripIt external user self-signup (email + passkey) + +Implements ADR-0020 (tripit repo): people outside the homelab self-register to +TripIt with **email + a passkey** (no password), are auto-tagged into the +**`TripIt External`** Authentik group, and are fenced to `tripit.viktorbarzin.me` +only. Audience: people Viktor knows; open public registration. + +> **Safety model.** Containment is two-layered. (1) **Forward-auth apps** — the +> branch in `stacks/authentik/admin-services-restriction.tf` admits `TripIt +> External` to `tripit.viktorbarzin.me` and denies every other `auth="required"` +> host. 
(2) **OIDC apps** — the branch does NOT cover OIDC (it bypasses +> forward-auth); External users are contained because every sensitive OIDC app +> already requires a trusted group they do not hold (audit below). The no-lockout +> guarantee is that the group is created **empty**, so the new branch matches +> zero existing users on day one. + +## OIDC app authorization audit (2026-06-15, read-only) + +A parentless `TripIt External` user holds NONE of these groups, so: + +| OIDC app | Requires | External user | +|---|---|---| +| Immich, Grafana, Linkwarden, Cloudflare Access | `Home Server Admins` | DENIED ✓ | +| Forgejo | `Task Submitters` / `Forgejo Users` | DENIED ✓ | +| Headscale | `Headscale Users` | DENIED ✓ | +| wrongmove | `Wrongmove Users` | DENIED ✓ | +| **Vault** | **was OPEN** → bound to `Allow Login Users` in Step 3 | DENIED after Step 3 | +| Kubernetes, Kubernetes Dashboard | OPEN | harmless — apiserver rejects OIDC tokens (idle) | +| TripIt App, Public | OPEN | by design (TripIt's own provider / guest) | + +Vault's JWT `default` role grants only Vault's built-in `default` policy (token +self-management, cubbyhole — **no** secret access), so the pre-fix exposure was a +near-powerless token; Step 3 closes it anyway. + +--- + +## Pre-flight gates (STOP if any fails) + +1. **`TripIt External` is net-new / empty** (no-lockout precondition): + ``` + kubectl -n authentik exec -i deploy/goauthentik-server -- ak shell <<'PY' + from authentik.core.models import Group + g = Group.objects.filter(name="TripIt External").first() + print("exists:", bool(g), "members:", g.users.count() if g else 0) + PY + ``` + Expect `exists: False`. If it exists with members → STOP. +2. 
**Authentik image pin matches live (B5)** — the policy edit auto-applies the + whole `authentik` stack; a stale pin re-triggers the 2026-06-10 downgrade + boot-storm: + ``` + kubectl -n authentik get deploy -o custom-columns=N:.metadata.name,IMG:.spec.template.spec.containers[0].image + ``` + Every `goauthentik`/`ak-outpost` image tag MUST equal + `stacks/authentik/modules/authentik/values.yaml` `global.image.tag` + (currently `2026.2.4`). If they differ → refresh the pin first. + +--- + +## Step 1 — Terraform (group + fence branch) + +Already written on this branch: +- `stacks/authentik/tripit-external.tf` — the empty, parentless group. +- `stacks/authentik/admin-services-restriction.tf` — the prepended fence branch. + +**Local plan gate (B4 — CI auto-applies on push with `-auto-approve`, so there is +NO human plan review in the apply path; do it here):** +``` +vault login -method=oidc +cd stacks/authentik && ../../scripts/tg plan +``` +Confirm the plan is **exactly**: +- `+ authentik_group.tripit_external` (create) +- `~ authentik_policy_expression.admin_services_restriction` (update in place — the + `expression` body gains ONLY the new branch; every other line byte-identical) +- **`Plan: 1 to add, 1 to change, 0 to destroy.`** + +ABORT if the plan shows any destroy/replace, any `authentik_provider_*` / +`authentik_outpost` / `authentik_flow*` / `helm_release`, or any other expression +change. + +**Apply** (presence-claim courtesy, then push = apply; land human-watched, B5): +``` +~/code/scripts/presence claim stack:authentik --purpose "ADR-0020 TripIt External group + fence branch" +# push the branch to master (this triggers CI tg apply on the authentik stack) +``` +Watch: GHA → Woodpecker `default.yml` apply → outpost stays healthy +(`kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost` = 2 +IPs; an anonymous request to any `auth=required` host still 302s to Authentik). +The branch is inert (empty group) so no access changes yet. 
+ +--- + +## Step 2 — Authentik SMTP (B1, BLOCKER before any flow) + +Email verification is the **entire identity boundary** (TripIt trusts the +Authentik email verbatim). Authentik currently has the **default/unconfigured** +transport (`email.host = localhost`), so verification/recovery mail cannot send. + +Add to **both** `server.env` and `worker.env` in +`stacks/authentik/modules/authentik/values.yaml` (wire the password from a secret; +the cluster mailserver is what TripIt already relays through — +`mailserver.mailserver.svc`): +```yaml + - { name: AUTHENTIK_EMAIL__HOST, value: "mailserver.mailserver.svc" } + - { name: AUTHENTIK_EMAIL__PORT, value: "587" } + - { name: AUTHENTIK_EMAIL__USE_TLS, value: "true" } + - { name: AUTHENTIK_EMAIL__FROM, value: "noreply@viktorbarzin.me" } + - { name: AUTHENTIK_EMAIL__USERNAME, value: "" } # confirm relay creds + - { name: AUTHENTIK_EMAIL__PASSWORD, valueFrom: { secretKeyRef: { name: , key: } } } +``` +**Gate:** after apply, Authentik UI → System → Settings (or an Email stage) → +**Send test email**; it must arrive. Then prove enrollment cannot complete for an +address you do NOT control. + +--- + +## Step 3 — Bind Vault → `Allow Login Users` (close the one open OIDC gap) + +Authentik UI → Applications → **Vault** → bind an authorization policy requiring +group **`Allow Login Users`** (the base group every real homelab user inherits; +parentless `TripIt External` is excluded). This changes nothing for existing +users and denies External users at the Vault consent step. +Verify: an External test account (Step 6) cannot complete Vault OIDC login. + +--- + +## Step 4 — Build the flows (Authentik UI; UI-managed per ADR split) + +All three flows: designation as noted, no password stage. 
+ +**Flow `tripit-enrollment`** (Enrollment): +| Order | Stage | Key settings | +|---|---|---| +| 5 | Captcha | reCAPTCHA **v2 checkbox** keys (v3/invisible fail — see `crowdsec-recaptcha-key-type`) | +| 10 | Identification | email only; **no** `password_stage`; `sources` optional | +| 20 | Email (verification) | activate, blocking — **before** user_write | +| 30 | WebAuthn authenticator setup | `user_verification = required`, `resident_key = required` | +| 40 | User Write | **`create_users_group` = `TripIt External`** (the keystone tag); `user_type = external` | +| 50 | User Login | session as default (`weeks=4`) | + +**Flow `tripit-login`** (Authentication, passwordless): +Identification (sets `enrollment_flow`/`recovery_flow`) → Authenticator +Validation (`device_classes = [webauthn]`, `user_verification = required`) → User +Login. Prefer routing a passkey-less email to recovery over minting a credential. + +**Flow `tripit-recovery`** (Recovery): +Identification (`pretend_user_exists = on`) → Email (recovery link) → WebAuthn +authenticator setup → User Login. Notify the account on recovery + new-passkey. + +> Do **NOT** bind the `brute-force-protection` ReputationPolicy to these flows — +> it denies anonymous users (2026-04-06 regression). The Captcha is the bot gate. + +--- + +## Step 5 — Surface "Sign up" + +Recommended: a **TripIt-scoped** signup link / share-invite rather than a global +login-screen button (narrower bot surface). Enrollment URL: +`https://authentik.viktorbarzin.me/if/flow/tripit-enrollment/`. + +--- + +## Step 6 — Verification (before/after — "all access keeps working") + +Hosts for the matrix (must be real `auth="required"` default-allow hosts, NOT +`auth="app"` apps like immich/nextcloud which bypass the catch-all): +`tripit`, `family`, `hackmd`, `health` (default-allow) + `terminal` (admin-only). 
+ +**Before** (capture per user, no redirect-follow; 200=ALLOW, 302→authentik/403=DENY): +``` +COOKIE='authentik_session='; for H in tripit family hackmd health terminal; do + printf '%-10s %s\n' "$H" "$(curl -s -o /dev/null -w '%{http_code}' --max-redirs 0 -H "Cookie: $COOKIE" https://$H.viktorbarzin.me/)"; done +``` +Representative non-admin: `kadir.tugan@gmail.com` (Wrongmove-only) → tripit/family/hackmd/health ALLOW, terminal DENY. Admin `vbarzin@gmail.com` → all ALLOW. + +**After Step 1 apply — regression:** re-run identically; both users' results MUST +be unchanged (diff empty). + +**After flows — external smoke test (the security proof):** enrol a throwaway +account via the enrollment URL (email verify + passkey). Confirm it is tagged +`TripIt External`, then with its cookie: +``` +for H in tripit family hackmd health terminal frigate; do printf '%-10s %s\n' "$H" \ + "$(curl -s -o /dev/null -w '%{http_code}' --max-redirs 0 -H "Cookie: authentik_session=" https://$H.viktorbarzin.me/)"; done +``` +Expect **tripit=200, every other host DENY** (family/hackmd/health were ALLOW for +kadir — the contrast is the fence proof). Then: +- **OIDC containment:** with the external account, attempt OIDC login to Vault, + Immich, Forgejo, Grafana → each must be DENIED at the app's own login. +- **Auto-provision:** the TripIt `users` row exists (CNPG primary in ns `dbaas`: + `select id,email from tripit.users where email=''`). +- **Walling-off guard** `AuthentikWallingOffPublicPath` stays green. + +**Any 200 on a non-tripit host, or any OIDC app admitting the external account → +ROLLBACK.** + +--- + +## Step 7 — Standing regression probe (recommended) + +Add a permanent `TripIt External` identity to the `blackbox-exporter` guard +(`stacks/monitoring/.../authentik_walloff_probe.tf` pattern): assert 200 on +`tripit.viktorbarzin.me` AND DENY on `family.viktorbarzin.me`. 
This converts the +"branch stays first" and "user_write keeps the keystone tag" invariants into +automated `#security` alerts. + +--- + +## Rollback + +Revert the `admin-services-restriction.tf` expression (delete the branch) and push +(= apply); removing a prepended `if g: return …` is behaviour-preserving on +non-members, restoring prior authz. Disable/delete the throwaway external account +(with the branch gone, a tagged account falls into default-allow). The empty group +may stay (harmless). Plan-gate the revert too. + +## Operational invariants + +- `TripIt External` stays **parentless** (never under `Allow Login Users`). +- The fence branch stays **first** in `admin-services-restriction`. +- **Never** co-assign `TripIt External` to a trusted/internal user. +- The `tripit-enrollment` user_write **`create_users_group`** setting is the + keystone — re-verify after any flow edit (clearing it makes UNtagged accounts + that fall into default-allow). +- Authentik SMTP is a live dependency of enrollment + recovery. diff --git a/stacks/authentik/admin-services-restriction.tf b/stacks/authentik/admin-services-restriction.tf index 2dcc1ca2..6bf9ff59 100644 --- a/stacks/authentik/admin-services-restriction.tf +++ b/stacks/authentik/admin-services-restriction.tf @@ -49,6 +49,21 @@ resource "authentik_policy_expression" "admin_services_restriction" { host = request.context.get("host", "") + # TripIt External containment fence (ADR-0020 in the tripit repo). Publicly + # self-enrolled TripIt users (group "TripIt External", assigned by the + # tripit-enrollment flow's user_write) may reach tripit.viktorbarzin.me and + # NOTHING else. MUST be the FIRST host-dispatch branch: it is a request.user + # predicate that must dominate every host branch below, ESPECIALLY the + # default-allow `if host not in ADMIN_ONLY_HOSTS: return True` — placed after + # it, a tagged user would slip into other hosts. 
Safe to add: the group is + # net-new and created EMPTY, so this matches zero existing principals (no + # lockout). The fence is forward-auth ONLY; OIDC apps (Vault, Immich, …) + # contain External users via their own per-app group bindings — see + # docs/runbooks/tripit-external-signup.md. NEVER co-assign "TripIt External" + # to a trusted/internal user (this branch would fence them out of admin hosts). + if ak_is_group_member(request.user, name="TripIt External"): + return host == "tripit.viktorbarzin.me" + # t3 Workstation edge gate: only members of "T3 Users" may reach t3. # Placed BEFORE the ADMIN_ONLY_HOSTS early-return (t3 is intentionally not in # that set — it must not require Home-Server-Admins, just T3 Users membership). diff --git a/stacks/authentik/tripit-external.tf b/stacks/authentik/tripit-external.tf new file mode 100644 index 00000000..abfd249e --- /dev/null +++ b/stacks/authentik/tripit-external.tf @@ -0,0 +1,22 @@ +# "TripIt External" group — containment anchor for publicly self-enrolled TripIt +# users (ADR-0020 in the tripit repo). Members are admitted to +# tripit.viktorbarzin.me ONLY and denied every other *.viktorbarzin.me +# forward-auth host by the prepended branch in admin-services-restriction.tf. +# +# Created EMPTY and PARENTLESS, on purpose: +# * EMPTY — the no-lockout guarantee. Zero members at apply time => the +# prepended policy branch matches zero existing principals => it cannot +# change anyone's authorization (contrast authentik_group "T3 Users", which +# is created WITH members atomically because THAT gate's safety property is +# the opposite). Membership is assigned at RUNTIME by the tripit-enrollment +# flow's user_write "Create users group" option (UI-managed per the ADR +# management split). Terraform owns only the group's EXISTENCE. +# * PARENTLESS — do NOT make this a child of "Allow Login Users". 
The sensitive +# OIDC apps gate on "Home Server Admins" / "Headscale Users" / "Wrongmove +# Users" (children of "Allow Login Users") or, for Vault, on "Allow Login +# Users" itself (bound as part of ADR-0020). Keeping External out of that +# tree is what stops these users reaching OIDC apps — mirrors guest.tf, which +# keeps the guest group out of "Allow Login Users" for the same reason. +resource "authentik_group" "tripit_external" { + name = "TripIt External" +} From aa461b95bc0a52cb2a7dff547af2f93807494676 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Mon, 15 Jun 2026 21:42:30 +0000 Subject: [PATCH 35/36] feat(authentik): bind Vault OIDC app to Allow Login Users (close ADR-0020 OIDC gap) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Audit found the Vault Authentik application had no authorization binding, so any authenticated identity (incl. a future self-enrolled TripIt External user) could complete Vault OIDC login and get a built-in default-policy token. Bind it to 'Allow Login Users' — existing homelab users inherit that group via its children (verified User.all_groups() includes the parent), parentless TripIt External users are excluded. Closes the only OIDC app the forward-auth fence does not cover. Co-Authored-By: Claude Opus 4.8 --- stacks/authentik/vault-authz-binding.tf | 28 +++++++++++++++++++++++++ 1 file changed, 28 insertions(+) create mode 100644 stacks/authentik/vault-authz-binding.tf diff --git a/stacks/authentik/vault-authz-binding.tf b/stacks/authentik/vault-authz-binding.tf new file mode 100644 index 00000000..33c0af6d --- /dev/null +++ b/stacks/authentik/vault-authz-binding.tf @@ -0,0 +1,28 @@ +# Vault OIDC authorization fence (ADR-0020). 
The "Vault" Authentik application had +# NO authorization binding (audit 2026-06-15: any authenticated identity could +# complete Vault OIDC login and receive Vault's built-in `default`-policy token — +# token self-management/cubbyhole, no secret access, but still more than an +# outside user should hold). Bind it to "Allow Login Users" so only established +# homelab users can log in: they inherit that base group via its children +# (Home Server Admins / Headscale Users / Wrongmove Users — verified live that +# `User.all_groups()` includes the parent), while publicly self-enrolled +# "TripIt External" users (deliberately PARENTLESS, so NOT in Allow Login Users) +# are denied at the Vault consent step. Closes the one OIDC app the forward-auth +# fence cannot reach; the other sensitive OIDC apps already bind a trusted group. +# +# The Vault application itself stays UI-managed (like the other OIDC apps); this +# adds ONLY the authorization binding. policy_engine_mode on the app is "any", so +# one group binding == membership in that group is required to authorize. +data "authentik_application" "vault" { + slug = "vault" +} + +data "authentik_group" "allow_login_users" { + name = "Allow Login Users" +} + +resource "authentik_policy_binding" "vault_allow_login_users" { + target = data.authentik_application.vault.uuid + group = data.authentik_group.allow_login_users.id + order = 0 +} From 57d45d8d8ff491d89fae14a7cc054422fadd5b16 Mon Sep 17 00:00:00 2001 From: Viktor Barzin Date: Mon, 15 Jun 2026 22:01:29 +0000 Subject: [PATCH 36/36] fix(authentik): pin Vault binding UUIDs as literals (provider has no authentik_application data source) CI pipeline 198 failed: the pinned goauthentik/authentik provider has no data "authentik_application" source, so terraform failed the whole authentik plan and applied NOTHING (state unchanged). 
Replace the data-source lookups with the live pbm_uuid (Vault app) and group_uuid (Allow Login Users) as literals; authentik_policy_binding is supported (used in guest.tf). Co-Authored-By: Claude Opus 4.8 --- stacks/authentik/vault-authz-binding.tf | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/stacks/authentik/vault-authz-binding.tf b/stacks/authentik/vault-authz-binding.tf index 33c0af6d..619eba2c 100644 --- a/stacks/authentik/vault-authz-binding.tf +++ b/stacks/authentik/vault-authz-binding.tf @@ -13,16 +13,15 @@ # The Vault application itself stays UI-managed (like the other OIDC apps); this # adds ONLY the authorization binding. policy_engine_mode on the app is "any", so # one group binding == membership in that group is required to authorize. -data "authentik_application" "vault" { - slug = "vault" -} - -data "authentik_group" "allow_login_users" { - name = "Allow Login Users" -} - +# +# UUIDs are PINNED as literals: this provider version has NO +# `data "authentik_application"` data source (CI pipeline 198 failed on it), and +# both objects are UI-managed and stable. To re-fetch if either is recreated, run +# `ak shell` in the goauthentik-server pod and read +# `Application.objects.get(name="Vault").pbm_uuid` and +# `Group.objects.get(name="Allow Login Users").group_uuid`. resource "authentik_policy_binding" "vault_allow_login_users" { - target = data.authentik_application.vault.uuid - group = data.authentik_group.allow_login_users.id + target = "fe5698e3-b6b1-4475-98fa-ce2bae22f4dd" # Authentik application "Vault" (pbm_uuid) + group = "b4823cd7-8ed8-4d2f-8f94-bc285138f853" # group "Allow Login Users" (group_uuid) order = 0 }
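
Note on patch 34: the requirement that the `TripIt External` fence branch stay first in `admin-services-restriction` can be illustrated with a minimal sketch. This is not the real policy body. It assumes group membership can be modelled as a plain set of group names (the real expression calls `ak_is_group_member(request.user, ...)`), that `ADMIN_ONLY_HOSTS` contains only `terminal.viktorbarzin.me` (the one admin-only host the runbook names), and that admin-only hosts require `Home Server Admins`. The t3 branch and any other branches are omitted. The function names `fence_first` and `fence_last` are hypothetical.

```python
# Simplified model of the host-dispatch ordering in the catch-all policy.
# Assumptions (not taken from the real expression): groups are a plain set of
# names, and ADMIN_ONLY_HOSTS holds a single illustrative host.
ADMIN_ONLY_HOSTS = {"terminal.viktorbarzin.me"}
FENCE_GROUP = "TripIt External"
ADMIN_GROUP = "Home Server Admins"
TRIPIT_HOST = "tripit.viktorbarzin.me"


def fence_first(groups: set[str], host: str) -> bool:
    """Ordering used by the patch: the fence runs before default-allow."""
    if FENCE_GROUP in groups:
        # A fenced user may reach TripIt and nothing else.
        return host == TRIPIT_HOST
    if host not in ADMIN_ONLY_HOSTS:
        # Default-allow for every non-admin host.
        return True
    return ADMIN_GROUP in groups


def fence_last(groups: set[str], host: str) -> bool:
    """Misordered variant: default-allow returns before the fence is checked."""
    if host not in ADMIN_ONLY_HOSTS:
        return True
    if FENCE_GROUP in groups:
        return host == TRIPIT_HOST
    return ADMIN_GROUP in groups


if __name__ == "__main__":
    external = {FENCE_GROUP}
    for host in ("tripit.viktorbarzin.me", "family.viktorbarzin.me"):
        print(host, fence_first(external, host), fence_last(external, host))
```

In this model the two variants agree for every user outside the group, which matches the no-lockout argument (an empty group changes nobody's result). They differ only for a tagged user on a default-allow host, which is the leak the "keep the branch first" invariant guards against.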