Merge remote-tracking branch 'forgejo/master' into emo/fan-control-ha-actuator

2026-06-16 08:08:27 +00:00 · 5bc3d27d1b
commit 5bc3d27d1b
parent 2cfe338419 57d45d8d8f
42 changed files with 3072 additions and 387 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
--- a/.claude/reference/service-catalog.md
+++ b/.claude/reference/service-catalog.md
@@ -42,12 +42,13 @@
 | webhook_handler | Webhook processing | webhook_handler |
 | tuya-bridge | Smart home bridge | tuya-bridge |
 | android-emulator | Shared Android 16 test emulator (adb 10.0.20.200:5555, noVNC android-emulator.viktorbarzin.lan) | android-emulator |
+| anisette | Self-hosted Apple anisette-data server (Dadoum/anisette-v3-server, digest-pinned) for sideloading the TripIt iOS Shell via SideStore; internal-only http://anisette.viktorbarzin.lan, auth=none, LAN-only, stateless | anisette |
 | dawarich | Location history | dawarich |
 | owntracks | Location tracking | owntracks |
 | nextcloud | File sync/share | nextcloud |
 | calibre | E-book management (may be merged into ebooks stack) | calibre |
 | onlyoffice | Document editing | onlyoffice |
-| f1-stream | F1 streaming (uses chrome-service for hmembeds verifier); source in own repo `viktor/f1-stream` (Forgejo, extracted 2026-06-05), Woodpecker-native build->deploy (repo id 166) | f1-stream |
+| f1-stream | F1 streaming (uses chrome-service for hmembeds verifier); canonical source in own repo `viktor/f1-stream` (Forgejo, extracted 2026-06-05); GHA-built → `ghcr.io/viktorbarzin/f1-stream` (private), Woodpecker deploy-only (ADR-0002) | f1-stream |
 | chrome-service | Headed Chromium over CDP (`http://chrome-service.chrome-service.svc:9222`, `connect_over_cdp`; legacy `:3000/<token>` WS pool removed 2026-06-04) for sibling services driving anti-bot pages — snapshot-harvester CronJob + tripit fare scrape | chrome-service |
 | rybbit | Analytics | rybbit |
 | isponsorblocktv | SponsorBlock for TV | isponsorblocktv |
--- a/.github/workflows/build-k8s-portal.yml
+++ b/.github/workflows/build-k8s-portal.yml
@@ -0,0 +1,36 @@
+name: Build k8s-portal
+
+# ADR-0002 / no-local-builds: k8s-portal (infra-owned Go portal) builds off-infra
+# on GHA → public ghcr; Keel polls ghcr:latest and rolls the deployment. Replaces
+# the in-cluster .woodpecker/k8s-portal.yml build.
+on:
+  push:
+    branches: [master]
+    paths:
+      - 'stacks/k8s-portal/modules/k8s-portal/files/**'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+  packages: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/login-action@v3
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - uses: docker/build-push-action@v6
+        with:
+          context: stacks/k8s-portal/modules/k8s-portal/files
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/k8s-portal:latest
+            ghcr.io/viktorbarzin/k8s-portal:${{ github.sha }}
--- a/.mcp.json
+++ b/.mcp.json
@@ -3,10 +3,6 @@
    "ha": {
      "type": "http",
      "url": "${HA_MCP_URL}"
-    },
-    "paperless": {
-      "type": "http",
-      "url": "http://paperless-mcp.paperless-mcp.svc.cluster.local/mcp"
    }
  }
 }
--- a/.woodpecker/k8s-portal.yml
+++ b/.woodpecker/k8s-portal.yml
@@ -1,49 +0,0 @@
-when:
-  event: push
-  branch: master
-  path:
-    include:
-      - "stacks/platform/modules/k8s-portal/files/**"
-
-clone:
-  git:
-    image: woodpeckerci/plugin-git
-    settings:
-      attempts: 5
-      backoff: 10s
-
-steps:
-  - name: build-and-push
-    image: woodpeckerci/plugin-docker-buildx
-    settings:
-      username: "viktorbarzin"
-      password:
-        from_secret: dockerhub-pat
-      repo: viktorbarzin/k8s-portal
-      dockerfile: stacks/platform/modules/k8s-portal/files/Dockerfile
-      context: stacks/platform/modules/k8s-portal/files
-      platforms:
-        - linux/amd64
-      tag: ["${CI_PIPELINE_NUMBER}", "latest"]
-      cache_from: "viktorbarzin/k8s-portal:latest"
-      cache_to: "type=inline"
-
-  - name: deploy
-    image: bitnami/kubectl:latest
-    commands:
-      - "kubectl set image deployment/k8s-portal portal=viktorbarzin/k8s-portal:${CI_PIPELINE_NUMBER} -n k8s-portal"
-      - "kubectl rollout status deployment/k8s-portal -n k8s-portal --timeout=120s"
-      - "echo 'k8s-portal deployed successfully (build ${CI_PIPELINE_NUMBER})'"
-
-  - name: slack
-    image: curlimages/curl
-    commands:
-      - |
-        curl -s -X POST -H 'Content-type: application/json' \
-          --data "{\"text\":\"K8s Portal: build #${CI_PIPELINE_NUMBER} ${CI_PIPELINE_STATUS}\"}" \
-          "$SLACK_WEBHOOK" || true
-    environment:
-      SLACK_WEBHOOK:
-        from_secret: slack_webhook
-    when:
-      status: [success, failure]
--- a/CONTEXT.md
+++ b/CONTEXT.md
@@ -173,13 +173,17 @@ The split where every owned image is built+pushed by GitHub Actions and Woodpeck
 _Avoid_: bare "Woodpecker pipeline" — say "build" or "deploy"; "fallback build" (the in-cluster fallback path was removed by ADR-0002).

 **Canonical repo**:
-The Forgejo `viktor/<name>` repo — the only place commits land, workflow files included.
-_Avoid_: "upstream" (ambiguous); committing anywhere else.
+The Forgejo `viktor/<name>` repo — the only place commits land, workflow files included. Every first-party repo is Forgejo-canonical *except* an explicit set of **GitHub-first repos**. A clone keeps **only** the canonical remote (ADR-0003): the **GitHub mirror** is not a second push target.
+_Avoid_: "upstream" (ambiguous); committing anywhere else; keeping both remotes on a clone and hand-pushing to each (the dual-push habit that caused the 2026-06 divergence — ADR-0003).

 **GitHub mirror**:
-The GitHub repo a **Canonical repo** push-mirrors to, one-way, so GitHub Actions can build from it; anything committed on the mirror is silently overwritten by the next sync.
+The GitHub repo a **Canonical repo** push-mirrors to, one-way (Forgejo's `push_mirrors`, `sync_on_commit`), so GitHub Actions can build from it; anything committed on the mirror is silently overwritten by the next sync — and enabling the mirror **force-overwrites** the GitHub side, so a diverged GitHub-only commit must be merged back into Forgejo *before* the mirror is turned on or it is lost.
 _Avoid_: treating it as a second writable remote; bare "the GitHub repo" without saying mirror.

+**GitHub-first repo**:
+The deliberate exception to the **Canonical repo** rule — a repo whose canonical home is GitHub, so it sits outside the mirror policy. Two kinds: third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`), and a first-party repo intentionally kept public on GitHub (`health`). Single GitHub remote, never dual-pushed.
+_Avoid_: adding a Forgejo remote "for consistency"; treating one as a **Canonical repo**.
+
 **Forgejo registry**:
 Forgejo's built-in container registry — since ADR-0002 a frozen archive holding one last-known-good tag per **Service**, not a build target; owned images live on ghcr.io.
 _Avoid_: "private registry" (collides with the registry VM's pull-through caches); pushing new images to it.
--- a/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md
+++ b/docs/adr/0003-keep-forgejo-canonical-complete-mirror.md
@@ -0,0 +1,30 @@
+# Keep Forgejo as the canonical forge; complete the one-way GitHub mirror instead of swapping to GitHub
+
+Status: accepted (extends ADR-0002)
+
+## Context
+
+Repo trees kept diverging between the Forgejo **Canonical repo** (`viktor/<name>`) and its **GitHub mirror**. A 2026-06-15 audit found the cause: an *incomplete rollout* of the Forgejo→GitHub push-mirror, not anything inherent to Forgejo. 14 repos carry **both** remotes and are hand-pushed to each (`push_mirrors = 0` on Forgejo — e.g. `infra`, `finance`, `Website`), so a human forgets one side and the trees drift; the ADR-0002-onboarded repos have a working one-way mirror (`push_mirrors = 1` — e.g. `tripit`, `recruiter-responder`) and never diverge. `infra/CONTEXT.md` already says Forgejo is the only place commits land and the GitHub mirror must never be a second writable remote — practice had simply drifted from the documented model.
+
+The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling reframed it: the pain (divergence) is a "two writable remotes" problem, and the stated preference is self-hosted-primary with the remote as backup.
+
+## Decision
+
+Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`:
+
+- Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.**
+- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`.
+- `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge.
+- Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror."
+
+## Considered options
+
+- **Swap to GitHub (retire Forgejo).** Rejected: takes on a hard WAN dependency for *all* git ops — including `infra`, the repo you use to *recover* from outages — plus git-crypt secrets on GitHub as primary, a Woodpecker forge migration (WP authenticates against and watches Forgejo), and GitHub private-repo CI-minute/size limits. All to fix a problem that is actually an incomplete mirror, not Forgejo's existence. Contradicts the self-hosted-primary preference.
+- **GitHub canonical, Forgejo demoted to a DR pull-mirror.** Rejected for the same WAN-dependency and forge-migration cost; unnecessary once the real cause is understood.
+
+## Consequences
+
+- Divergence becomes structurally impossible — one push target per repo.
+- Forgejo stays load-bearing (canonical git + the Woodpecker forge), so every cost of the swap is avoided.
+- The GitHub-limits worry is neutralized: private code lives on Forgejo (unlimited, self-hosted); GitHub holds mirrors for CI + backup. (GitHub Free has unlimited private repos anyway; the real limits are GHA minutes and ~1 GB repo size — `travel_blog` at 1.4 GB is why it never went to GHA.)
+- One-time remediation is required and carries a data-loss footgun: the Forgejo→GitHub mirror **force-overwrites** GitHub, so for each currently-diverged repo, any GitHub-only commits must be merged into Forgejo **before** the mirror is enabled, or they are lost. Scope: the 14 dual-push repos + the `infra` reconciliation; all other repos are already single-remote and non-diverging.
--- a/docs/architecture/authentication.md
+++ b/docs/architecture/authentication.md
@@ -108,6 +108,31 @@ All new users must use an invitation link to register. The invitation-enrollment

 Group membership is auto-assigned from the invitation's `fixed_data` field. This prevents open registration while maintaining SSO convenience.

+### TripIt External self-signup (open enrollment, fenced)
+
+Unlike every other app, **TripIt allows open public self-signup** for people
+outside the homelab (ADR-0020 in the tripit repo; runbook
+`docs/runbooks/tripit-external-signup.md`). A dedicated public `tripit-enrollment`
+flow (email + passkey, no password) creates the account and stamps it into the
+parentless **`TripIt External`** group. Containment is two-layered:
+
+- **Forward-auth apps**: a branch prepended to the `admin-services-restriction`
+  catch-all policy admits `TripIt External` to `tripit.viktorbarzin.me` only and
+  denies every other `auth="required"` host.
+- **OIDC apps**: that branch does NOT cover OIDC (OIDC bypasses forward-auth).
+  External users are contained because every sensitive OIDC app already requires a
+  trusted group they do not hold — audited 2026-06-15:
+  Immich/Grafana/Linkwarden/Cloudflare Access → `Home Server Admins`, Forgejo →
+  `Task Submitters`/`Forgejo Users`, Headscale → `Headscale Users`, wrongmove →
+  `Wrongmove Users`. **Vault** was OPEN (any OIDC identity got a powerless
+  `default`-policy token) and is bound to **`Allow Login Users`** as part of this
+  change. The Kubernetes OIDC clients are OPEN but idle (apiserver rejects OIDC).
+
+**Invariants**: keep `TripIt External` parentless (never under `Allow Login
+Users`); keep the catch-all branch first; never co-assign `TripIt External` to a
+trusted/internal user; the `tripit-enrollment` user_write "Create users group"
+setting is the keystone that tags every signup.
+
 ### OIDC Applications

 Authentik provides OIDC for 10 applications:
--- a/docs/architecture/ci-cd.md
+++ b/docs/architecture/ci-cd.md
@@ -2,334 +2,378 @@

 ## Overview

-The CI/CD pipeline uses a hybrid approach: GitHub Actions for building Docker images (providing free compute for public repos) and Woodpecker CI for deployments (leveraging cluster-internal access). Git pushes trigger GHA builds that produce Docker images with 8-character SHA tags, push to DockerHub, then POST to Woodpecker's API to trigger deployments that update Kubernetes workloads via `kubectl set image`.
+**Doctrine (ADR-0002): all image builds and CI compute run OFF-infra.** Every
+owned image is built, tested, and linted on **GitHub Actions** (free on public
+repos; 2000 free min/mo on private) and pushed to **`ghcr.io/viktorbarzin/<name>`**.
+Woodpecker is **deploy-only** — a GHA job POSTs its API with the freshly-built
+image tag and Woodpecker runs `kubectl set image` from inside the cluster.
+There are **no in-cluster image builds or CI test runs anywhere** — the
+in-cluster Woodpecker buildkit and the fallback-build pattern were removed as a
+clean cut (ADR-0002, 2026-06-13). The Forgejo container registry is **frozen
+and emptied** — break-glass only.
+
+This breaks the old circular dependency (images needed to repair the cluster
+used to be built and stored *inside* it) and keeps build IO + registry pushes
+off the homelab spindle.

 ## Architecture Diagram

 ```mermaid
 graph LR
-    A[Git Push] --> B[GitHub Actions]
-    B --> C[Build Docker Image<br/>linux/amd64, 8-char SHA tag]
-    C --> D[Push to DockerHub]
-    D --> E[POST Woodpecker API]
-    E --> F[Woodpecker Pipeline]
-    F --> G[Vault K8s Auth<br/>SA JWT]
-    G --> H[kubectl set image]
-    H --> I[K8s Deployment]
-    I --> J[Pull from DockerHub<br/>or Pull-Through Cache]
+    A[git push Forgejo<br/>viktor/&lt;repo&gt; canonical] --> B[push-mirror sync_on_commit]
+    B --> C[GitHub mirror<br/>ViktorBarzin/&lt;repo&gt;]
+    C --> D[GitHub Actions<br/>.github/workflows/build.yml]
+    D --> E[lint / test]
+    E --> F[buildx linux/amd64<br/>provenance:false]
+    F --> G[push ghcr.io/viktorbarzin/&lt;name&gt;<br/>:sha8 + :latest]
+    G --> H[svu tag -> Forgejo canonical]
+    G --> I[POST Woodpecker deploy repo]
+    I --> J[.woodpecker/deploy.yml<br/>event: manual]
+    J --> K[kubectl set image<br/>in-cluster SA cluster-admin]
+    K --> L[K8s Deployment<br/>pulls from ghcr]

-    K[Pull-Through Cache<br/>10.0.20.10] -.-> J
-    L[forgejo.viktorbarzin.me<br/>Private Registry on Forgejo] -.-> J
-
-    style B fill:#2088ff
-    style F fill:#4c9e47
-    style K fill:#f39c12
+    style D fill:#2088ff
+    style J fill:#4c9e47
+    style G fill:#f39c12
 ```

 ## Components

-| Component | Version | Location | Purpose |
-|-----------|---------|----------|---------|
-| GitHub Actions | Cloud | `.github/workflows/build-and-deploy.yml` | Build Docker images, push to DockerHub |
-| Woodpecker CI | Self-hosted | `ci.viktorbarzin.me` | Deploy to Kubernetes cluster |
-| DockerHub | Cloud | `viktorbarzin/*` | Public image registry |
-| Private Registry | Forgejo Packages | `forgejo.viktorbarzin.me/viktor` | Private container images (PAT auth, retention CronJob) — migrated from registry.viktorbarzin.me 2026-05-07 |
-| Pull-Through Cache | Custom | `10.0.20.10:5000` (docker.io)<br/>`10.0.20.10:5010` (ghcr.io) | LAN cache for remote registries |
-| Kyverno | Cluster | `kyverno` namespace | Auto-sync registry credentials to all namespaces |
-| Vault | Cluster | `vault.viktorbarzin.me` | K8s auth for Woodpecker pipelines |
+| Component | Location | Purpose |
+|-----------|----------|---------|
+| GitHub Actions | `.github/workflows/build.yml` (per repo) | Build + lint + test + push image; trigger deploy; cut semver tag |
+| ghcr.io | `ghcr.io/viktorbarzin/*` | Container registry for ALL owned images (public + private packages) |
+| Woodpecker CI | `ci.viktorbarzin.me` | **Deploy-only** — `kubectl set image` in-cluster; plus infra applies + maintenance crons |
+| Forgejo | `forgejo.viktorbarzin.me/viktor/<repo>` | **Canonical** git source (push-mirrors to GitHub). Container registry **FROZEN** (break-glass only) |
+| Pull-Through Cache | `10.0.20.10:5000/5010/5020/5030/5040` | LAN cache for upstream registries (DockerHub, ghcr, Quay, k8s.gcr, Kyverno) |
+| Kyverno | `kyverno` namespace | Syncs `ghcr-credentials` (private-ghcr allowlist) + `registry-credentials` to namespaces |
+| Vault | `vault.viktorbarzin.me` | K8s auth for Woodpecker deploy pipelines; CI tokens in `secret/ci/global` + `secret/viktor` |

 ## How It Works

-### Build Flow (GitHub Actions)
+### The fleet pattern (every owned app)

-1. **Trigger**: Git push to main/master branch
-2. **Build**: GHA builds Docker image for `linux/amd64` platform only
-3. **Tag**: Image tagged with 8-character commit SHA (e.g., `viktorbarzin/app:a1b2c3d4`)
-   - `:latest` tags are **never used** to prevent stale pull-through cache issues
-4. **Push**: Image pushed to DockerHub public registry
-5. **Trigger Deploy**: POST request to Woodpecker API with repo ID and commit SHA
+1. **Canonical source = Forgejo** `viktor/<repo>`. A **push-mirror**
+   (`sync_on_commit`) pushes every commit to the GitHub mirror
+   `ViktorBarzin/<repo>`. The `.github/workflows/build.yml` is committed on
+   Forgejo and mirrors over.
+2. **GHA `build` job** (triggers `on: push: branches: [master]` ONLY — feature
+   branches mirror but build/deploy nothing, the safety valve):
+   - lint + test
+   - `svu` computes the next `vX.Y.Z` from conventional commits and pushes the
+     tag back to **canonical Forgejo** (GHA secret `FORGEJO_GIT_TOKEN` =
+     write:repository PAT); `VERSION` is baked into the image
+   - `docker buildx` `linux/amd64`, **`provenance: false`** (single-manifest —
+     avoids the orphaned-index-children failure class), push
+     `ghcr.io/viktorbarzin/<name>:<sha8>` + `:latest`
+   - `delete-package-versions` keeps the newest ~10 ghcr versions
+3. **GHA `deploy` job** POSTs `ci.viktorbarzin.me/api/repos/<id>/pipelines`
+   (the Woodpecker registration for the **GitHub mirror**, github-forge; GHA
+   secret `WOODPECKER_TOKEN`) with `IMAGE_TAG` + `IMAGE_NAME`.
+4. **`.woodpecker/deploy.yml`** (event: **manual** only, so the raw
+   Forgejo→GitHub mirror pushes don't fire a tag-less deploy) runs `kubectl set
+   image deployment/<app> <container>=<image>` in-cluster. The `woodpecker-agent`
+   SA is `cluster-admin`, so the `bitnami/kubectl` step needs no
+   kubeconfig/RBAC. The Deployment image is in `lifecycle.ignore_changes`
+   (`KEEL_IGNORE_IMAGE`) so the SHA tag sticks and `terragrunt apply` doesn't
+   fight it. CronJobs in owned apps track `:latest` + `imagePullPolicy: Always`
+   instead of a deploy step.

-### Deploy Flow (Woodpecker CI)
+**Keel stays enrolled** as a redundant net (finds the deployed SHA already
+running → no-op).

-1. **Receive Webhook**: Woodpecker API receives deployment trigger from GHA
-2. **Authenticate**: Pipeline uses Kubernetes ServiceAccount JWT to authenticate with Vault via K8s auth
-3. **Deploy**: `kubectl set image deployment/<name> <container>=viktorbarzin/<app>:<sha>`
-4. **Notify**: Slack notification on success/failure
+**Tooling**: `infra/scripts/offinfra-onboard` + `infra/scripts/offinfra-templates/`
+scaffold a repo onto this pattern (mirror, workflow, Woodpecker deploy repo,
+old-pipeline removal, default-branch flip). Mirror + workflow commits go via
+the Forgejo API over the internal Traefik LB
+(`curl --resolve forgejo.viktorbarzin.me:443:10.0.20.203`) since the devvm
+can't reach Forgejo's public hairpin.

-### Project Migration Status
+### ghcr package visibility

-**Migrated to GHA (8 projects)**:
- Website
- k8s-portal
- claude-memory-mcp
- apple-health-data
- audiblez-web
- plotting-book
- insta2spotify
- book-search (audiobook-search)
+| Visibility | Packages | Pull mechanism |
+|------------|----------|----------------|
+| **Public** | beadboard, nextcloud-todos, claude-agent-service, claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, chrome-service-novnc, android-emulator | Anonymous |
+| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci | `ghcr-credentials` dockerconfigjson |

-**Woodpecker-native owned-app builds** (build + push to the Forgejo private
-registry + `kubectl set image` rollout, all in one `.woodpecker.yml`; Keel
-stays enrolled as a redundant net): `tuya_bridge`, `job-hunter`, `f1-stream`.
-`f1-stream` was extracted from this monorepo to `viktor/f1-stream` on
-2026-06-05 (Woodpecker repo id 166); the old github source is archived and its
-GHA-era Woodpecker repo (id 10) is deactivated.
+Private-image pulls use the `ghcr-credentials` dockerconfigjson, cloned by the
+kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit
+**ALLOWLIST** of private-ghcr namespaces only (NOT cluster-wide; source
+`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`). Cred = Vault
+`secret/viktor/ghcr_pull_token` (a dedicated classic PAT scoped to
+`read:packages`, UI-minted 2026-06-15 — no longer the admin `github_pat` alias.
+GitHub has no token-mint API, so rotation is manual: re-mint the classic
+`read:packages` PAT → `vault kv patch secret/viktor ghcr_pull_token=…` →
+targeted apply `module.kyverno.kubernetes_secret.ghcr_credentials` (reads Vault;
+avoids the git-crypt `tls-secret-sync` landmine on a locked clone), which
+Kyverno then re-syncs to the allowlisted namespaces).

-**Woodpecker-only (infra + large apps)**:
- `travel_blog`: 5.7GB content directory exceeds GHA limits
- Infra pipelines: require cluster access (terragrunt apply, certbot, build-cli)
+### Migrated apps (issues #13–#27)

-### Woodpecker Pipeline Files
+f1-stream, job-hunter, tuya_bridge, beadboard, nextcloud-todos,
+claude-agent-service, claude-memory-mcp, kms-website, Freedify,
+instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`),
+fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original
+pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website,
+k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify,
+audiobook-search, council-complaints) now also land on ghcr.

-Each project contains:
- `.woodpecker/deploy.yml`: kubectl set image + Slack notification
- `.woodpecker/build-fallback.yml`: Legacy full build pipeline (event: deployment, never auto-fires)
+### Infra-owned images (issues #29 / #30)

-### Woodpecker Repository IDs
+Images owned by the infra repo build on GHA workflows **in the infra repo's own
+`.github/workflows/`** (the github↔forgejo divergence was deliberately NOT
+reconciled — the workflows were added to the GitHub lineage via PR):

-Woodpecker API uses numeric IDs (not owner/name):
+| Image | Workflow | Destination |
+|-------|----------|-------------|
+| chrome-service-novnc | `build-chrome-service-novnc.yml` | public `ghcr.io/viktorbarzin/chrome-service-novnc` |
+| android-emulator | `build-android-emulator.yml` | public `ghcr.io/viktorbarzin/android-emulator` |
+| infra CLI | `build-cli.yml` | DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli` |
+| infra-ci | `build-infra-ci.yml` | private `ghcr.io/viktorbarzin/infra-ci` |

-| Repo | ID |
-|------|------|
-| infra | 1 |
-| Website | 2 |
-| finance | 3 |
-| health | 4 |
-| travel_blog | 5 |
-| webhook-handler | 6 |
-| audiblez-web | 9 |
-| plotting-book | 43 |
-| claude-memory-mcp | 78 |
-| infra-onboarding | 79 |
+**`infra-ci`** is the image the `.woodpecker/default.yml` apply step and
+`drift-detection.yml` run in (proven by pipelines 165/166). `chatterbox-tts` is
+already built by tripit's GHA → ghcr.

-### Image Registry Flow
+The Woodpecker `build-ci-image.yml` and `build-cli.yml` pipelines were
+**REMOVED**. Break-glass for infra-ci is now a manual
+`.woodpecker/breakglass-infra-ci.yml` (ghcr pull-and-save to the registry VM).

-1. **Containerd hosts.toml** redirects pulls from docker.io and ghcr.io to pull-through cache at `10.0.20.10`
-2. **Pull-through cache** serves cached images from LAN, fetches from upstream on cache miss
-3. **Kyverno ClusterPolicy** auto-syncs `registry-credentials` Secret to all namespaces for private registry access
-4. **Private registry** has been Forgejo's built-in OCI registry at `forgejo.viktorbarzin.me/viktor/<image>` since 2026-05-07. Auth via PAT (Vault `secret/ci/global/forgejo_push_token` for push, `secret/viktor/forgejo_pull_token` for pull). The pre-migration `registry:2.8.3`-based private registry on `registry.viktorbarzin.me:5050` was the root cause of three orphan-index incidents in three weeks (2026-04-13, 2026-04-19, 2026-05-04 — see `docs/post-mortems/2026-04-19-registry-orphan-index.md` and the full migration writeup at `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md`). The five pull-through caches on `10.0.20.10` (ports 5000/5010/5020/5030/5040) stay in place for upstream registries.
-5. **Integrity probe** (`registry-integrity-probe` CronJob in `monitoring` ns, every 15m) walks `/v2/_catalog` → tags → indexes → child manifests via HEAD and pushes `registry_manifest_integrity_failures` to Pushgateway; alerts `RegistryManifestIntegrityFailure` / `RegistryIntegrityProbeStale` / `RegistryCatalogInaccessible` page on broken state. Authoritative check (HTTP API, not filesystem).
+### Forgejo container registry — FROZEN

-### Infra Pipelines (Woodpecker-only)
+Issue #32 wiped all `viktor/*` container packages (~19G reclaimed, `/data`
+58%→20%). The registry is **break-glass-only** now; nothing pushes to it. The
+`forgejo-cleanup` CronJob stays in `DRY_RUN` (nothing to clean). Pull-through
+caches on the registry VM (`10.0.20.10`) are unchanged. See
+`docs/runbooks/forgejo-registry-breakglass.md`.
+
+### Image registry / pull path
+
+1. **Containerd `hosts.toml`** redirects pulls from docker.io and ghcr.io to the
+   pull-through cache at `10.0.20.10` (5000 = docker.io, 5010 = ghcr.io).
+2. **Pull-through cache** serves cached images from the LAN, fetches upstream on
+   a miss.
+3. **Kyverno ClusterPolicies** sync `ghcr-credentials` (private-ghcr allowlist)
+   and `registry-credentials` to namespaces.
+
+## Woodpecker — what it still runs
+
+Woodpecker is **deploy + cluster-touching steps only**:

 | Pipeline | File | Purpose |
 |----------|------|---------|
-| default | `.woodpecker/default.yml` | Terragrunt apply on push |
-| renew-tls | `.woodpecker/renew-tls.yml` | Certbot renewal cron |
-| build-cli | `.woodpecker/build-cli.yml` | Build and push to dual registries |
-| build-ci-image | `.woodpecker/build-ci-image.yml` | Build `infra-ci` tooling image (triggered by `ci/Dockerfile` change or manual); post-push HEADs every blob via `verify-integrity` step to catch orphan-index pushes |
-| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered build for k8s-portal subdirectory |
-| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` to `/opt/registry/` on `10.0.20.10` when any managed file changes; bounces containers + nginx per `docs/runbooks/registry-vm.md` |
-| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE host |
-| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new `docs/post-mortems/*.md` via headless Claude agent |
-| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift detection |
-| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues |
+| per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) |
+| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) |
+| certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron |
+| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) |
 | provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
+| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` → `10.0.20.10` on change |
+| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE |
+| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues |
+| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new post-mortems |
+| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered deploy for the portal |
+| breakglass-infra-ci | `.woodpecker/breakglass-infra-ci.yml` | **Manual** ghcr pull-and-save of infra-ci to the registry VM |
+
+**No build/test pipeline exists on any repo.** Do not (re)introduce one.
+
+### Woodpecker API
+
+Uses **numeric repo IDs** (`/api/repos/<id>/pipelines`), NOT owner/name paths
+(those return HTML). The deploy registration for each app is the **GitHub
+mirror** repo (registered github-forge). IDs are stable across renames and must
+be looked up from the Woodpecker UI/DB.
+
+### Woodpecker YAML gotchas
+
+- Commands with `${VAR}:${VAR}` must be **quoted** — an unquoted `:` triggers
+  YAML map parsing when the vars are empty.
+- Use `bitnami/kubectl:latest` (not pinned versions — entrypoint compatibility).
+- Global secrets must include `manual` in their events list for API-triggered
+  pipelines.
+
+### GitHub repo secrets
+
+Per repo: `WOODPECKER_TOKEN` (POST the deploy pipeline), `FORGEJO_GIT_TOKEN`
+(write:repository PAT for the `svu` tag push). ghcr push uses the workflow's
+built-in `GITHUB_TOKEN` (`packages: write`).
+
+## Infra repo CI topology
+
+The infra repo runs on Woodpecker via **two** forge registrations: the Forgejo
+forge (repo id 82, registered 2026-06-08) and the legacy GitHub forge (repo id
+1). Pushes to **Forgejo** `master` fire `.woodpecker/default.yml`
+(changed-stacks terragrunt apply, in `infra-ci`) plus the `notify-nonadmin-push`
+Slack audit step. Operational facts (2026-06-10):
+
+- **Webhook URL is the IN-CLUSTER service**:
+  `http://woodpecker-server.woodpecker.svc.cluster.local/api/hook?...` (PATCHed
+  via the Forgejo API). The Woodpecker default (`https://ci.viktorbarzin.me/...`)
+  resolves to the non-proxied public A record from pods → NAT hairpin →
+  intermittent `context deadline exceeded`, silently dropping push events. If
+  Woodpecker "repairs" the repo it rewrites the hook back to `ci.viktorbarzin.me`
+  — re-apply the in-cluster URL.
+- **Repo-scoped secrets must exist on BOTH repos**: pipelines reference
+  repo-level secrets (`registry_ssh_key`, `pve_ssh_key`, `CLOUDFLARE_TOKEN`, …).
+  When registering a new forge repo for infra, clone the secret set too.
+- **Empty commits defeat path filters**: a commit with no changed files makes
+  Woodpecker include ALL workflow files (path conditions can't exclude), so every
+  repo secret must resolve. Normal commits with real files only compile the
+  matching workflows.
+
+The Forgejo trigger is not fully dependable: land infra changes by pushing to
+Forgejo `master` (as viktor), use `[ci skip]` for docs/no-op commits, and verify
+deploys via `scripts/tg` + live cluster state rather than trusting the CI
+checkmark. The two remotes have **diverged** (parallel histories under
+different SHAs); expect pushes to GitHub to be rejected as non-fast-forward.
+Leave them rejected and never force-push.

 ## Configuration

-### GitHub Actions
-
-**File**: `.github/workflows/build-and-deploy.yml`
+### GitHub Actions (per-app `.github/workflows/build.yml`)

 ```yaml
-name: Build and Deploy
+name: build
 on:
  push:
-    branches: [main, master]
+    branches: [master]
 jobs:
  build:
    runs-on: ubuntu-latest
+    permissions:
+      contents: write   # svu tag push
+      packages: write    # ghcr push
    steps:
-      - name: Build Docker image
-        run: docker build --platform linux/amd64 -t viktorbarzin/app:${SHORT_SHA} .
-      - name: Push to DockerHub
-        run: docker push viktorbarzin/app:${SHORT_SHA}
-      - name: Trigger Woodpecker Deploy
+      - uses: actions/checkout@v4
+      - name: lint + test
+        run: make lint test
+      - name: svu tag -> Forgejo
        run: |
-          curl -X POST https://ci.viktorbarzin.me/api/repos/<REPO_ID>/pipelines \
-            -H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}"
+          VERSION=$(svu next)
+          # ... push tag to canonical Forgejo with FORGEJO_GIT_TOKEN
+      - uses: docker/setup-buildx-action@v3
+      - uses: docker/build-push-action@v6
+        with:
+          platforms: linux/amd64
+          provenance: false
+          push: true
+          tags: |
+            ghcr.io/viktorbarzin/<name>:${{ github.sha }}
+            ghcr.io/viktorbarzin/<name>:latest
+  deploy:
+    needs: build
+    runs-on: ubuntu-latest
+    steps:
+      - name: Trigger Woodpecker deploy
+        run: |
+          curl -X POST https://ci.viktorbarzin.me/api/repos/<DEPLOY_REPO_ID>/pipelines \
+            -H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}" \
+            -d '{"branch":"master","variables":{"IMAGE_TAG":"...","IMAGE_NAME":"..."}}'
 ```

-**Required GitHub Secrets**:
- `DOCKERHUB_USERNAME`
- `DOCKERHUB_TOKEN`
- `WOODPECKER_TOKEN`
-
-### Woodpecker Deploy Pipeline
-
-**File**: `.woodpecker/deploy.yml`
+### Woodpecker deploy pipeline (per-app `.woodpecker/deploy.yml`)

 ```yaml
 when:
-  event: [deployment]
+  event: manual

 steps:
  deploy:
-    image: bitnami/kubectl:latest
+    image: bitnami/kubectl:latest   # uses the in-cluster woodpecker-agent SA (cluster-admin)
    commands:
-      - kubectl set image deployment/app app=viktorbarzin/app:${CI_COMMIT_SHA:0:8}
-    secrets: [k8s_token]
-
+      - "kubectl set image deployment/app app=${IMAGE_NAME}:${IMAGE_TAG} -n <ns>"
+      - "kubectl rollout status deployment/app -n <ns> --timeout=300s"
  notify:
    image: plugins/slack
-    settings:
-      webhook: ${SLACK_WEBHOOK}
    when:
      status: [success, failure]
 ```

-**YAML Gotchas**:
- Commands with `${VAR}:${VAR}` syntax must be quoted to prevent YAML map parsing when vars are empty
- Use `bitnami/kubectl:latest` (not pinned versions)
- Global secrets must be manually added to `secrets:` list in pipeline
+### CI/CD secrets sync

-### Vault Configuration
-
-**K8s Auth for Woodpecker**:
- Woodpecker pipelines authenticate using ServiceAccount JWT
- Vault K8s auth mount validates JWT and issues token
- Policies grant access to secrets and dynamic credentials
-
-### CI/CD Secrets Sync
-
-**CronJob**: Pushes `secret/ci/global` from Vault → Woodpecker API every 6 hours
- Keeps Woodpecker global secrets in sync with Vault
- Runs in `woodpecker` namespace
-
-## Infra repo CI (Woodpecker repo 82 — Forgejo forge)
-
-The infra repo itself runs on Woodpecker via the **Forgejo** forge (repo id 82,
-registered 2026-06-08; the GitHub-side repo id 1 also remains registered).
-Pushes to `master` fire `.woodpecker/default.yml` (changed-stacks terragrunt
-apply) plus the `notify-nonadmin-push` Slack audit step (allow-then-audit
-contribution model — see `multi-tenancy.md`). Operational facts (2026-06-10):
-
- **Webhook URL is the IN-CLUSTER service**: `http://woodpecker-server.woodpecker.svc.cluster.local/api/hook?...`
-  (PATCHed via the Forgejo API). The Woodpecker-generated default
-  (`https://ci.viktorbarzin.me/...`) resolves to the non-proxied public A
-  record from pods → NAT hairpin → intermittent `context deadline exceeded`,
-  silently dropping push events (found when a push produced no pipeline).
-  If Woodpecker ever "repairs" the repo it will rewrite the hook back to
-  `ci.viktorbarzin.me` — re-apply the in-cluster URL (or pin `ci.viktorbarzin.me`
-  in the CoreDNS pod carve-out alongside forgejo).
- **Repo-scoped secrets must exist on BOTH repos**: pipelines reference
-  repo-level secrets (`registry_ssh_key`, `pve_ssh_key`, `CLOUDFLARE_TOKEN`,
-  …). Repo 82 was registered without them and every all-workflow compile
-  errored with `secret "registry_ssh_key" not found`. Fixed by cloning repo-1
-  rows to repo 82 in the Woodpecker DB (`insert into secrets … select … where
-  repo_id=1`). When registering a new forge repo for infra, clone the secret
-  set too.
- **Empty commits defeat path filters**: a commit with no changed files makes
-  Woodpecker include ALL workflow files (path conditions can't exclude), so
-  every repo secret must resolve. Normal commits with real files only compile
-  the matching workflows.
+A CronJob in the `woodpecker` namespace pushes `secret/ci/global` from Vault →
+the Woodpecker API every 6h, keeping global secrets in sync. Woodpecker deploy
+pipelines authenticate to the cluster via the in-cluster `woodpecker-agent` SA
+(cluster-admin); Vault K8s auth backs any secret reads.

 ## Decisions & Rationale

-### Why GitHub Actions + Woodpecker?
+### Why all builds off-infra (ADR-0002)?

-**Alternatives considered**:
-1. **Woodpecker-only**: Simple, but wastes cluster resources on builds
-2. **GHA-only**: No cluster access, requires kubectl from outside (security risk)
-3. **Hybrid (chosen)**: GHA for compute-heavy builds (free), Woodpecker for privileged deployments (secure cluster access)
+- **Breaks the circular dependency** — the images needed to repair the cluster
+  no longer live inside it (they're on ghcr, an external registry).
+- **Removes build IO + registry push load** from the contended homelab spindle.
+- GHA is free on public repos and generous on private; buildx `provenance: false`
+  sidesteps the orphaned-index-children failure class that plagued the
+  in-cluster registry.
+- **Clean cut** — no in-cluster fallback builds anywhere; one pattern,
+  fleet-wide.

-**Benefits**:
- Free compute for builds on public repos
- Cluster access stays internal (Woodpecker has direct K8s access)
- Separation of concerns: build vs deploy
+### Why ghcr (not push back to Forgejo)?

-### Why 8-Character SHA Tags (Not :latest)?
+Forgejo's container registry repeatedly orphaned OCI index children
+(2026-04-13/19, 2026-05-04, 2026-06-10) and its retention is not container-aware.
+ghcr is external (DR-safe), free for this scale, and has native multi-arch
+handling. The Forgejo registry was frozen + emptied (issue #32).

- Pull-through cache serves stale `:latest` tags indefinitely
- SHA tags ensure every deployment pulls the correct image
- 8 characters provide sufficient collision resistance (16^8 = 4.3 billion combinations)
+### Why Woodpecker stays for deploy?

-### Why Numeric Repo IDs for Woodpecker API?
+`kubectl set image` needs in-cluster privileged access; doing it from GHA would
+mean exposing kube-apiserver or a long-lived kubeconfig. Woodpecker's
+`woodpecker-agent` SA is already cluster-admin in-cluster — the deploy step
+needs no credentials.

- Woodpecker API requires numeric IDs (not owner/name slugs)
- IDs are stable across repo renames
- Must be manually looked up from Woodpecker UI or database
+### Why `event: manual` on deploy.yml?

-### Why linux/amd64 Only?
+The Forgejo→GitHub push-mirror sends raw, tag-less pushes to the GitHub mirror.
+If `deploy.yml` fired on `push`, every mirror sync would trigger a deploy with no
+image tag. `manual` means only the GHA `deploy` job's explicit API POST (with
+`IMAGE_TAG`) deploys.

- Cluster runs on x86_64 nodes only
- ARM builds would waste time and storage
- Multi-arch images add complexity without benefit
+### Why linux/amd64 only?
+
+The cluster runs on x86_64 nodes only; ARM builds waste time and storage.

 ## Troubleshooting

-### GHA Build Fails: "denied: requested access to the resource is denied"
+### GHA build fails: ghcr push "denied"

-**Cause**: DockerHub credentials expired or incorrect
+The workflow `GITHUB_TOKEN` needs `packages: write` permission and the package
+must allow the repo to push. Check the workflow `permissions:` block and the
+package's "Manage Actions access" settings.
+
+### Image pull fails: "ErrImagePull" / "ImagePullBackOff"

-**Fix**:
 ```bash
-# Regenerate DockerHub token
-# Update GitHub repo secrets: DOCKERHUB_USERNAME, DOCKERHUB_TOKEN
+# Public image — check the pull-through cache is up
+curl http://10.0.20.10:5010/v2/_catalog
+
+# Private image — verify the ghcr-credentials Secret exists in the namespace
+kubectl get secret ghcr-credentials -n <namespace>
+# It's Kyverno-synced to an allowlist; if missing, the namespace isn't on the
+# allowlist in stacks/kyverno/modules/kyverno/ghcr-credentials.tf
 ```

-### Woodpecker Deploy Fails: "Unauthorized"
+If the cause is the internal-DNS hairpin (fresh pulls timing out on the public
+Forgejo path), see the CoreDNS `viktorbarzin.me` carve-out in
+`docs/architecture/networking.md` and `docs/runbooks/registry-vm.md`.

-**Cause**: Vault K8s auth token expired or invalid
+### Deploy didn't happen after a push

-**Fix**:
-```bash
-# Restart Woodpecker pipeline (token auto-renewed)
-# Check Vault K8s auth role exists: vault read auth/kubernetes/role/woodpecker-deployer
-```
+Confirm the push was to **master** (feature branches build/deploy nothing).
+Check the GHA run completed the `deploy` job, then check Woodpecker received the
+manual pipeline (`ci.viktorbarzin.me`, the GitHub-mirror deploy repo). Verify
+live with `kubectl rollout status` — not the CI checkmark.

-### Image Pull Fails: "ErrImagePull"
+### Woodpecker deploy fails: "YAML: did not find expected key"

-**Cause**: Pull-through cache or registry credentials issue
-
-**Fix**:
-```bash
-# Check pull-through cache is running
-curl http://10.0.20.10:5000/v2/_catalog
-
-# Verify registry-credentials Secret exists in namespace
-kubectl get secret registry-credentials -n <namespace>
-
-# Manually sync credentials if missing
-kubectl get secret registry-credentials -n default -o yaml | \
-  sed 's/namespace: default/namespace: <namespace>/' | kubectl apply -f -
-```
-
-### Woodpecker Pipeline: "YAML: did not find expected key"
-
-**Cause**: Unquoted command with `${VAR}:${VAR}` syntax when VAR is empty
-
-**Fix**: Quote the command:
-```yaml
-commands:
-  - "kubectl set image deployment/app app=viktorbarzin/app:${SHORT_SHA}"
-```
-
-### travel_blog Build Times Out on GHA
-
-**Cause**: 5.7GB content directory exceeds GHA disk/time limits
-
-**Fix**: Keep on Woodpecker (no migration). Build uses cluster storage and resources.
-
-### CI/CD Secrets Out of Sync
-
-**Cause**: CronJob failed to sync Vault → Woodpecker
-
-**Fix**:
-```bash
-# Check CronJob status
-kubectl get cronjob -n woodpecker
-
-# Manually trigger sync
-kubectl create job --from=cronjob/sync-secrets manual-sync -n woodpecker
-```
+Unquoted command with `${VAR}:${VAR}` syntax when a VAR is empty. Quote the
+command (see the deploy.yml example above).

 ## Related

- [Databases Architecture](./databases.md) — Database credentials via Vault
- [Multi-Tenancy](./multi-tenancy.md) — Per-user Woodpecker access
- Runbook: `../runbooks/deploy-new-app.md` — How to set up CI/CD for a new app
- Runbook: `../runbooks/troubleshoot-image-pull.md` — Debug image pull issues
- Vault documentation: K8s auth configuration
- Woodpecker documentation: API reference
+- ADR: `../adr/0002-all-image-builds-off-infra-gha-ghcr.md` — the decision
+- [Databases Architecture](./databases.md) — database credentials via Vault
+- [Multi-Tenancy](./multi-tenancy.md) — per-user Woodpecker access
+- Runbook: `../runbooks/forgejo-registry-breakglass.md` — using the frozen registry
+- Runbook: `../runbooks/registry-vm.md` — pull-through cache VM + image-pull debugging
+- Onboarding tool: `../../scripts/offinfra-onboard` + `../../scripts/offinfra-templates/`
--- a/docs/architecture/multi-tenancy.md
+++ b/docs/architecture/multi-tenancy.md
@ -543,6 +543,10 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1

 **Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.

+**Onboarding state self-heals (2026-06-15):** `~/.claude.json` is a single file that ALL of a user's concurrent `claude` processes (the ttyd terminal + their `t3-serve` instance + agent/SDK sessions) read-modify-write, so a stale writer periodically drops top-level keys — including `hasCompletedOnboarding` — which bounces the next *interactive* session back to the first-run "Choose the text style" wizard even though the user is fully logged in (credentials live in the SEPARATE `~/.claude/.credentials.json`, untouched by the race; first observed for emo 2026-06-15). The launcher (`skel/start-claude.sh`) now idempotently re-asserts `hasCompletedOnboarding` (+ `lastOnboardingVersion`) in `~/.claude.json` right before it runs `claude` — merge-only, never clobbers other keys, no-op if jq is missing or the file is empty/corrupt. And since the launcher is a per-user copy that `/etc/skel` only seeds at account creation, the reconcile's new `deploy_user_launcher` step re-copies `skel/start-claude.sh` into every non-admin home (copy-if-changed) so launcher edits now reach EXISTING users within the hour — `.tmux.conf` is deliberately NOT re-copied (terminal-lobby appends its own managed section to it).
+
+**Claude Code runtime — native, per-user (2026-06-15):** `claude` is the **native** install (`~/.local/bin/claude` → `~/.local/share/claude/versions/<v>`, self-updating; `installMethod: native`) — NOT npm-global or npx. It is the runtime for both the ttyd launcher and each `t3-serve` instance. `setup-devvm.sh` installs node ONLY for the `t3` CLI (not claude); per-user native claude is provisioned by the reconcile's `install_user_claude_native` (covers terminal + t3, idempotent, skip-if-present) and self-bootstrapped by `start-claude.sh` on first launch — both via the official `https://claude.ai/install.sh`. The legacy machine-wide `npm install -g @anthropic-ai/claude-code` bootstrap and the launcher's `npx` fallback were removed; existing users had already auto-migrated to native, and the npm-global dir was empty. **PATH (`~/.local/bin`, where the native binary lives):** ensured three ways — `/etc/profile.d/10-local-bin.sh` for login shells (machine-wide, fresh-user-safe), `start-claude.sh` itself (the launcher runs in tmux's non-login env that skips the user's shell rc), and `t3-serve@.service` (`Environment=PATH=…:/home/%i/.local/bin`).
+
 **Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. Its location depends on the per-user `code_layout` in `roster.yaml`: `single` (default) puts the clone AT `~/code`; `workspace` makes `~/code` a plain directory of per-project clones — the infra clone at `~/code/infra` plus each roster `repos` entry cloned from Forgejo `viktor/<name>` **as the user** (their PAT authenticates, so private repos work; clone failures WARN and retry next hour). Flipping a user to `workspace` auto-migrates their existing `~/code` clone to `~/code/infra` (local branches/dirty state survive; running processes follow the moved inode). ancamilea = workspace + `tripit` since 2026-06-10. The provisioner clones infra anonymously from the public GitHub mirror; **contribute access is wired per-user on top** (see below). The apply boundary still holds (`scripts/tg apply` needs an admin Vault token + cluster RBAC), but **pushing `master` is NOT inert** — the Forgejo→Woodpecker webhook fires `.woodpecker/default.yml` (`event: push, branch: master`, `require_approval: forks` only), which terragrunt-applies changed stacks. `master` is **branch-protected on Forgejo** (force-push disabled for everyone — history is append-only; push + merge whitelists = `viktor` + explicitly granted users, deploy keys allowed). **Allow-then-audit (Viktor, 2026-06-10):** `ebarzin` (emo) is on the whitelist and pushes straight to `master` — no PR gate. 
The tracking burden moves to: (a) **commit messages that record what + why** (the agent instructions in AGENTS.md and the managed claudeMd require the body to paraphrase the user's request), (b) the **`notify-nonadmin-push` Slack audit step** in `.woodpecker/default.yml` — every master push by a non-admin author is posted to Slack (admin pushes are not), and (c) non-admins **never use `[ci skip]`** so every change fires the pipeline (and thus the audit feed). Users NOT on the whitelist fall back to `<user>/<topic>` branches + PRs. **Clones stay fresh automatically** (2026-06-10): the hourly `t3-provision-users` reconcile runs `refresh_user_clone` over every managed clone — the infra clone and any workspace repos (fetch all remotes + fast-forward `master`, ONLY when on master with a clean tree and an upstream — dirty trees and local commits are left alone with a WARN) — and also `wire_forgejo_remote`, which idempotently adds the documented `forgejo` remote + `forgejo/master` upstream to infra clones that predate that contract. `start-claude.sh` does the same freshen at session launch (10s fetch cap per repo so an offline remote never stalls the session; workspace layouts freshen each repo under `~/code`).

 **Contribute access (per non-admin, manual — the anca/tripit PAT precedent):**
--- a/docs/runbooks/tripit-external-signup.md
+++ b/docs/runbooks/tripit-external-signup.md
@ -0,0 +1,226 @@
+# Runbook — TripIt external user self-signup (email + passkey)
+
+Implements ADR-0020 (tripit repo): people outside the homelab self-register for
+TripIt with **email + a passkey** (no password), are auto-tagged into the
+**`TripIt External`** Authentik group, and are fenced to `tripit.viktorbarzin.me`
+only. Intended audience: people Viktor knows, but registration itself is open to
+the public.
+
+> **Safety model.** Containment is two-layered. (1) **Forward-auth apps** — the
+> branch in `stacks/authentik/admin-services-restriction.tf` admits `TripIt
+> External` to `tripit.viktorbarzin.me` and denies every other `auth="required"`
+> host. (2) **OIDC apps** — the branch does NOT cover them (OIDC logins bypass
+> forward-auth); External users are contained because every sensitive OIDC app
+> already requires a trusted group they do not hold (audit below). The no-lockout
+> guarantee is that the group is created **empty**, so the new branch matches
+> zero existing users on day one.
+
+## OIDC app authorization audit (2026-06-15, read-only)
+
+A parentless `TripIt External` user holds NONE of these groups, so:
+
+| OIDC app | Requires | External user |
+|---|---|---|
+| Immich, Grafana, Linkwarden, Cloudflare Access | `Home Server Admins` | DENIED ✓ |
+| Forgejo | `Task Submitters` / `Forgejo Users` | DENIED ✓ |
+| Headscale | `Headscale Users` | DENIED ✓ |
+| wrongmove | `Wrongmove Users` | DENIED ✓ |
+| **Vault** | **was OPEN** → bound to `Allow Login Users` in Step 3 | DENIED after Step 3 |
+| Kubernetes, Kubernetes Dashboard | OPEN | harmless — apiserver rejects OIDC tokens (idle) |
+| TripIt App, Public | OPEN | by design (TripIt's own provider / guest) |
+
+Vault's JWT `default` role grants only Vault's built-in `default` policy (token
+self-management, cubbyhole — **no** secret access), so the pre-fix exposure was a
+near-powerless token; Step 3 closes it anyway.
+
+---
+
+## Pre-flight gates (STOP if any fails)
+
+1. **`TripIt External` is net-new / empty** (no-lockout precondition):
+   ```
+   kubectl -n authentik exec -i deploy/goauthentik-server -- ak shell <<'PY'
+   from authentik.core.models import Group
+   g = Group.objects.filter(name="TripIt External").first()
+   print("exists:", bool(g), "members:", g.users.count() if g else 0)
+   PY
+   ```
+   Expect `exists: False members: 0`. If the group exists with members → STOP.
+2. **Authentik image pin matches live (B5)** — the policy edit auto-applies the
+   whole `authentik` stack; a stale pin re-triggers the 2026-06-10 downgrade
+   boot-storm:
+   ```
+   kubectl -n authentik get deploy -o custom-columns=N:.metadata.name,IMG:.spec.template.spec.containers[0].image
+   ```
+   Every `goauthentik`/`ak-outpost` image tag MUST equal
+   `stacks/authentik/modules/authentik/values.yaml` `global.image.tag`
+   (currently `2026.2.4`). If they differ → refresh the pin first.
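The pin comparison in gate 2 can be automated. A minimal sketch, assuming the `custom-columns` output above is piped in as `NAME IMAGE` rows and every image uses a plain `:tag` reference (no `@sha256` digest); `check_pin` is a hypothetical helper, not something that exists in the repo:

```shell
# Hypothetical helper: read "NAME IMAGE" rows on stdin and report every image
# whose tag differs from the expected pin. Returns non-zero on any mismatch.
check_pin() {
  expected="$1"
  status=0
  while read -r name image; do
    # Skip the custom-columns header row ("N IMG") and blank rows.
    if [ "$name" = "N" ] || [ -z "$image" ]; then
      continue
    fi
    tag="${image##*:}"   # text after the last ':' (assumes tag-style references)
    if [ "$tag" != "$expected" ]; then
      echo "MISMATCH $name $tag (want $expected)"
      status=1
    fi
  done
  return "$status"
}
```

Piping the gate's `kubectl ... custom-columns` command into `check_pin 2026.2.4` would then exit non-zero on any drift. Filter to the `goauthentik`/`ak-outpost` rows first if other deployments share the namespace.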
+
+---
+
+## Step 1 — Terraform (group + fence branch)
+
+Already written on this branch:
+- `stacks/authentik/tripit-external.tf` — the empty, parentless group.
+- `stacks/authentik/admin-services-restriction.tf` — the prepended fence branch.
+
+**Local plan gate (B4 — CI auto-applies on push with `-auto-approve`, so there is
+NO human plan review in the apply path; do it here):**
+```
+vault login -method=oidc
+cd stacks/authentik && ../../scripts/tg plan
+```
+Confirm the plan is **exactly**:
+- `+ authentik_group.tripit_external` (create)
+- `~ authentik_policy_expression.admin_services_restriction` (update in place — the
+  `expression` body gains ONLY the new branch; every other line byte-identical)
+- **`Plan: 1 to add, 1 to change, 0 to destroy.`**
+
+ABORT if the plan shows any destroy/replace, any `authentik_provider_*` /
+`authentik_outpost` / `authentik_flow*` / `helm_release`, or any other expression
+change.
+
+**Apply** (presence-claim courtesy, then push = apply; land human-watched, B5):
+```
+~/code/scripts/presence claim stack:authentik --purpose "ADR-0020 TripIt External group + fence branch"
+# push the branch to master (this triggers CI tg apply on the authentik stack)
+```
+Watch: GHA → Woodpecker `default.yml` apply → outpost stays healthy
+(`kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost` = 2
+IPs; an anonymous request to any `auth=required` host still 302s to Authentik).
+The branch is inert (empty group) so no access changes yet.
+
+---
+
+## Step 2 — Authentik SMTP (B1, BLOCKER before any flow)
+
+Email verification is the **entire identity boundary** (TripIt trusts the
+Authentik email verbatim). Authentik currently has the **default/unconfigured**
+transport (`email.host = localhost`), so verification/recovery mail cannot send.
+
+Add to **both** `server.env` and `worker.env` in
+`stacks/authentik/modules/authentik/values.yaml` (wire the password from a secret;
+the cluster mailserver is what TripIt already relays through —
+`mailserver.mailserver.svc`):
+```yaml
+    - { name: AUTHENTIK_EMAIL__HOST,     value: "mailserver.mailserver.svc" }
+    - { name: AUTHENTIK_EMAIL__PORT,     value: "587" }
+    - { name: AUTHENTIK_EMAIL__USE_TLS,  value: "true" }
+    - { name: AUTHENTIK_EMAIL__FROM,     value: "noreply@viktorbarzin.me" }
+    - { name: AUTHENTIK_EMAIL__USERNAME, value: "<relay user>" }      # confirm relay creds
+    - { name: AUTHENTIK_EMAIL__PASSWORD, valueFrom: { secretKeyRef: { name: <secret>, key: <key> } } }
+```
+**Gate:** after apply, Authentik UI → System → Settings (or an Email stage) →
+**Send test email**; it must arrive. Then prove enrollment cannot complete for an
+address you do NOT control.
+
+---
+
+## Step 3 — Bind Vault → `Allow Login Users` (close the one open OIDC gap)
+
+Authentik UI → Applications → **Vault** → bind an authorization policy requiring
+group **`Allow Login Users`** (the base group every real homelab user inherits;
+parentless `TripIt External` is excluded). This changes nothing for existing
+users and denies External users at the Vault consent step.
+Verify: an External test account (Step 6) cannot complete Vault OIDC login.
+
+---
+
+## Step 4 — Build the flows (Authentik UI; UI-managed per ADR split)
+
+All three flows: designation as noted, no password stage.
+
+**Flow `tripit-enrollment`** (Enrollment):
+| Order | Stage | Key settings |
+|---|---|---|
+| 5  | Captcha | reCAPTCHA **v2 checkbox** keys (v3/invisible fail — see `crowdsec-recaptcha-key-type`) |
+| 10 | Identification | email only; **no** `password_stage`; `sources` optional |
+| 20 | Email (verification) | activate, blocking — **before** user_write |
+| 30 | WebAuthn authenticator setup | `user_verification = required`, `resident_key = required` |
+| 40 | User Write | **`create_users_group` = `TripIt External`** (the keystone tag); `user_type = external` |
+| 50 | User Login | session as default (`weeks=4`) |
+
+**Flow `tripit-login`** (Authentication, passwordless):
+Identification (sets `enrollment_flow`/`recovery_flow`) → Authenticator
+Validation (`device_classes = [webauthn]`, `user_verification = required`) → User
+Login. Prefer routing a passkey-less email to recovery over minting a credential.
+
+**Flow `tripit-recovery`** (Recovery):
+Identification (`pretend_user_exists = on`) → Email (recovery link) → WebAuthn
+authenticator setup → User Login. Notify the account on recovery + new-passkey.
+
+> Do **NOT** bind the `brute-force-protection` ReputationPolicy to these flows —
+> it denies anonymous users (2026-04-06 regression). The Captcha is the bot gate.
+
+---
+
+## Step 5 — Surface "Sign up"
+
+Recommended: a **TripIt-scoped** signup link / share-invite rather than a global
+login-screen button (narrower bot surface). Enrollment URL:
+`https://authentik.viktorbarzin.me/if/flow/tripit-enrollment/`.
+
+---
+
+## Step 6 — Verification (before/after — "all access keeps working")
+
+Hosts for the matrix (must be real `auth="required"` default-allow hosts, NOT
+`auth="app"` apps like immich/nextcloud which bypass the catch-all):
+`tripit`, `family`, `hackmd`, `health` (default-allow) + `terminal` (admin-only).
+
+**Before** (capture per user, no redirect-follow; 200=ALLOW, 302→authentik/403=DENY):
+```
+COOKIE='authentik_session=<paste for this user>'; for H in tripit family hackmd health terminal; do
+  printf '%-10s %s\n' "$H" "$(curl -s -o /dev/null -w '%{http_code}' --max-redirs 0 -H "Cookie: $COOKIE" https://$H.viktorbarzin.me/)"; done
+```
+Representative non-admin: `kadir.tugan@gmail.com` (Wrongmove-only) → tripit/family/hackmd/health ALLOW, terminal DENY. Admin `vbarzin@gmail.com` → all ALLOW.
+
+**After Step 1 apply — regression:** re-run identically; both users' results MUST
+be unchanged (diff empty).
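The before/after regression check can be scripted. A minimal sketch, assuming each capture was saved as one `host code` pair per line; the helper names and capture files are hypothetical, and the ALLOW/DENY mapping follows the legend above (a 302 is treated as DENY on the assumption that it redirects to Authentik):

```shell
# Hypothetical helper: map an HTTP status code to the runbook's verdict.
classify_code() {
  case "$1" in
    200)     echo ALLOW ;;
    302|403) echo DENY ;;
    *)       echo UNKNOWN ;;
  esac
}

# Convert "host code" lines on stdin into "host verdict" lines.
to_verdicts() {
  while read -r host code; do
    [ -n "$host" ] || continue
    printf '%s %s\n' "$host" "$(classify_code "$code")"
  done
}

# Succeeds only when both capture files yield identical verdicts.
matrix_unchanged() {
  [ "$(to_verdicts < "$1")" = "$(to_verdicts < "$2")" ]
}
```

Comparing verdicts rather than raw codes is a deliberate choice in this sketch: a host that flips between 302 and 403 is still a DENY, whereas any ALLOW/DENY flip makes `matrix_unchanged before.txt after.txt` fail.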
+
+**After flows — external smoke test (the security proof):** enrol a throwaway
+account via the enrollment URL (email verify + passkey). Confirm it is tagged
+`TripIt External`, then with its cookie:
+```
+for H in tripit family hackmd health terminal frigate; do printf '%-10s %s\n' "$H" \
+  "$(curl -s -o /dev/null -w '%{http_code}' --max-redirs 0 -H "Cookie: authentik_session=<external>" https://$H.viktorbarzin.me/)"; done
+```
+Expect **tripit=200, every other host DENY** (family/hackmd/health were ALLOW for
+kadir — the contrast is the fence proof). Then:
+- **OIDC containment:** with the external account, attempt OIDC login to Vault,
+  Immich, Forgejo, Grafana → each must be DENIED at the app's own login.
+- **Auto-provision:** the TripIt `users` row exists (CNPG primary in ns `dbaas`:
+  `select id,email from tripit.users where email='<throwaway>'`).
+- **Walling-off guard** `AuthentikWallingOffPublicPath` stays green.
+
+**Any 200 on a non-tripit host, or any OIDC app admitting the external account →
+ROLLBACK.**
+
+---
+
+## Step 7 — Standing regression probe (recommended)
+
+Add a permanent `TripIt External` identity to the `blackbox-exporter` guard
+(`stacks/monitoring/.../authentik_walloff_probe.tf` pattern): assert 200 on
+`tripit.viktorbarzin.me` AND DENY on `family.viktorbarzin.me`. This converts the
+"branch stays first" and "user_write keeps the keystone tag" invariants into
+automated `#security` alerts.
+
+---
+
+## Rollback
+
+Revert the `admin-services-restriction.tf` expression (delete the branch) and push
+(= apply); removing a prepended `if g: return …` is behaviour-preserving on
+non-members, restoring prior authz. Disable/delete the throwaway external account
+(with the branch gone, a tagged account falls into default-allow). The empty group
+may stay (harmless). Plan-gate the revert too.
+
+## Operational invariants
+
+- `TripIt External` stays **parentless** (never under `Allow Login Users`).
+- The fence branch stays **first** in `admin-services-restriction`.
+- **Never** co-assign `TripIt External` to a trusted/internal user.
+- The `tripit-enrollment` user_write **`create_users_group`** setting is the
+  keystone — re-verify after any flow edit (clearing it creates UNtagged accounts
+  that fall into default-allow).
+- Authentik SMTP is a live dependency of enrollment + recovery.
--- a/scripts/t3-provision-users.sh
+++ b/scripts/t3-provision-users.sh
@ -270,6 +270,43 @@ install_user_claude_token() {
  log "shared Claude token -> $user (t3-serve env; restart needed to take effect)"
 }

+# Re-deploy the managed per-user Claude launcher to ~/start-claude.sh. /etc/skel only
+# seeds it at account creation (setup-devvm.sh), so without this a launcher edit never
+# reaches EXISTING users — they keep running a stale copy. Copy-if-changed from the repo's
+# skel/, owned by the user, 0755. (We deliberately do NOT re-copy .tmux.conf: terminal-lobby
+# appends a managed persistence section to each user's ~/.tmux.conf that a re-copy would clobber.)
+deploy_user_launcher() {
+  local user="$1" home src dst
+  src="$WORKSTATION_DIR/skel/start-claude.sh"
+  home="$(getent passwd "$user" | cut -d: -f6)"
+  [[ -n "$home" && -d "$home" && -f "$src" ]] || return 0
+  dst="$home/start-claude.sh"
+  cmp -s "$src" "$dst" 2>/dev/null && return 0          # already current -> no churn
+  if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] deploy launcher -> $dst"; return 0; fi
+  install -m 0755 "$src" "$dst"
+  chown "$user:$user" "$dst"
+  log "deployed start-claude.sh -> $user"
+}
+
+# Ensure the per-user NATIVE claude install (the recommended runtime: ~user/.local/bin/claude,
+# self-updating) — used by BOTH the terminal launcher AND the user's t3-serve instance. We do
+# NOT npm-install claude system-wide (npm/npx isn't the recommended runtime); each user gets
+# their own native install. Idempotent: skip if already present. Runs the official native
+# installer AS the user (into their ~/.local). Best-effort: a failure WARNs and retries next
+# reconcile (start-claude.sh also self-bootstraps the terminal path).
+install_user_claude_native() {
+  local user="$1" home
+  home="$(getent passwd "$user" | cut -d: -f6)"
+  [[ -n "$home" && -d "$home" ]] || return 0
+  [[ -x "$home/.local/bin/claude" ]] && return 0          # already native -> done
+  if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] native claude install -> $user"; return 0; fi
+  if runuser -u "$user" -- bash -lc 'set -o pipefail; curl -fsSL https://claude.ai/install.sh | bash' >/dev/null 2>&1; then
+    log "installed native claude -> $user"
+  else
+    log "WARN: native claude install failed for $user (retries next reconcile)"
+  fi
+}
+
 [[ $EUID -eq 0 ]] || { echo "t3-provision-users: must run as root" >&2; exit 1; }
 for bin in python3 jq; do command -v "$bin" >/dev/null || { echo "missing $bin" >&2; exit 1; }; done
 [[ -f "$ROSTER" && -f "$ENGINE" ]] || { echo "roster/engine not under $WORKSTATION_DIR" >&2; exit 1; }
@ -346,8 +383,10 @@ while IFS=$'\t' read -r os_user tier shell groups_csv code_layout repos_csv; do
    fi
    install_user_kubeconfig "$os_user"
    install_user_claude_token "$os_user"
+    deploy_user_launcher "$os_user"          # keep ~/start-claude.sh current (skel only seeds new accounts)
  fi
  refresh_codex_mirror "$os_user"            # all tiers — mirror of the managed claudeMd
+  install_user_claude_native "$os_user"      # all tiers — per-user native claude (terminal + t3); no npm/npx
 done < <(jq -r '.accounts[] | [.os_user, .tier, .shell, (if (.groups|length)==0 then "-" else (.groups|join(",")) end), .code_layout, (if (.repos|length)==0 then "-" else (.repos|join(",")) end)] | @tsv' "$desired_file")

 # 5) per-user .env (sticky port) + enable t3-serve@
--- a/scripts/workstation/setup-devvm.sh
+++ b/scripts/workstation/setup-devvm.sh
@ -21,7 +21,13 @@ export DEBIAN_FRONTEND=noninteractive
 apt-get update -qq
 apt-get install -y "${PKGS[@]}" >/dev/null

-# 2) node >= 18 + claude-code (claude-code requires node >= 18)
+# 2) node >= 18 — needed for the t3 CLI (npm-global, below). NOT for claude-code:
+#    claude-code is the per-user NATIVE install (the recommended, self-updating
+#    ~/.local/bin/claude), provisioned per user by t3-provision-users
+#    (install_user_claude_native) and self-bootstrapped by start-claude.sh on first launch.
+#    We deliberately do NOT `npm install -g @anthropic-ai/claude-code` — npm/npx is not the
+#    recommended runtime, and a system-wide npm copy just shadows/duplicates the per-user
+#    native installs everyone auto-migrates to anyway.
 need_node=1
 if command -v node >/dev/null; then
  [[ "$(node -v | sed 's/^v\([0-9]*\).*/\1/')" -ge 18 ]] && need_node=0
@ -31,14 +37,23 @@ if [[ $need_node -eq 1 ]]; then
  curl -fsSL https://deb.nodesource.com/setup_22.x | bash - >/dev/null
  apt-get install -y nodejs >/dev/null
 fi
-# Detect the GLOBAL npm package, NOT whatever `claude` resolves to on PATH: the admin's
-# personal ~/.local/bin/claude shadows it, so `command -v claude` silently skipped the
-# system-wide install — leaving /usr/lib/node_modules/@anthropic-ai empty and fresh
-# non-admins with no claude (they only worked because the admin's install was on PATH).
-if ! npm ls -g --depth=0 @anthropic-ai/claude-code >/dev/null 2>&1; then
-  log "npm: installing @anthropic-ai/claude-code (system-wide)"
-  npm install -g @anthropic-ai/claude-code >/dev/null
-fi
+
+# 2a) ~/.local/bin on PATH for all LOGIN shells (machine-wide). The native claude install
+#     lives at ~/.local/bin; this guarantees login shells (SSH, etc.) find it regardless of
+#     whether the per-user native-installer rc edit ran. (The terminal launcher sets PATH
+#     itself, and t3-serve@.service hard-sets PATH in the unit.)
+install -d -m 0755 /etc/profile.d
+cat > /etc/profile.d/10-local-bin.sh <<'PROFILE_EOF'
+# Native per-user installs (e.g. claude-code) live in ~/.local/bin — put it on PATH.
+# Guarded so it never duplicates. Sourced by login shells (bash via /etc/profile; zsh
+# login via /etc/zsh/zprofile -> /etc/profile).
+case ":$PATH:" in
+  *":$HOME/.local/bin:"*) ;;
+  *) export PATH="$HOME/.local/bin:$PATH" ;;
+esac
+PROFILE_EOF
+chmod 0644 /etc/profile.d/10-local-bin.sh
+log "/etc/profile.d/10-local-bin.sh (~/.local/bin on PATH for login shells)"

 # 2b) t3 (the per-user coding surface) — PINNED, never nightly/latest. t3 is pre-1.0 and
 #     ships breaking auth-schema + bootstrap-API changes our t3-dispatch can't follow blind
--- a/scripts/workstation/skel/start-claude.sh
+++ b/scripts/workstation/skel/start-claude.sh
@ -11,6 +11,14 @@ echo "  Starting Claude Code in $HOME/code ..."
 echo "  (Right-click for tmux menu, or Ctrl+B then | or - to split)"
 echo ""

+# The native claude install lives in ~/.local/bin. This launcher runs in tmux's non-login
+# env, which does NOT source the user's shell rc (where the native installer added it to
+# PATH) — so `claude` would appear missing here. Put it on PATH ourselves; guarded/idempotent.
+case ":$PATH:" in
+  *":$HOME/.local/bin:"*) ;;
+  *) export PATH="$HOME/.local/bin:$PATH" ;;
+esac
+
 name_args=()
 if [ -n "${TMUX:-}" ]; then
  sess="$(tmux display-message -p '#{session_name}' 2>/dev/null)"
@ -42,14 +50,48 @@ else
  done
 fi

-# Prefer the system-wide `claude` (installed by setup-devvm.sh); fall back to npx.
+# Run the NATIVE `claude` (the recommended install: ~/.local/bin/claude, self-updating).
+# No npm/npx. If the native binary is missing (a fresh account before the hourly reconcile
+# has provisioned it), bootstrap it with the official native installer, then run it.
 launch() {
-  if command -v claude >/dev/null 2>&1; then
-    claude "$@"
+  if ! command -v claude >/dev/null 2>&1; then
+    echo "  Installing Claude Code (native) for $(id -un) …"
+    ( set -o pipefail; curl -fsSL https://claude.ai/install.sh | bash ) || return 127   # pipefail: a failed download is not masked by bash's exit 0
+    export PATH="$HOME/.local/bin:$PATH"
+  fi
+  claude "$@"
+}
+
+# Re-assert Claude Code's first-run onboarding flag before launch. ~/.claude.json is a
+# SINGLE file that ALL of a user's concurrent claude processes (this terminal, their
+# t3-serve instance, agent/SDK sessions) read-modify-write; a stale writer periodically
+# drops top-level keys — including hasCompletedOnboarding — which throws the next
+# interactive session back to the "Choose the text style" wizard even though the user is
+# fully logged in (credentials live in the SEPARATE ~/.claude/.credentials.json, which is
+# never affected). Idempotent, runs as the user right before launch, never clobbers other
+# keys. Best-effort: no-op if jq is missing or the file is empty/corrupt (claude self-heals).
+ensure_onboarding() {
+  command -v jq >/dev/null 2>&1 || return 0
+  local cfg="$HOME/.claude.json" ver tmp
+  ver="$(claude --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)"
+  if [ -s "$cfg" ]; then
+    jq -e . "$cfg" >/dev/null 2>&1 || return 0                                     # corrupt -> leave for claude
+    [ "$(jq -r '.hasCompletedOnboarding // false' "$cfg")" = "true" ] && return 0  # already set -> no write
+  elif [ -e "$cfg" ]; then
+    return 0                                                                       # empty (mid-write?) -> leave it
+  fi
+  tmp="$(mktemp "${cfg}.XXXXXX")" || return 0
+  if [ -f "$cfg" ]; then
+    jq --arg v "$ver" '.hasCompletedOnboarding = true
+      | (if $v != "" then .lastOnboardingVersion = $v else . end)' "$cfg" > "$tmp" 2>/dev/null \
+      && chmod 600 "$tmp" && mv "$tmp" "$cfg" || rm -f "$tmp"
  else
-    npx @anthropic-ai/claude-code "$@"
+    jq -n --arg v "$ver" '{hasCompletedOnboarding: true}
+      + (if $v != "" then {lastOnboardingVersion: $v} else {} end)' > "$tmp" 2>/dev/null \
+      && chmod 600 "$tmp" && mv "$tmp" "$cfg" || rm -f "$tmp"
  fi
 }
+ensure_onboarding

 # Deliberately not `exec` so we can branch on the exit code: clean quit ends the
 # pane (ttyd closes the terminal); a crash drops to a shell so the tmux session
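The read-modify-write that `ensure_onboarding` performs with jq can be modelled in a few lines. This is a minimal Python sketch of the same idea, not the launcher's implementation: it assumes the config is a JSON object, and the function signature and return value are illustrative. It leaves empty or corrupt files alone, skips the write when the flag is already set, preserves every other key, and swaps the file in with an atomic rename.

```python
import json
import os
import tempfile
from pathlib import Path


def ensure_onboarding(cfg: Path, version: str = "") -> bool:
    """Set hasCompletedOnboarding in cfg; return True only if the file was written."""
    data = {}
    if cfg.exists():
        if cfg.stat().st_size == 0:
            return False  # empty (possibly mid-write): leave it alone
        try:
            data = json.loads(cfg.read_text())
        except json.JSONDecodeError:
            return False  # corrupt: leave it for claude to self-heal
        if not isinstance(data, dict):
            return False  # not a JSON object: nothing safe to merge into
        if data.get("hasCompletedOnboarding") is True:
            return False  # already set: no write, no churn

    data["hasCompletedOnboarding"] = True
    if version:
        data["lastOnboardingVersion"] = version

    # Write to a sibling temp file, then rename over the target so readers
    # never observe a half-written config.
    fd, tmp = tempfile.mkstemp(prefix=cfg.name + ".", dir=cfg.parent)
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(data, fh)
        os.chmod(tmp, 0o600)
        os.replace(tmp, cfg)
    except OSError:
        if os.path.exists(tmp):
            os.unlink(tmp)
        return False
    return True
```

The rename narrows the window in which a concurrent reader sees a partial file, but it does not stop a stale writer from later overwriting the key, which is why the launcher re-asserts the flag on every start.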
--- a/stacks/android-emulator/variables.tf
+++ b/stacks/android-emulator/variables.tf
@ -6,5 +6,5 @@ variable "tls_secret_name" {
 variable "image_tag" {
  type        = string
  default     = "latest"
-  description = "android-emulator image tag at forgejo.viktorbarzin.me/viktor/android-emulator. Built by GHA (.github/workflows/build-android-emulator.yml) -> ghcr.io/viktorbarzin/android-emulator on changes to stacks/android-emulator/docker/ (ADR-0002). :latest tracks the newest build."
+  description = "android-emulator image tag at ghcr.io/viktorbarzin/android-emulator. Built by GHA (.github/workflows/build-android-emulator.yml) on changes to stacks/android-emulator/docker/ (ADR-0002). :latest tracks the newest build."
 }
--- a/stacks/anisette/main.tf
+++ b/stacks/anisette/main.tf
@ -0,0 +1,189 @@
+# anisette — self-hosted Apple anisette-data server for SideStore/AltStore.
+#
+# Purpose (infra issue #40): the TripIt iOS Shell is sideloaded with SideStore
+# using a free Apple ID. SideStore needs an "anisette" server to broker the
+# Apple-ID auth dance; the public community anisette servers see every login,
+# so we run our own. Stateless HTTP service on a stable INTERNAL endpoint
+# (anisette.viktorbarzin.lan) that SideStore points at.
+#
+# Image: Dadoum/anisette-v3-server — the de-facto standard anisette-v3 server
+# for SideStore/AltStore (the same project SideStore's own docs point at).
+# Upstream publishes ONLY a mutable :latest tag (no GitHub releases, no semver,
+# no date/sha tags — verified 2026-06-14), so we pin by MANIFEST DIGEST instead
+# (immutable, honours the "never :latest" rule). DockerHub is pulled
+# transparently via the registry-VM pull-through cache, same as echo/cyberchef.
+# To bump: `docker buildx imagetools inspect dadoum/anisette-v3-server:latest`,
+# then replace the digest below.
+#
+# Stateless: the container caches Apple provisioning libraries under
+# /home/Alcoholic/.config/anisette-v3/lib (a regenerable download — re-fetched
+# if absent — and per upstream issue #23 it does NOT preserve client auth across
+# restarts anyway). So an emptyDir is the honest fit: keeps that path writable
+# without taking on a backup-pipeline obligation. No PVC, no Vault secret.
+
+variable "tls_secret_name" {
+  type      = string
+  sensitive = true
+}
+
+resource "kubernetes_namespace" "anisette" {
+  metadata {
+    name = "anisette"
+    labels = {
+      "istio-injection" : "disabled"
+      tier = local.tiers.aux
+    }
+  }
+  lifecycle {
+    # KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
+    ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
+  }
+}
+
+module "tls_secret" {
+  source          = "../../modules/kubernetes/setup_tls_secret"
+  namespace       = kubernetes_namespace.anisette.metadata[0].name
+  tls_secret_name = var.tls_secret_name
+}
+
+resource "kubernetes_deployment" "anisette" {
+  metadata {
+    name      = "anisette"
+    namespace = kubernetes_namespace.anisette.metadata[0].name
+    labels = {
+      app  = "anisette"
+      tier = local.tiers.aux
+    }
+  }
+  # anisette downloads + initializes Apple's CoreADI provisioning library on
+  # first start (slow, memory-spiky). wait_for_rollout=false so the apply never
+  # blocks on — and never strands out of terraform state — a pod that is still
+  # warming up (mirrors tts/llama-cpp). Pod health is still gated by the
+  # readiness probe below, so the Service only routes once it's actually up.
+  wait_for_rollout = false
+  spec {
+    replicas = 1
+    selector {
+      match_labels = {
+        app = "anisette"
+      }
+    }
+    template {
+      metadata {
+        labels = {
+          app = "anisette"
+        }
+        annotations = {
+          # Diun notify-only watch. Upstream tags only :latest, so watch the
+          # digest of :latest rather than a semver pattern.
+          "diun.enable"       = "true"
+          "diun.watch_repo"   = "false"
+          "diun.include_tags" = "^latest$"
+        }
+      }
+      spec {
+        container {
+          # Pinned by digest: upstream ships only a mutable :latest (no versioned tags).
+          # The `docker.io/` prefix is REQUIRED, not cosmetic: the Kyverno
+          # require-trusted-registries policy allowlists `docker.io/*` but NOT a
+          # bare `dadoum/*` prefix (only enumerated DockerHub user repos like
+          # mendhak/*, mpepping/* are listed in
+          # stacks/kyverno/modules/kyverno/security-policies.tf). A bare
+          # `dadoum/anisette-v3-server@...` is denied at admission; the explicit
+          # docker.io/ registry matches the allowlist and still pulls via the
+          # 10.0.20.10 pull-through cache.
+          image = "docker.io/dadoum/anisette-v3-server@sha256:1e20384985d3c49965f444bef39d627768dacc39ea0dca91f2a535edb7591ba3"
+          name  = "anisette"
+          port {
+            name           = "http"
+            container_port = 6969
+          }
+          # The image runs as the non-root user "Alcoholic" and writes its
+          # provisioning-library cache here; back it with an emptyDir so the
+          # path is writable (stateless — wiped on restart, re-downloaded).
+          volume_mount {
+            name       = "provisioning-cache"
+            mount_path = "/home/Alcoholic/.config/anisette-v3/lib"
+          }
+          resources {
+            requests = {
+              cpu    = "10m"
+              memory = "256Mi"
+            }
+            limits = {
+              # anisette downloads + initializes Apple's CoreADI provisioning
+              # library at startup, which spikes past 128Mi → OOMKilled (exit
+              # 137) before it can bind :6969. 512Mi gives headroom; steady
+              # state is much lower.
+              memory = "512Mi"
+            }
+          }
+          readiness_probe {
+            http_get {
+              path = "/"
+              port = 6969
+            }
+            period_seconds        = 15
+            initial_delay_seconds = 5
+          }
+          liveness_probe {
+            http_get {
+              path = "/"
+              port = 6969
+            }
+            period_seconds    = 30
+            failure_threshold = 6
+          }
+        }
+        volume {
+          name = "provisioning-cache"
+          empty_dir {}
+        }
+      }
+    }
+  }
+  lifecycle {
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+    ]
+  }
+}
+
+resource "kubernetes_service" "anisette" {
+  metadata {
+    name      = "anisette"
+    namespace = kubernetes_namespace.anisette.metadata[0].name
+    labels = {
+      "app" = "anisette"
+    }
+  }
+  spec {
+    selector = {
+      app = "anisette"
+    }
+    port {
+      name        = "http"
+      port        = 80
+      target_port = "6969"
+    }
+  }
+}
+
+module "ingress" {
+  source = "../../modules/kubernetes/ingress_factory"
+  # auth = "none": SideStore is a native iOS client — it can't replay the
+  # Authentik forward-auth cookie dance, so Authentik would break it (same
+  # reasoning as android-emulator's adb). Internal-only: anisette.viktorbarzin.lan,
+  # allow_local_access_only locks it to the LAN, and it brokers no user data of
+  # ours (it just relays Apple-ID anisette data). Never publicly exposed.
+  auth                    = "none"
+  namespace               = kubernetes_namespace.anisette.metadata[0].name
+  name                    = "anisette"
+  root_domain             = "viktorbarzin.lan"
+  tls_secret_name         = var.tls_secret_name
+  allow_local_access_only = true
+  ssl_redirect            = false
+  extra_annotations = {
+    "gethomepage.dev/enabled" = "false"
+  }
+}
--- a/stacks/anisette/secrets
+++ b/stacks/anisette/secrets
@ -0,0 +1 @@
+../../secrets
--- a/stacks/anisette/terragrunt.hcl
+++ b/stacks/anisette/terragrunt.hcl
@ -0,0 +1,8 @@
+include "root" {
+  path = find_in_parent_folders()
+}
+
+dependency "platform" {
+  config_path  = "../platform"
+  skip_outputs = true
+}
--- a/stacks/authentik/admin-services-restriction.tf
+++ b/stacks/authentik/admin-services-restriction.tf
@ -49,6 +49,21 @@ resource "authentik_policy_expression" "admin_services_restriction" {

    host = request.context.get("host", "")

+    # TripIt External containment fence (ADR-0020 in the tripit repo). Publicly
+    # self-enrolled TripIt users (group "TripIt External", assigned by the
+    # tripit-enrollment flow's user_write) may reach tripit.viktorbarzin.me and
+    # NOTHING else. MUST be the FIRST host-dispatch branch: it is a request.user
+    # predicate that must dominate every host branch below, ESPECIALLY the
+    # default-allow `if host not in ADMIN_ONLY_HOSTS: return True` — placed after
+    # it, a tagged user would slip into other hosts. Safe to add: the group is
+    # net-new and created EMPTY, so this matches zero existing principals (no
+    # lockout). The fence is forward-auth ONLY; OIDC apps (Vault, Immich, …)
+    # contain External users via their own per-app group bindings — see
+    # docs/runbooks/tripit-external-signup.md. NEVER co-assign "TripIt External"
+    # to a trusted/internal user (this branch would fence them out of admin hosts).
+    if ak_is_group_member(request.user, name="TripIt External"):
+        return host == "tripit.viktorbarzin.me"
+
    # t3 Workstation edge gate: only members of "T3 Users" may reach t3.
    # Placed BEFORE the ADMIN_ONLY_HOSTS early-return (t3 is intentionally not in
    # that set — it must not require Home-Server-Admins, just T3 Users membership).
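A minimal sketch of why the fence has to be the first branch. It models the policy as a plain Python function under simplifying assumptions: the contents of `ADMIN_ONLY_HOSTS` and the host `other.example.test` are hypothetical stand-ins, and the admin check is reduced to membership in "Home Server Admins". Only the ordering of the two branches is the point.

```python
ADMIN_ONLY_HOSTS = {"admin.example.test"}  # hypothetical stand-in for the real set
TRIPIT_HOST = "tripit.viktorbarzin.me"
FENCED_GROUP = "TripIt External"


def fence_first(groups: set[str], host: str) -> bool:
    """Fence evaluated before default-allow (the order used in the diff)."""
    if FENCED_GROUP in groups:
        return host == TRIPIT_HOST
    if host not in ADMIN_ONLY_HOSTS:
        return True
    return "Home Server Admins" in groups  # simplified admin gate


def fence_last(groups: set[str], host: str) -> bool:
    """Same branches with the fence misplaced after default-allow."""
    if host not in ADMIN_ONLY_HOSTS:
        return True  # a fenced user returns here and never reaches the fence
    if FENCED_GROUP in groups:
        return host == TRIPIT_HOST
    return "Home Server Admins" in groups
```

With the fence first, a member of "TripIt External" is allowed on the TripIt host and denied everywhere else. With the fence last, the same user is admitted to any host outside `ADMIN_ONLY_HOSTS`, because the default-allow branch returns before the fence is consulted.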
--- a/stacks/authentik/tripit-external.tf
+++ b/stacks/authentik/tripit-external.tf
@ -0,0 +1,22 @@
+# "TripIt External" group — containment anchor for publicly self-enrolled TripIt
+# users (ADR-0020 in the tripit repo). Members are admitted to
+# tripit.viktorbarzin.me ONLY and denied every other *.viktorbarzin.me
+# forward-auth host by the prepended branch in admin-services-restriction.tf.
+#
+# Created EMPTY and PARENTLESS, on purpose:
+#   * EMPTY — the no-lockout guarantee. Zero members at apply time => the
+#     prepended policy branch matches zero existing principals => it cannot
+#     change anyone's authorization (contrast authentik_group "T3 Users", which
+#     is created WITH members atomically because THAT gate's safety property is
+#     the opposite). Membership is assigned at RUNTIME by the tripit-enrollment
+#     flow's user_write "Create users group" option (UI-managed per the ADR
+#     management split). Terraform owns only the group's EXISTENCE.
+#   * PARENTLESS — do NOT make this a child of "Allow Login Users". The sensitive
+#     OIDC apps gate on "Home Server Admins" / "Headscale Users" / "Wrongmove
+#     Users" (children of "Allow Login Users") or, for Vault, on "Allow Login
+#     Users" itself (bound as part of ADR-0020). Keeping External out of that
+#     tree is what stops these users reaching OIDC apps — mirrors guest.tf, which
+#     keeps the guest group out of "Allow Login Users" for the same reason.
+resource "authentik_group" "tripit_external" {
+  name = "TripIt External"
+}
--- a/stacks/authentik/vault-authz-binding.tf
+++ b/stacks/authentik/vault-authz-binding.tf
@ -0,0 +1,27 @@
+# Vault OIDC authorization fence (ADR-0020). The "Vault" Authentik application had
+# NO authorization binding (audit 2026-06-15: any authenticated identity could
+# complete Vault OIDC login and receive Vault's built-in `default`-policy token —
+# token self-management/cubbyhole, no secret access, but still more than an
+# outside user should hold). Bind it to "Allow Login Users" so only established
+# homelab users can log in: they inherit that base group via its children
+# (Home Server Admins / Headscale Users / Wrongmove Users — verified live that
+# `User.all_groups()` includes the parent), while publicly self-enrolled
+# "TripIt External" users (deliberately PARENTLESS, so NOT in Allow Login Users)
+# are denied at the Vault consent step. Closes the one OIDC app the forward-auth
+# fence cannot reach; the other sensitive OIDC apps already bind a trusted group.
+#
+# The Vault application itself stays UI-managed (like the other OIDC apps); this
+# adds ONLY the authorization binding. policy_engine_mode on the app is "any", so
+# with this single group binding, membership in that group is required to authorize.
+#
+# UUIDs are PINNED as literals: this provider version has NO
+# `data "authentik_application"` data source (CI pipeline 198 failed on it), and
+# both objects are UI-managed and stable. To re-fetch if either is recreated, run
+# `ak shell` in the goauthentik-server pod and read
+# `Application.objects.get(name="Vault").pbm_uuid` and
+# `Group.objects.get(name="Allow Login Users").group_uuid`.
+resource "authentik_policy_binding" "vault_allow_login_users" {
+  target = "fe5698e3-b6b1-4475-98fa-ce2bae22f4dd" # Authentik application "Vault" (pbm_uuid)
+  group  = "b4823cd7-8ed8-4d2f-8f94-bc285138f853" # group "Allow Login Users" (group_uuid)
+  order  = 0
+}
--- a/stacks/dbaas/modules/dbaas/main.tf
+++ b/stacks/dbaas/modules/dbaas/main.tf
@ -427,7 +427,7 @@ resource "kubernetes_cron_job_v1" "mysql-backup" {
    failed_jobs_history_limit = 5
    schedule                  = "30 0 * * *"
    # schedule                      = "* * * * *"
-    starting_deadline_seconds     = 10
+    starting_deadline_seconds     = 600
    successful_jobs_history_limit = 10
    job_template {
      metadata {}
@ -519,7 +519,7 @@ resource "kubernetes_cron_job_v1" "mysql-backup-per-db" {
    concurrency_policy            = "Replace"
    failed_jobs_history_limit     = 3
    schedule                      = "45 0 * * *"
-    starting_deadline_seconds     = 10
+    starting_deadline_seconds     = 600
    successful_jobs_history_limit = 3
    job_template {
      metadata {}
@ -1607,7 +1607,12 @@ resource "kubernetes_cron_job_v1" "postgresql-backup" {
    failed_jobs_history_limit = 5
    schedule                  = "0 0 * * *"
    # schedule                      = "* * * * *"
-    starting_deadline_seconds     = 10
+    # 600s (was 10s): a 10s deadline silently DROPPED the 2026-06-13 00:00 run
+    # when the CronJob controller was late at the midnight backup/IO-storm tick,
+    # leaving the last full dump 37h old (fired PostgreSQLBackupStale). 600s lets
+    # a brief controller lag still launch the job. Same fix on the other three
+    # dbaas backup crons (they share the midnight window).
+    starting_deadline_seconds     = 600
    successful_jobs_history_limit = 10
    job_template {
      metadata {}
@ -1695,7 +1700,7 @@ resource "kubernetes_cron_job_v1" "postgresql-backup-per-db" {
    concurrency_policy            = "Replace"
    failed_jobs_history_limit     = 3
    schedule                      = "15 0 * * *"
-    starting_deadline_seconds     = 10
+    starting_deadline_seconds     = 600
    successful_jobs_history_limit = 3
    job_template {
      metadata {}
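The effect of `starting_deadline_seconds` can be illustrated with a simplified model: a scheduled run is still launched only if the controller gets to it within the deadline. The sketch below is an approximation of that rule, not the CronJob controller's actual algorithm, and the 45-second lag is a hypothetical figure (the commit does not say how late the controller was).

```python
from datetime import datetime, timedelta


def still_launches(scheduled: datetime, noticed: datetime, deadline_seconds: int) -> bool:
    """Simplified rule: launch only if the controller is at most deadline_seconds late."""
    lateness = noticed - scheduled
    return lateness <= timedelta(seconds=deadline_seconds)


MIDNIGHT = datetime(2026, 6, 13, 0, 0, 0)      # the dropped run from the comment above
LATE_TICK = MIDNIGHT + timedelta(seconds=45)   # hypothetical controller lag

if __name__ == "__main__":
    for deadline in (10, 600):
        verdict = "launches" if still_launches(MIDNIGHT, LATE_TICK, deadline) else "dropped"
        print(f"deadline={deadline}s -> {verdict}")
```

Under this model, any lag between 10 and 600 seconds is the range where the old setting drops the backup and the new one still runs it.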
--- a/stacks/f1-stream/main.tf
+++ b/stacks/f1-stream/main.tf
@ -128,7 +128,7 @@ resource "kubernetes_deployment" "f1-stream" {
      }
      spec {
        container {
-          image             = "forgejo.viktorbarzin.me/viktor/f1-stream:${var.image_tag}"
+          image             = "ghcr.io/viktorbarzin/f1-stream:${var.image_tag}"
          image_pull_policy = "Always"
          name              = "f1-stream"
          # Right-sized 2026-06-05: was 1Gi (bundled-Chromium era). The image is
--- a/stacks/forgejo/main.tf
+++ b/stacks/forgejo/main.tf
@ -11,6 +11,12 @@ resource "kubernetes_namespace" "forgejo" {
      "istio-injection" : "disabled"
      tier               = local.tiers.edge
      "keel.sh/enrolled" = "true"
+      # Opt out of the auto-generated tier-3-edge ResourceQuota (caps
+      # requests.memory at 4Gi). Forgejo's own pod requests 4Gi (the
+      # git + OCI-registry backbone, Guaranteed QoS), which pegged that
+      # tier quota at 100% and fired KubeQuotaAlmostFull. The
+      # forgejo-specific quota below gives headroom. Same pattern as dbaas.
+      "resource-governance/custom-quota" = "true"
    }
  }
  lifecycle {
@ -19,6 +25,26 @@ resource "kubernetes_namespace" "forgejo" {
  }
 }

+# Custom ResourceQuota — replaces the tier-3-edge auto quota (opted out via the
+# resource-governance/custom-quota label above). requests.memory is 8Gi so the
+# 4Gi Forgejo pod sits at ~50% (clears KubeQuotaAlmostFull + the healthcheck
+# resourcequota check) with room for a transient migration/sidecar pod. To
+# raise Forgejo's memory limit past 4Gi later, bump requests.memory here too.
+resource "kubernetes_resource_quota" "forgejo" {
+  metadata {
+    name      = "forgejo-quota"
+    namespace = kubernetes_namespace.forgejo.metadata[0].name
+  }
+  spec {
+    hard = {
+      "requests.cpu"    = "4"
+      "requests.memory" = "8Gi"
+      "limits.memory"   = "32Gi"
+      pods              = "30"
+    }
+  }
+}
+
 module "tls_secret" {
  source          = "../../modules/kubernetes/setup_tls_secret"
  namespace       = kubernetes_namespace.forgejo.metadata[0].name
@ -168,19 +194,29 @@ resource "kubernetes_deployment" "forgejo" {
            name       = "data"
            mount_path = "/data"
          }
-          # Bumped 1Gi -> 3Gi 2026-06-09: Forgejo was OOMKilled (exit 137)
-          # under registry-push load from in-cluster CI builds (tripit
-          # buildkit pushes large layers into the OCI registry). VPA
-          # upperBound reads ~1.5Gi, but that's suppressed by the 1Gi cap it
-          # kept OOMing against — size for the push spike, not steady-state.
+          # Bumped 1Gi -> 3Gi 2026-06-09, then 3Gi -> 4Gi 2026-06-13.
+          # OOMKilled again (exit 137) at the 3Gi cap on 2026-06-13 (2
+          # restarts; briefly took the git remote + OCI registry down and
+          # spiked ingress TTFB/4xx). Steady-state is ~2.2Gi, but it spiked
+          # past the 3Gi cap. 4Gi was the CEILING under the tier-3-edge auto
+          # quota (requests.memory capped at 4Gi; Guaranteed QoS means
+          # request == limit, so a pod could request at most 4Gi). A first
+          # attempt at 6Gi was REJECTED (FailedCreate: exceeded quota) and
+          # left forgejo with 0 pods until reverted -- do NOT raise memory
+          # without ALSO checking the namespace quota (now forgejo-quota,
+          # requests.memory 8Gi). The 6/9 OOM driver (tripit buildkit registry
+          # pushes) is gone now that the Forgejo registry was frozen + emptied
+          # 2026-06-13 (ADR-0002, ghcr), so the remaining spike is git ops /
+          # integrity-probe catalog walk / a possible leak; 4Gi should
+          # suffice. If it still OOMs, raise this limit and the quota together.
          # requests=limits (Guaranteed QoS) per the repo memory convention.
          resources {
            requests = {
              cpu    = "15m"
-              memory = "3Gi"
+              memory = "4Gi"
            }
            limits = {
-              memory = "3Gi"
+              memory = "4Gi"
            }
          }
          port {
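The quota arithmetic behind the Forgejo change can be checked in a few lines. This sketch assumes whole-number `Gi` quantities only and a single `requests.memory` dimension; the helper names are illustrative and not part of any Kubernetes client library.

```python
GIB = 1024 ** 3


def gi(quantity: str) -> int:
    """Parse a whole-number Gi quantity such as "4Gi" into bytes (sketch: Gi only)."""
    if not quantity.endswith("Gi"):
        raise ValueError(f"unsupported quantity: {quantity}")
    return int(quantity[:-2]) * GIB


def quota_used(requests: list[str], hard: str) -> float:
    """Fraction of the requests.memory quota consumed by the given pod requests."""
    return sum(gi(r) for r in requests) / gi(hard)


def admits(existing: list[str], new: str, hard: str) -> bool:
    """Would a new pod request fit under the hard requests.memory quota?"""
    return sum(gi(r) for r in existing) + gi(new) <= gi(hard)
```

With these helpers, a single 4Gi pod uses the whole 4Gi tier quota and half of the 8Gi custom quota, and a 6Gi request is rejected under 4Gi but admitted under 8Gi, which matches the FailedCreate described in the comment above.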
--- a/stacks/health/main.tf
+++ b/stacks/health/main.tf
@ -9,7 +9,7 @@ resource "kubernetes_namespace" "health" {
  metadata {
    name = "health"
    labels = {
-      tier = local.tiers.aux
+      tier               = local.tiers.aux
      "keel.sh/enrolled" = "true"
    }
  }
@ -128,6 +128,15 @@ resource "kubernetes_deployment" "health" {
            name  = "COOKIE_SECURE"
            value = "true"
          }
+          env {
+            # ADR-0008 (health repo): identity for the internal LAN test host.
+            # Only reached when no X-authentik-email header is present — i.e. via
+            # the auth="none" test ingress below. The public host's forward-auth
+            # fails closed, so requests arriving there always carry the real
+            # header and never fall back to this value.
+            name  = "DEV_AUTH_EMAIL"
+            value = "vbarzin@gmail.com"
+          }

          volume_mount {
            name       = "uploads"
@ -197,6 +206,15 @@ module "ingress" {
  name            = "health"
  tls_secret_name = var.tls_secret_name
  max_body_size   = "100m"
+  # The redesigned SPA bursts well past the default 10/50 limiter on each page
+  # load (shell + fonts + a 5-8 call API burst). Swap the shared limiter for a
+  # health-specific one (100/1000), mirroring tripit/immich/actualbudget.
+  # The ref MUST carry the middleware's namespace prefix: the CRD lives in the
+  # `traefik` ns, so it's `traefik-health-rate-limit@kubernetescrd` (same form as
+  # traefik-tripit-rate-limit). Without the prefix Traefik can't resolve it and
+  # 404s the whole router.
+  skip_default_rate_limit = true
+  extra_middlewares       = ["traefik-health-rate-limit@kubernetescrd"]
  extra_annotations = {
    "gethomepage.dev/enabled"      = "true"
    "gethomepage.dev/name"         = "Health"
@ -207,6 +225,30 @@ module "ingress" {
  }
 }

+# https://health-test.viktorbarzin.lan — internal LAN-only test host for
+# automated/E2E testing + manual screenshots without the Authentik SSO dance
+# (ADR-0008). Same `health` deployment; acts as DEV_AUTH_EMAIL=vbarzin@gmail.com.
+module "ingress_test" {
+  source = "../../modules/kubernetes/ingress_factory"
+  # auth = "none": LAN-only (allow_local_access_only) test host — no public
+  # exposure; the public health.viktorbarzin.me ingress above stays
+  # auth="required". No user data gate here by design — it serves the real app
+  # as DEV_AUTH_EMAIL since no X-authentik-email is injected (ADR-0008).
+  auth                    = "none"
+  namespace               = kubernetes_namespace.health.metadata[0].name
+  name                    = "health-test"
+  root_domain             = "viktorbarzin.lan"
+  service_name            = kubernetes_service.health.metadata[0].name
+  tls_secret_name         = var.tls_secret_name
+  allow_local_access_only = true
+  ssl_redirect            = false
+  max_body_size           = "100m"
+  anti_ai_scraping        = false
+  extra_annotations = {
+    "gethomepage.dev/enabled" = "false"
+  }
+}
+
 resource "kubernetes_manifest" "external_secret_db" {
  manifest = {
    apiVersion = "external-secrets.io/v1beta1"
--- a/stacks/k8s-portal/modules/k8s-portal/files/Dockerfile
+++ b/stacks/k8s-portal/modules/k8s-portal/files/Dockerfile
@ -1,7 +1,7 @@
 FROM node:22-alpine AS build
 WORKDIR /app
 COPY package*.json ./
-RUN npm ci
+RUN npm install --no-audit --no-fund
 COPY . .
 RUN npm run build

--- a/stacks/k8s-portal/modules/k8s-portal/files/package-lock.json
+++ b/stacks/k8s-portal/modules/k8s-portal/files/package-lock.json
--- a/stacks/k8s-portal/modules/k8s-portal/files/package.json
+++ b/stacks/k8s-portal/modules/k8s-portal/files/package.json
@ -0,0 +1,24 @@
+{
+	"name": "k8s-portal",
+	"private": true,
+	"version": "0.0.1",
+	"type": "module",
+	"scripts": {
+		"dev": "vite dev",
+		"build": "vite build",
+		"preview": "vite preview",
+		"prepare": "svelte-kit sync || echo ''",
+		"check": "svelte-kit sync && svelte-check --tsconfig ./tsconfig.json",
+		"check:watch": "svelte-kit sync && svelte-check --tsconfig ./tsconfig.json --watch"
+	},
+	"devDependencies": {
+		"@sveltejs/adapter-auto": "^7.0.0",
+		"@sveltejs/adapter-node": "^5.5.3",
+		"@sveltejs/kit": "^2.50.2",
+		"@sveltejs/vite-plugin-svelte": "^6.2.4",
+		"svelte": "^5.49.2",
+		"svelte-check": "^4.3.6",
+		"typescript": "^5.9.3",
+		"vite": "^7.3.1"
+	}
+}
--- a/stacks/k8s-portal/modules/k8s-portal/main.tf
+++ b/stacks/k8s-portal/modules/k8s-portal/main.tf
@ -9,7 +9,7 @@ resource "kubernetes_namespace" "k8s_portal" {
  metadata {
    name = "k8s-portal"
    labels = {
-      tier = var.tier
+      tier               = var.tier
      "keel.sh/enrolled" = "true"
    }
  }
@ -40,6 +40,15 @@ resource "kubernetes_deployment" "k8s_portal" {
  metadata {
    name      = "k8s-portal"
    namespace = kubernetes_namespace.k8s_portal.metadata[0].name
+    # ADR-0002 / no-local-builds: image now GHA-built -> ghcr:latest
+    # (.github/workflows/build-k8s-portal.yml). Keel polls ghcr:latest and rolls
+    # this deployment (replaces the removed Woodpecker in-cluster build+deploy).
+    annotations = {
+      "keel.sh/policy"       = "force"
+      "keel.sh/trigger"      = "poll"
+      "keel.sh/pollSchedule" = "@every 5m"
+      "keel.sh/match-tag"    = "true"
+    }
    labels = {
      app  = "k8s-portal"
      tier = var.tier
@ -66,9 +75,16 @@ resource "kubernetes_deployment" "k8s_portal" {
      }

      spec {
+        # GHCR pull secret: the ghcr-credentials Secret in this namespace is
+        # cloned in by the kyverno stack's sync-ghcr-credentials ClusterPolicy
+        # (allowlisted private-ghcr namespaces only — ADR-0002). Source of
+        # truth: stacks/kyverno/modules/kyverno/ghcr-credentials.tf.
+        image_pull_secrets {
+          name = "ghcr-credentials"
+        }
        container {
          name  = "portal"
-          image = "viktorbarzin/k8s-portal:latest"
+          image = "ghcr.io/viktorbarzin/k8s-portal:latest"
          port {
            container_port = 3000
          }
@ -121,7 +137,8 @@ resource "kubernetes_deployment" "k8s_portal" {
    # DRIFT_WORKAROUND: CI pipeline owns image tag (kubectl set image from Woodpecker/GHA); Kyverno mutates dns_config for ndots. Reviewed 2026-04-18.
    ignore_changes = [
      spec[0].template[0].spec[0].dns_config,         # KYVERNO_LIFECYCLE_V1
-      spec[0].template[0].spec[0].container[0].image, # CI updates image tag
+      spec[0].template[0].spec[0].container[0].image, # Keel manages ghcr:latest digest
+      metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1 (Keel stamps on roll)
    ]
  }
 }
@ -172,5 +189,5 @@ module "ingress_setup_script" {
  ingress_path    = ["/setup/script", "/agent"]
  tls_secret_name = var.tls_secret_name
  # auth = "none": Setup script + agent endpoint must be curl-able without auth (no cookies preserved in automation).
-  auth            = "none"
+  auth = "none"
 }
--- a/stacks/kyverno/modules/kyverno/ghcr-credentials.tf
+++ b/stacks/kyverno/modules/kyverno/ghcr-credentials.tf
@ -27,6 +27,10 @@ locals {
    # openclaw's install-recruiter-plugin init container pulls the PRIVATE
    # ghcr.io/viktorbarzin/recruiter-responder:latest image (infra#27).
    "openclaw",
+    # k8s-portal: last in-cluster image build, migrated to GHA→ghcr (ADR-0002,
+    # "no local builds"). ghcr.io/viktorbarzin/k8s-portal:latest is PRIVATE
+    # (infra repo default); the deployment references the cloned secret.
+    "k8s-portal",
  ]
 }

--- a/stacks/openclaw/main.tf
+++ b/stacks/openclaw/main.tf
@ -553,7 +553,7 @@ resource "kubernetes_deployment" "openclaw" {
          # IfNotPresent: a cached stale :latest meant the plugin manifest
          # (configSchema fix) never got pulled. An uncached SHA forces the
          # pull. Bump this when the openclaw plugin in nextcloud-todos changes.
-          image             = "forgejo.viktorbarzin.me/viktor/nextcloud-todos:f85c6de1"
+          image             = "ghcr.io/viktorbarzin/nextcloud-todos:latest"
          image_pull_policy = "Always"
          command = ["sh", "-c", <<-EOT
            set -eu
--- a/stacks/t3-afk/files/dispatcher.js
+++ b/stacks/t3-afk/files/dispatcher.js
@ -0,0 +1,136 @@
+// t3-afk auto-pair dispatcher
+// ----------------------------------------------------------------------------
+// Replicates the devvm t3-dispatch experience for the single in-cluster T3
+// instance. The ingress is Authentik-gated (auth=required), so every request
+// that reaches here is already authenticated. On a cookieless *document*
+// navigation we mint a one-time pairing credential (`t3 auth pairing create`)
+// and exchange it at the t3 server's /api/auth/browser-session endpoint for the
+// `t3_session` cookie, then 302 back — so the user never sees the manual
+// /pair#token screen. Everything else (incl. WebSocket upgrades for the cockpit
+// live stream + terminals) is reverse-proxied straight through to t3 serve.
+//
+// Single upstream, same pod (localhost) — kept dependency-free (Node stdlib).
+'use strict';
+const http = require('http');
+const net = require('net');
+const { execFile } = require('child_process');
+
+const UPSTREAM_HOST = '127.0.0.1';
+const UPSTREAM_PORT = Number(process.env.T3_UPSTREAM_PORT || 3773);
+const LISTEN_PORT = Number(process.env.DISPATCHER_PORT || 8080);
+const T3_BIN = process.env.T3_BIN || '/data/npm-global/bin/t3';
+const BASE_DIR = process.env.T3CODE_HOME || '/data/t3';
+const COOKIE = 't3_session';
+const childEnv = { ...process.env, PATH: '/data/npm-global/bin:' + (process.env.PATH || ''), HOME: '/home/node' };
+
+const hasSession = (req) =>
+  (req.headers.cookie || '').split(/;\s*/).some((c) => c.startsWith(COOKIE + '='));
+
+const isDocNav = (req) => {
+  if (req.method !== 'GET') return false;
+  const dest = req.headers['sec-fetch-dest'];
+  if (dest) return dest === 'document';
+  return (req.headers['accept'] || '').includes('text/html');
+};
+
+const mintCredential = () =>
+  new Promise((resolve, reject) => {
+    execFile(
+      T3_BIN,
+      ['auth', 'pairing', 'create', '--base-dir', BASE_DIR, '--ttl', '5m', '--json'],
+      { env: childEnv, timeout: 15000 },
+      (err, stdout) => {
+        if (err) return reject(err);
+        try {
+          const cred = JSON.parse(stdout).credential;
+          cred ? resolve(cred) : reject(new Error('no credential in pairing output'));
+        } catch (e) {
+          reject(e);
+        }
+      },
+    );
+  });
+
+const exchange = (credential) =>
+  new Promise((resolve, reject) => {
+    const body = JSON.stringify({ credential });
+    const r = http.request(
+      {
+        host: UPSTREAM_HOST,
+        port: UPSTREAM_PORT,
+        path: '/api/auth/browser-session',
+        method: 'POST',
+        headers: { 'content-type': 'application/json', 'content-length': Buffer.byteLength(body) },
+      },
+      (resp) => {
+        const setCookie = resp.headers['set-cookie'] || [];
+        resp.resume();
+        resp.on('end', () =>
+          resp.statusCode === 200 && setCookie.length
+            ? resolve(setCookie)
+            : reject(new Error('browser-session exchange returned ' + resp.statusCode)),
+        );
+      },
+    );
+    r.on('error', reject);
+    r.write(body);
+    r.end();
+  });
+
+const proxyHttp = (req, res) => {
+  const up = http.request(
+    { host: UPSTREAM_HOST, port: UPSTREAM_PORT, path: req.url, method: req.method, headers: req.headers },
+    (r) => {
+      res.writeHead(r.statusCode, r.headers);
+      r.pipe(res);
+    },
+  );
+  up.on('error', () => {
+    if (!res.headersSent) res.writeHead(502);
+    res.end('bad gateway');
+  });
+  req.pipe(up);
+};
+
+const server = http.createServer(async (req, res) => {
+  if (req.url === '/healthz') {
+    res.writeHead(200);
+    return res.end('ok');
+  }
+  if (!hasSession(req) && isDocNav(req)) {
+    try {
+      const cred = await mintCredential();
+      const setCookie = await exchange(cred);
+      res.writeHead(302, { location: req.url || '/', 'set-cookie': setCookie, 'cache-control': 'no-store' });
+      return res.end();
+    } catch (err) {
+      // Fall through to a plain proxy; the cockpit's own /pair screen is the
+      // fallback if auto-pair ever fails, so we never hard-fail the request.
+      console.error('auto-pair failed, proxying through:', err.message);
+    }
+  }
+  proxyHttp(req, res);
+});
+
+// WebSocket / Upgrade passthrough — the cockpit's live orchestration stream and
+// terminals need this. Reconstruct the upgrade request and splice the sockets.
+server.on('upgrade', (req, socket, head) => {
+  const up = net.connect(UPSTREAM_PORT, UPSTREAM_HOST, () => {
+    up.write(
+      `${req.method} ${req.url} HTTP/1.1\r\n` +
+        Object.entries(req.headers)
+          .map(([k, v]) => `${k}: ${v}`)
+          .join('\r\n') +
+        '\r\n\r\n',
+    );
+    if (head && head.length) up.write(head);
+    socket.pipe(up);
+    up.pipe(socket);
+  });
+  up.on('error', () => socket.destroy());
+  socket.on('error', () => up.destroy());
+});
+
+server.listen(LISTEN_PORT, '0.0.0.0', () =>
+  console.log(`t3-afk dispatcher listening on :${LISTEN_PORT} -> ${UPSTREAM_HOST}:${UPSTREAM_PORT}`),
+);
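Review note: the auto-pair branch in `dispatcher.js` fires only for a cookieless document navigation. The sketch below restates that decision as pure functions so it can be exercised without starting a server. `shouldAutoPair` is a hypothetical helper name that does not exist in the file, and the header objects are made-up inputs; the two predicates are adapted from the code above to take plain header maps.

```javascript
'use strict';

const COOKIE = 't3_session';

// True when the Cookie header already carries a t3_session cookie.
const hasSession = (headers) =>
  (headers.cookie || '').split(/;\s*/).some((c) => c.startsWith(COOKIE + '='));

// True for top-level page loads: GET with Sec-Fetch-Dest: document, or,
// when the browser sends no Sec-Fetch-Dest, an Accept header with text/html.
const isDocNav = (method, headers) => {
  if (method !== 'GET') return false;
  const dest = headers['sec-fetch-dest'];
  if (dest) return dest === 'document';
  return (headers.accept || '').includes('text/html');
};

// Hypothetical helper (not in dispatcher.js): the combined auto-pair condition.
const shouldAutoPair = (method, headers) => !hasSession(headers) && isDocNav(method, headers);

console.log(shouldAutoPair('GET', { 'sec-fetch-dest': 'document' })); // true
console.log(shouldAutoPair('GET', { 'sec-fetch-dest': 'document', cookie: 'a=1; t3_session=abc' })); // false
console.log(shouldAutoPair('GET', { 'sec-fetch-dest': 'script', accept: 'text/html' })); // false
console.log(shouldAutoPair('POST', { accept: 'text/html' })); // false
```

Because `Sec-Fetch-Dest` takes precedence when present, sub-resource fetches that happen to accept `text/html` are proxied straight through rather than triggering a pairing.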
--- a/stacks/t3-afk/files/issue-implementer-CLAUDE.md
+++ b/stacks/t3-afk/files/issue-implementer-CLAUDE.md
@ -0,0 +1,59 @@
+# issue-implementer — autonomous AFK coding agent
+
+You are **issue-implementer**, an autonomous agent that implements ONE GitHub
+issue end-to-end and lands it, with no human at the keyboard. This file is your
+standing behaviour; the specific task arrives as your prompt. You run inside a
+T3 Code thread in `full-access` mode (skip-permissions) — there is no one to
+answer questions mid-run.
+
+## Autonomy — non-negotiable (you will hang otherwise)
+
+- **Never enter plan mode and never call `ExitPlanMode`.** It is intercepted and
+  will stall this thread forever.
+- **Never ask clarifying questions / never call `AskUserQuestion`.** No human is
+  watching. Make the most reasonable assumption, state it in the commit body or
+  your final message, and proceed.
+- If you hit something you genuinely cannot resolve safely, **stop and write a
+  precise blocker report as your final message** (what you tried, what's
+  unresolved, what you'd need). Do not thrash. The orchestrator escalates it to a
+  human — that is the only "ask for help" channel you have.
+
+## What to do
+
+1. **Understand the task.** Your prompt contains the issue (number, what to
+   build, acceptance criteria). Read the issue's AGENT-BRIEF if present.
+2. **Work in the prepared worktree.** You are already in a git worktree on a
+   branch off `master`. Read the repo's own `CLAUDE.md`, `CONTEXT.md`, and any
+   `docs/adr/` in the area you touch — use its domain vocabulary and respect its
+   decisions.
+3. **Test-first (TDD).** Write a failing test that captures the desired
+   behaviour, make it pass, then refactor. Prefer property/parameterized tests.
+   Run the repo's actual test suite and get it green before you commit. Do not
+   test implementation details — test external behaviour.
+4. **Commit.** Subject = what changed; body = why, paraphrasing the issue in
+   plain words. Include `Closes #<issue-number>` and the trailer
+   `Implemented-by: issue-implementer (AFK)`. Stage files by name — never
+   `git add -A`/`.`. Never skip hooks.
+5. **Land it.** Push your branch to `master` (`git push origin HEAD:master`). If
+   the push is rejected non-fast-forward, fetch, merge `origin/master`, re-run
+   the tests, and push again. Pushing to `master` is the intended behaviour —
+   CI builds and deploys from there.
+6. **Report.** Your final message is a concise summary: what you built, the
+   commit, and anything a reviewer should know. (CI/deploy watching and any
+   fix-forward/freeze handling are done by the control plane, not by you — once
+   you've pushed green code, your job is done.)
+
+## Guardrails (hard limits)
+
+- **Never force-push** to `master`.
+- **Never delete PVCs/PVs**, drop database tables, or run destructive data ops.
+- **Never edit Vault directly**, and never commit secrets.
+- **Infrastructure changes go through Terraform/Terragrunt only** — never
+  `kubectl apply/edit/patch` as the final state.
+- **Never use `[ci skip]`** — it hides the change from the audit feed.
+- Stay within the issue's scope. Don't refactor adjacent code beyond what the
+  task needs.
+
+## Done means
+
+Tests green **and** pushed to `master`. Not "code written" — landed.
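Review note: step 5 of this file ("Land it") describes a push, sync, re-test, push-again loop. A minimal sketch of that loop follows, under stated assumptions: `run` is a hypothetical injected command runner that returns `true` on exit code 0, `testCmd` stands in for whatever the repo's real test command is, and the `maxAttempts` bound is an addition of this sketch (the file does not specify one).

```javascript
'use strict';

// run(cmd, args) is a hypothetical injected runner returning true on exit 0.
// testCmd is a placeholder for the repo's actual test command.
const landOnMaster = (run, testCmd, maxAttempts = 3) => {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    // Plain push only; never --force.
    if (run('git', ['push', 'origin', 'HEAD:master'])) return { landed: true, attempts: attempt };
    if (attempt === maxAttempts) break;
    // Assume the rejection was non-fast-forward: sync, re-test, then retry.
    const synced =
      run('git', ['fetch', 'origin']) &&
      run('git', ['merge', 'origin/master']) &&
      run(testCmd[0], testCmd.slice(1));
    // A failed fetch, merge, or test run is the point to stop and write a blocker report.
    if (!synced) return { landed: false, attempts: attempt };
  }
  return { landed: false, attempts: maxAttempts };
};

// Dry run with a fake runner: the first push is rejected, everything else succeeds.
let pushes = 0;
const fakeRun = (cmd, args) => !(cmd === 'git' && args[0] === 'push' && ++pushes === 1);
console.log(landOnMaster(fakeRun, ['npm', 'test'])); // { landed: true, attempts: 2 }
```

Injecting the runner keeps the sketch testable without a real repository; a real runner might wrap `child_process.spawnSync` and check `status === 0`.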
--- a/stacks/t3-afk/main.tf
+++ b/stacks/t3-afk/main.tf
@ -0,0 +1,408 @@
+# =============================================================================
+# t3-afk — dedicated, in-cluster T3 Code instance: the EXECUTOR + COCKPIT for the
+# AFK implementation pipeline (slice #2 of claude-agent-service PRD #1).
+#
+# claude-agent-service (control plane) dispatches issues INTO this T3 instance
+# over its orchestration HTTP API; T3 runs the issue-implementer agent in a git
+# worktree and shows every worker in its cockpit. See:
+#   claude-agent-service/docs/2026-06-14-afk-implementation-pipeline-design.md
+#   claude-agent-service/docs/adr/0003-t3-thin-executor-and-cockpit.md
+#
+# PILOT SHORTCUT (chosen 2026-06-14): no custom-built image. We run stock
+# `node:24` (the full image ships git + python3/make/g++ for node-pty) and an
+# init container installs npm packages (t3@0.0.27, pinned; the Claude CLI at
+# `latest`) onto the SSD PVC, cached across restarts. Formalize a digest-pinned
+# built image post-GO. T3 is version-pinned (npm) and NOT Keel-enrolled.
+# =============================================================================
+
+# No plan-time Vault reads — every secret flows through the ExternalSecret below
+# (CLAUDE_CODE_OAUTH_TOKEN / GITHUB_TOKEN / FORGEJO_TOKEN), injected as env at
+# runtime. Nothing here needs a secret value at plan time.
+
+# Wildcard TLS secret name — value comes from config.tfvars; consumed by the
+# ingress factory (every stack that uses the factory declares this).
+variable "tls_secret_name" {}
+
+locals {
+  namespace = "t3-afk"
+  # Stock node base — the FULL node:24 (not -slim) is buildpack-deps-based, so it
+  # ships git + build-essential (python3/make/g++) that node-pty + the agent need.
+  # Fully-qualified (docker.io/library/...) to satisfy the Kyverno
+  # require-trusted-registries allowlist via `docker.io/*` — bare `node*` is NOT
+  # on the bare-DockerHub-library list (alpine*/busybox*/python* are).
+  image = "docker.io/library/node:24"
+  # npm versions installed at startup. t3 is pinned (the reproducibility anchor
+  # until a digest-pinned image exists); the Claude CLI floats on `latest`.
+  t3_version         = "0.0.27"
+  claude_cli_version = "latest" # @anthropic-ai/claude-code
+  labels = {
+    app = "t3-afk"
+  }
+}
+
+# --- Namespace ---
+
+resource "kubernetes_namespace" "t3_afk" {
+  metadata {
+    name = local.namespace
+    labels = {
+      tier = local.tiers.aux
+    }
+  }
+}
+
+# --- Secrets ---
+# The Claude provider authenticates with CLAUDE_CODE_OAUTH_TOKEN (T3 passes the
+# environment straight through to the embedded claude-agent-sdk + claude CLI).
+# GITHUB_TOKEN / FORGEJO_TOKEN authenticate the agent's `git push` from worktrees
+# (wired into ~/.gitconfig insteadOf rewrites in the container command).
+
+resource "kubernetes_manifest" "external_secret" {
+  manifest = {
+    apiVersion = "external-secrets.io/v1beta1"
+    kind       = "ExternalSecret"
+    metadata = {
+      name      = "t3-afk-secrets"
+      namespace = local.namespace
+    }
+    spec = {
+      refreshInterval = "15m"
+      secretStoreRef = {
+        name = "vault-kv"
+        kind = "ClusterSecretStore"
+      }
+      target = { name = "t3-afk-secrets" }
+      data = [
+        {
+          secretKey = "CLAUDE_CODE_OAUTH_TOKEN"
+          remoteRef = { key = "claude-agent-service", property = "claude_oauth_token" }
+        },
+        {
+          secretKey = "GITHUB_TOKEN"
+          remoteRef = { key = "viktor", property = "github_pat" }
+        },
+        {
+          # Shared viktor-scoped admin PAT (also used by Woodpecker + the
+          # claude-agent pod). Lets the agent git push / open PRs on Forgejo.
+          secretKey = "FORGEJO_TOKEN"
+          remoteRef = { key = "ci/global", property = "forgejo_push_token" }
+        },
+      ]
+    }
+  }
+  depends_on = [kubernetes_namespace.t3_afk]
+}
+
+# issue-implementer behaviour is intentionally NOT mounted as ~/.claude/CLAUDE.md:
+# T3's SDK invocation doesn't honor it, and mounting a subPath into ~/.claude
+# makes that dir root-owned and breaks the agent's Bash session-env. The control
+# plane injects the behaviour as a dispatch message preamble instead;
+# files/issue-implementer-CLAUDE.md is kept as the canonical source for that text.
+
+# Auto-pair dispatcher script (run by the sidecar container below). Mirrors the
+# devvm t3-dispatch: on a cookieless, Authentik-gated page load it mints a
+# pairing credential and exchanges it for the t3_session cookie, so the user
+# never sees the manual /pair screen. Reverse-proxies everything else (incl.
+# WebSockets) to t3 serve.
+resource "kubernetes_config_map" "dispatcher" {
+  metadata {
+    name      = "t3-afk-dispatcher"
+    namespace = kubernetes_namespace.t3_afk.metadata[0].name
+  }
+  data = {
+    "dispatcher.js" = file("${path.module}/files/dispatcher.js")
+  }
+}
+
+# --- Storage ---
+# SSD-NFS (small-file friendly) for the T3 base dir: state.sqlite + the
+# server-signing-key (losing it invalidates every issued bearer), per-thread git
+# worktrees, the npm global install, and caches. ADR 0004.
+module "data" {
+  source     = "../../modules/kubernetes/nfs_volume"
+  name       = "t3-afk-data"
+  namespace  = kubernetes_namespace.t3_afk.metadata[0].name
+  nfs_server = "192.168.1.127"
+  nfs_path   = "/srv/nfs-ssd/t3-afk-data"
+  storage    = "30Gi"
+}
+
+# --- Deployment ---
+
+resource "kubernetes_deployment" "t3_afk" {
+  # Slow first start (image pull + npm install init + ESO secret sync) can
+  # exceed the default rollout-wait timeout; verify pod readiness out-of-band.
+  wait_for_rollout = false
+
+  metadata {
+    name      = "t3-afk"
+    namespace = kubernetes_namespace.t3_afk.metadata[0].name
+    labels    = local.labels
+    # keel.sh/policy=never must be a DEPLOYMENT-level annotation — that's where
+    # Keel reads it. (A pod-template label is ignored by Keel, which is why the
+    # earlier attempt failed.) The cluster's Kyverno inject-keel-annotations
+    # policy is opt-OUT: it stamps policy=patch on any workload that doesn't
+    # carry its own keel.sh/policy — and Keel then "patch"-downgraded
+    # node:24 -> node:24.0.2 (below t3@0.0.27's required node >=24.10), which
+    # crash-looped `t3 serve`. ADR 0003 (Keel-excluded).
+    annotations = {
+      "keel.sh/policy" = "never"
+    }
+  }
+
+  spec {
+    replicas = 1
+    # Single-writer state.sqlite — never run two pods against the same base dir.
+    strategy {
+      type = "Recreate"
+    }
+
+    selector {
+      match_labels = local.labels
+    }
+
+    template {
+      metadata {
+        labels = local.labels
+      }
+
+      spec {
+        security_context {
+          run_as_user  = 1000 # node
+          run_as_group = 1000
+          fs_group     = 1000
+        }
+
+        # NFS mounts land root-owned; make /data writable by uid 1000.
+        init_container {
+          name    = "fix-perms"
+          image   = "busybox:1.37"
+          command = ["sh", "-c", "mkdir -p /data && chown -R 1000:1000 /data && chmod 0775 /data"]
+          security_context {
+            run_as_user = 0
+          }
+          volume_mount {
+            name       = "data"
+            mount_path = "/data"
+          }
+          resources {
+            requests = { memory = "32Mi" }
+            limits   = { memory = "64Mi" }
+          }
+        }
+
+        # Install pinned t3 + Claude CLI onto the PVC (cached; skipped if already
+        # present). Runs as uid 1000 so the install is owned by the runtime user.
+        init_container {
+          name  = "install-t3"
+          image = local.image
+          command = ["bash", "-c", <<-EOF
+            set -e
+            export npm_config_cache=/data/npm-cache
+            export npm_config_prefix=/data/npm-global
+            mkdir -p /data/npm-global /data/npm-cache
+            if [ ! -x /data/npm-global/bin/t3 ]; then
+              echo "installing t3@${local.t3_version} + claude CLI ..."
+              npm install -g "t3@${local.t3_version}" "@anthropic-ai/claude-code@${local.claude_cli_version}"
+            else
+              echo "t3 already installed: $(/data/npm-global/bin/t3 --version 2>/dev/null || echo unknown)"
+            fi
+          EOF
+          ]
+          volume_mount {
+            name       = "data"
+            mount_path = "/data"
+          }
+          resources {
+            requests = { cpu = "200m", memory = "512Mi" }
+            limits   = { memory = "1Gi" }
+          }
+        }
+
+        container {
+          name  = "t3"
+          image = local.image
+
+          # Configure git auth for the agent's pushes, then run T3 headless.
+          # $$ escapes Terraform interpolation so the shell expands the env vars.
+          command = ["bash", "-c", <<-EOF
+            set -e
+            export PATH=/data/npm-global/bin:$$PATH
+            export npm_config_cache=/data/npm-cache
+
+            # git identity + token rewrites so the agent can push from worktrees.
+            git config --global user.name "issue-implementer (AFK)"
+            git config --global user.email "afk-agent@viktorbarzin.me"
+            git config --global url."https://$${GITHUB_TOKEN}@github.com/".insteadOf "https://github.com/"
+            git config --global --add url."https://$${GITHUB_TOKEN}@github.com/".insteadOf "git@github.com:"
+            if [ -n "$${FORGEJO_TOKEN}" ]; then
+              git config --global url."https://$${FORGEJO_TOKEN}@forgejo.viktorbarzin.me/".insteadOf "https://forgejo.viktorbarzin.me/"
+            fi
+
+            exec t3 serve --mode web --host 0.0.0.0 --port 3773 --base-dir /data/t3
+          EOF
+          ]
+
+          port {
+            container_port = 3773
+          }
+
+          env_from {
+            secret_ref {
+              name = "t3-afk-secrets"
+            }
+          }
+
+          env {
+            name  = "HOME"
+            value = "/home/node"
+          }
+          env {
+            name  = "T3CODE_HOME"
+            value = "/data/t3"
+          }
+
+          # T3's API needs auth even for liveness; use a TCP probe on the port.
+          liveness_probe {
+            tcp_socket {
+              port = 3773
+            }
+            initial_delay_seconds = 30
+            period_seconds        = 30
+          }
+          readiness_probe {
+            tcp_socket {
+              port = 3773
+            }
+            initial_delay_seconds = 15
+            period_seconds        = 10
+          }
+
+          volume_mount {
+            name       = "data"
+            mount_path = "/data"
+          }
+          # NOTE: do NOT mount anything into /home/node/.claude — a subPath
+          # mount makes that dir root-owned, which blocks the agent (uid 1000)
+          # from creating its Bash session-env there and breaks ALL Bash/git for
+          # the agent (root cause of the 2026-06-15 "agent never commits"). T3's
+          # SDK invocation doesn't honor ~/.claude/CLAUDE.md anyway, so the
+          # issue-implementer behaviour is injected via the dispatch message
+          # preamble by the control plane instead.
+
+          # Burstable (tier-aux). A live agent thread (node + claude) is memory
+          # heavy; size for a small number of concurrent threads on this pilot
+          # instance. No CPU limit per cluster policy.
+          resources {
+            requests = {
+              cpu    = "1"
+              memory = "2Gi"
+            }
+            # Capped at the tier-aux LimitRange max (4Gi/container). If real
+            # workloads OOM, opt the namespace out via the
+            # resource-governance/custom-limitrange label (as claude-agent-service
+            # does) and raise this.
+            limits = {
+              memory = "4Gi"
+            }
+          }
+        }
+
+        # Auto-pair dispatcher (sidecar). The Service points at this (:8080); it
+        # reverse-proxies to t3 serve (:3773) and injects the session cookie so
+        # the browser experience matches t3.viktorbarzin.me. Shares /data so it
+        # can exec the t3 CLI to mint pairing credentials.
+        container {
+          name    = "dispatcher"
+          image   = local.image
+          command = ["node", "/scripts/dispatcher.js"]
+          port {
+            container_port = 8080
+          }
+          env {
+            name  = "HOME"
+            value = "/home/node"
+          }
+          readiness_probe {
+            http_get {
+              path = "/healthz"
+              port = 8080
+            }
+            initial_delay_seconds = 10
+            period_seconds        = 10
+          }
+          volume_mount {
+            name       = "data"
+            mount_path = "/data"
+          }
+          volume_mount {
+            name       = "dispatcher"
+            mount_path = "/scripts"
+          }
+          resources {
+            requests = { cpu = "50m", memory = "64Mi" }
+            limits   = { memory = "256Mi" }
+          }
+        }
+
+        volume {
+          name = "data"
+          persistent_volume_claim {
+            claim_name = module.data.claim_name
+          }
+        }
+
+        volume {
+          name = "dispatcher"
+          config_map {
+            name = kubernetes_config_map.dispatcher.metadata[0].name
+          }
+        }
+      }
+    }
+  }
+
+  lifecycle {
+    ignore_changes = [
+      spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
+      # Kyverno's inject-keel-annotations stamps pollSchedule/trigger alongside
+      # the policy; we own keel.sh/policy=never above, but ignore these two so
+      # they don't perpetually drift the plan.
+      metadata[0].annotations["keel.sh/pollSchedule"],
+      metadata[0].annotations["keel.sh/trigger"],
+    ]
+  }
+}
+
+# --- Service ---
+
+resource "kubernetes_service" "t3_afk" {
+  metadata {
+    name      = "t3-afk"
+    namespace = kubernetes_namespace.t3_afk.metadata[0].name
+    labels    = local.labels
+  }
+  spec {
+    selector = local.labels
+    # Route to the auto-pair dispatcher sidecar (:8080), which reverse-proxies
+    # to t3 serve (:3773) after injecting the t3_session cookie.
+    port {
+      port        = 3773
+      target_port = 8080
+    }
+    type = "ClusterIP"
+  }
+}
+
+# --- Ingress ---
+# The cockpit has no built-in user auth, so Authentik forward-auth is the gate.
+module "ingress" {
+  source          = "../../modules/kubernetes/ingress_factory"
+  auth            = "required"
+  dns_type        = "proxied"
+  namespace       = kubernetes_namespace.t3_afk.metadata[0].name
+  name            = "t3-afk"
+  service_name    = kubernetes_service.t3_afk.metadata[0].name
+  port            = 3773
+  tls_secret_name = var.tls_secret_name
+}
--- a/stacks/t3-afk/terragrunt.hcl
+++ b/stacks/t3-afk/terragrunt.hcl
@ -0,0 +1,18 @@
+include "root" {
+  path = find_in_parent_folders()
+}
+
+dependency "platform" {
+  config_path  = "../platform"
+  skip_outputs = true
+}
+
+dependency "vault" {
+  config_path  = "../vault"
+  skip_outputs = true
+}
+
+dependency "external-secrets" {
+  config_path  = "../external-secrets"
+  skip_outputs = true
+}
--- a/stacks/terminal/main.tf
+++ b/stacks/terminal/main.tf
@ -225,8 +225,11 @@ module "ingress_ro" {
 #   https://forgejo.viktorbarzin.me/viktor/terminal-lobby
 #
 # That repo's ./scripts/deploy.sh ships everything to wizard@10.0.10.10
-# and restarts ttyd / ttyd-ro / tmux-api / clipboard-upload. This stack
-# only owns the Kubernetes side: Services, Endpoints pointing at
+# and restarts ttyd / ttyd-ro / tmux-api / clipboard-upload. Deploy is
+# MANUAL via that script — there is no CI pipeline (the lobby's
+# .woodpecker.yml was removed under ADR-0002, issue #31; it builds no
+# image, so it is not part of the GHA->ghcr fleet). This stack only owns
+# the Kubernetes side: Services, Endpoints pointing at
 # 10.0.10.10:{7681,7682,7683,7684}, the IngressRoutes, and the Traefik
 # middlewares that gate everything behind Authentik forward-auth.
 #
--- a/stacks/traefik/modules/traefik/middleware.tf
+++ b/stacks/traefik/modules/traefik/middleware.tf
@ -344,6 +344,31 @@ resource "kubernetes_manifest" "middleware_tripit_rate_limit" {
  depends_on = [helm_release.traefik]
 }

+# Health-specific rate limit. The redesigned, data-dense SPA loads the shell
+# (JS chunks + two self-hosted Geist woff2) plus a 5-8 call API burst per page,
+# and fast tab-to-tab navigation from one client IP blows past the default
+# 10/50 limiter — 429ing the tail so cards/pages render empty (fifth instance
+# of the burst pattern, after ha-sofia, ActualBudget, noVNC and tripit). Burst
+# absorbs a couple of full page loads back-to-back.
+resource "kubernetes_manifest" "middleware_health_rate_limit" {
+  manifest = {
+    apiVersion = "traefik.io/v1alpha1"
+    kind       = "Middleware"
+    metadata = {
+      name      = "health-rate-limit"
+      namespace = kubernetes_namespace.traefik.metadata[0].name
+    }
+    spec = {
+      rateLimit = {
+        average = 100
+        burst   = 1000
+      }
+    }
+  }
+
+  depends_on = [helm_release.traefik]
+}
+
 # Compress responses to clients at the entrypoint level (outermost).
 # Applied at websecure entrypoint so all responses get compressed.
 # Uses includedContentTypes (whitelist) instead of excludedContentTypes:
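Review note: the `health-rate-limit` middleware added in this hunk raises `average`/`burst` from the default 10/50 to 100/1000. The sketch below models a generic token bucket (capacity = `burst`, refill = `average` tokens per second) to show why a larger bucket absorbs a page-load burst. It illustrates the general technique, not Traefik's exact implementation, and the figure of 60 simultaneous requests is invented for the example.

```javascript
'use strict';

// Simplified token bucket: starts full at `burst` tokens and refills at
// `average` tokens per second, never exceeding `burst`.
class TokenBucket {
  constructor(average, burst) {
    this.average = average;
    this.burst = burst;
    this.tokens = burst;
    this.last = 0;
  }

  // Returns true if a request arriving at nowSeconds is admitted.
  allow(nowSeconds) {
    const elapsed = nowSeconds - this.last;
    this.tokens = Math.min(this.burst, this.tokens + elapsed * this.average);
    this.last = nowSeconds;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Counts how many of `requests` arriving at the same instant would be rejected (429).
const countRejected = (bucket, requests, atSeconds) => {
  let rejected = 0;
  for (let i = 0; i < requests; i += 1) {
    if (!bucket.allow(atSeconds)) rejected += 1;
  }
  return rejected;
};

// 60 simultaneous requests is a made-up figure for illustration only.
console.log(countRejected(new TokenBucket(10, 50), 60, 0)); // 10
console.log(countRejected(new TokenBucket(100, 1000), 60, 0)); // 0
```

With the default-sized bucket the tail of the burst is rejected, which matches the "429ing the tail" symptom described in the comment; the larger bucket admits the whole burst.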
--- a/stacks/tripit/main.tf
+++ b/stacks/tripit/main.tf
@ -125,6 +125,11 @@ locals {
    # (older images crash-loop on the unknown enum) — landed after that
    # image rolled out, same hold-order as FARE/CALENDAR/RESEARCH above.
    CITY_IMAGE_PROVIDER = "wikipedia"
+    # Re-applied 2026-06-14: a69847a0 (the commit that added this) was never
+    # terraform-applied — its CI run skipped the tripit stack (changed-stack
+    # diff race), so the env never landed in-cluster and the provider fell back
+    # to the fake 1x1-PNG, leaving every trip/stay cover blank. This touch forces
+    # the tripit stack to re-apply and reconcile the drift.
    # Tour-guide content pipeline (tripit#24/#25): these three default to `fake`
    # in tripit's config, which is what shipped dark on 2026-06-08 — prod only
    # ever showed the placeholder "Sight 1". Real providers: Wikipedia GeoSearch
--- a/stacks/tuya-bridge/variables.tf
+++ b/stacks/tuya-bridge/variables.tf
@ -6,5 +6,5 @@ variable "tls_secret_name" {
 variable "image_tag" {
  type        = string
  default     = "latest"
-  description = "tuya_bridge image tag pushed to forgejo.viktorbarzin.me/viktor/tuya_bridge. Each Woodpecker run does `kubectl set image` to the 8-char git SHA; this variable is only used on initial create / TF recreate (image is in lifecycle.ignore_changes)."
+  description = "tuya_bridge image tag at ghcr.io/viktorbarzin/tuya_bridge (built by GHA, ADR-0002). The GHA deploy job drives a Woodpecker `kubectl set image` to the 8-char git SHA; this variable is only used on initial create / TF recreate (image is in lifecycle.ignore_changes)."
 }
--- a/stacks/uptime-kuma/CONTEXT.md
+++ b/stacks/uptime-kuma/CONTEXT.md
@ -0,0 +1,29 @@
+# Uptime Kuma — Context
+
+Glossary for the uptime-kuma monitoring context. Terms only — no implementation
+detail. Decisions live in `docs/adr/`.
+
+## Glossary
+
+**Active check (poll)** — Uptime Kuma actively probes a target on an interval
+(HTTP / TCP / ping / DB). This is *polling*, not "scraping." Prometheus *scrapes*
+exporters; Kuma *polls* targets. (Note: Prometheus does **not** scrape Kuma; the
+two are separate monitoring lanes.)
+
+**Monitor** — one configured target plus its check definition.
+
+**Internal monitor** — probes a service on its in-cluster address
+(`*.svc.cluster.local`). Answers "is the service itself healthy?"
+
+**`[External]` monitor** — probes a service via its full public path
+(DNS → Cloudflare → cloudflared tunnel → Traefik). Answers "is the service
+reachable the way users reach it?" Maintained one-per-externally-reachable-service
+by deliberate choice (see ADR-0001).
+
+**Heartbeat** — one recorded check result (up/down + latency), persisted to the
+datastore.
+
+**External-access divergence** — the condition where a service is healthy
+*internally* but its `[External]` path is down — i.e. the shared
+Cloudflare/tunnel/Traefik path is broken while the service itself is fine.
+Surfaced by the `ExternalAccessDivergence` alert.
--- a/stacks/uptime-kuma/docs/adr/0001-uptime-kuma-sizing-and-placement.md
+++ b/stacks/uptime-kuma/docs/adr/0001-uptime-kuma-sizing-and-placement.md
@ -0,0 +1,45 @@
+# ADR-0001: Uptime Kuma is intentionally lean — sizing & placement
+
+## Status
+Accepted (2026-06-13)
+
+## Context
+A review was prompted by a suspicion that Kuma was "scraping too much / causing
+unnecessary traffic," itself triggered by a socket.io login-timeout incident on
+the monitor-sync CronJobs. Measured state at review time:
+
+- **227 active monitors**; 209 of them at 300s intervals; **~1 check/sec** aggregate.
+- Datastore: the **shared `mysql.dbaas`** (MariaDB), **~77 MB**, ~1 heartbeat
+  write/sec, 30-day retention.
+- **122 `[External]` monitors** (full public path) + ~105 internal.
+
+The data did **not** support a load problem — Kuma is already lean. The
+login-timeout incident was a Kuma 2.x socket.io quirk (Kuma's single Node event
+loop briefly stalling), fixed separately by wrapping login in a retry — not a
+load issue.
+
+## Decisions
+1. **Keep Kuma as-is; do not reflexively cut monitors or intervals.** Poll rate
+   (~1/s) and DB footprint (77 MB) are modest.
+2. **`[External]` monitors stay per-service** (one per externally-reachable
+   service), **not** a small canary set. Rejected cutting to ~6-10 canaries:
+   although the Cloudflare → tunnel → Traefik path is shared infra that fails as a
+   unit, per-service external probes also catch *single-service* external
+   misconfig (one service's DNS / auth carve-out / route), which canaries miss.
+   The ~35k Cloudflare requests/day this generates are accepted as the cost of that coverage.
+3. **Datastore stays on the shared `mysql.dbaas`.** Rejected moving to
+   self-contained SQLite or a dedicated DB. The coupling — Kuma depends on the
+   single-instance MySQL it also helps monitor, including during that MySQL's
+   8.4.9 wipe-maintenance (bead code-963q) — is acknowledged but accepted as
+   low-impact for now.
+
+## Consequences
+- All three decisions are **cheap to reverse**; revisit if measured load on
+  `mysql.dbaas` or Cloudflare ever becomes a real (not gut-feel) problem. This
+  ADR exists mainly so that review isn't re-run from scratch.
+- **Known gap:** the *internal* monitor-sync creates/updates monitors but does
+  **not** prune orphans (the external sync does). Internal monitors for deleted
+  services linger and need periodic manual cleanup — e.g. the stale
+  "Goldilocks (VPA)" monitor (target removed with VPA on 2026-06-12) was deleted
+  by hand on 2026-06-13. A *scoped* internal-prune (only deleting monitors the
+  sync owns, never hand-made ones) is a possible future improvement.
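Review note: the headline figures in the ADR's Context section can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, not part of the commit. It assumes every `[External]` monitor polls at the 300s interval (the ADR only states that 209 of the 227 monitors do) and that each check produces exactly one Cloudflare request. The function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def checks_per_second(monitors: int, interval_s: int) -> float:
    """Aggregate poll rate for `monitors` targets sharing one interval."""
    return monitors / interval_s


def requests_per_day(monitors: int, interval_s: int) -> int:
    """Checks issued per day, assuming one request per check (an assumption)."""
    return monitors * (SECONDS_PER_DAY // interval_s)


if __name__ == "__main__":
    # 209 monitors at 300s give a floor on the aggregate poll rate; the
    # remaining 18 monitors (intervals not stated in the ADR) add to it.
    print(f"300s monitors alone: {checks_per_second(209, 300):.2f} checks/sec")

    # 122 [External] monitors, assumed to all poll every 300s.
    print(f"external requests/day: {requests_per_day(122, 300)}")
```

Under those assumptions the external monitors issue 122 × 288 = 35,136 requests per day, which agrees with the ADR's "~35k". The 300s monitors alone account for roughly 0.7 checks/sec, so the "~1 check/sec" aggregate implies that the other 18 monitors poll more often.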
--- a/stacks/uptime-kuma/modules/uptime-kuma/main.tf
+++ b/stacks/uptime-kuma/modules/uptime-kuma/main.tf
@ -503,8 +503,27 @@ except (urllib.error.URLError, OSError, KeyError, ValueError) as e:

 print(f"Loaded {len(targets)} external monitor targets (source={source})")

-api = UptimeKumaApi(UPTIME_KUMA_URL, timeout=120, wait_events=0.2)
-api.login("admin", UPTIME_KUMA_PASS)
+api = None
+for _login_try in range(1, 6):
+    try:
+        api = UptimeKumaApi(UPTIME_KUMA_URL, timeout=120, wait_events=0.2)
+        api.login("admin", UPTIME_KUMA_PASS)
+        break
+    except Exception as _login_err:
+        # kuma 2.x's single Node event loop intermittently stalls under its
+        # ~300 monitors, so the socket.io login handshake times out. Retry a
+        # few times across a ~60s window to ride out the stall instead of
+        # failing the whole sync job (which fired JobFailed -> Slack noise).
+        print(f"WARN: Kuma login attempt {_login_try}/5 failed: {_login_err!r}")
+        if api is not None:
+            try:
+                api.disconnect()
+            except Exception:
+                pass
+            api = None
+        if _login_try == 5:
+            raise
+        time.sleep(15)

 monitors = api.get_monitors()
 existing_external = {}
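Review note: both sync scripts in this diff inline the same login-retry loop. Below is a minimal sketch of the same pattern as a reusable helper, using only the standard library. `retry_call` and its parameters are hypothetical names; they are not part of this commit or of the `uptime_kuma_api` package. The `cleanup` hook stands in for the best-effort `api.disconnect()` call, and `sleep` is injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_call(
    fn: Callable[[], T],
    attempts: int = 5,
    delay_s: float = 15.0,
    cleanup: Optional[Callable[[], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds; re-raise the last error once attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as err:
            print(f"WARN: attempt {attempt}/{attempts} failed: {err!r}")
            if cleanup is not None:
                try:
                    cleanup()  # best-effort, mirrors the disconnect in the diff
                except Exception:
                    pass
            if attempt == attempts:
                raise  # final failure propagates, so the job still fails loudly
            sleep(delay_s)
    raise AssertionError("unreachable")
```

With the defaults, a run that fails every time sleeps four times, which is the ~60s window described in the diff's comment (excluding the time each attempt spends before timing out). In the sync scripts, `fn` would construct the client, log in, and return it. Releasing a half-open client would then have to happen inside `fn`, because `cleanup` takes no arguments in this sketch.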
@ -818,8 +837,27 @@ UPTIME_KUMA_PASS = os.environ["UPTIME_KUMA_PASSWORD"]
 with open("/config/targets.json") as f:
    targets = json.load(f)

-api = UptimeKumaApi(UPTIME_KUMA_URL, timeout=120, wait_events=0.2)
-api.login("admin", UPTIME_KUMA_PASS)
+api = None
+for _login_try in range(1, 6):
+    try:
+        api = UptimeKumaApi(UPTIME_KUMA_URL, timeout=120, wait_events=0.2)
+        api.login("admin", UPTIME_KUMA_PASS)
+        break
+    except Exception as _login_err:
+        # kuma 2.x's single Node event loop intermittently stalls under its
+        # ~300 monitors, so the socket.io login handshake times out. Retry a
+        # few times across a ~60s window to ride out the stall instead of
+        # failing the whole sync job (which fired JobFailed -> Slack noise).
+        print(f"WARN: Kuma login attempt {_login_try}/5 failed: {_login_err!r}")
+        if api is not None:
+            try:
+                api.disconnect()
+            except Exception:
+                pass
+            api = None
+        if _login_try == 5:
+            raise
+        time.sleep(15)

 existing = {m["name"]: m for m in api.get_monitors()}