Merge remote-tracking branch 'forgejo/master' into emo/fan-control-ha-actuator
All checks were successful
ci/woodpecker/push/default Pipeline was successful

Emil Barzin 2026-06-16 08:08:27 +00:00
commit 5bc3d27d1b
42 changed files with 3072 additions and 387 deletions


@ -42,12 +42,13 @@
| webhook_handler | Webhook processing | webhook_handler |
| tuya-bridge | Smart home bridge | tuya-bridge |
| android-emulator | Shared Android 16 test emulator (adb 10.0.20.200:5555, noVNC android-emulator.viktorbarzin.lan) | android-emulator |
| anisette | Self-hosted Apple anisette-data server (Dadoum/anisette-v3-server, digest-pinned) for sideloading the TripIt iOS Shell via SideStore; internal-only http://anisette.viktorbarzin.lan, auth=none, LAN-only, stateless | anisette |
| dawarich | Location history | dawarich |
| owntracks | Location tracking | owntracks |
| nextcloud | File sync/share | nextcloud |
| calibre | E-book management (may be merged into ebooks stack) | calibre |
| onlyoffice | Document editing | onlyoffice |
| f1-stream | F1 streaming (uses chrome-service for hmembeds verifier); source in own repo `viktor/f1-stream` (Forgejo, extracted 2026-06-05), Woodpecker-native build->deploy (repo id 166) | f1-stream |
| f1-stream | F1 streaming (uses chrome-service for hmembeds verifier); canonical source in own repo `viktor/f1-stream` (Forgejo, extracted 2026-06-05); GHA-built → `ghcr.io/viktorbarzin/f1-stream` (private), Woodpecker deploy-only (ADR-0002) | f1-stream |
| chrome-service | Headed Chromium over CDP (`http://chrome-service.chrome-service.svc:9222`, `connect_over_cdp`; legacy `:3000/<token>` WS pool removed 2026-06-04) for sibling services driving anti-bot pages — snapshot-harvester CronJob + tripit fare scrape | chrome-service |
| rybbit | Analytics | rybbit |
| isponsorblocktv | SponsorBlock for TV | isponsorblocktv |

.github/workflows/build-k8s-portal.yml vendored Normal file

@ -0,0 +1,36 @@
name: Build k8s-portal
# ADR-0002 / no-local-builds: k8s-portal (infra-owned Go portal) builds off-infra
# on GHA → public ghcr; Keel polls ghcr:latest and rolls the deployment. Replaces
# the in-cluster .woodpecker/k8s-portal.yml build.
on:
  push:
    branches: [master]
    paths:
      - 'stacks/k8s-portal/modules/k8s-portal/files/**'
  workflow_dispatch: {}
permissions:
  contents: read
  packages: write
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: stacks/k8s-portal/modules/k8s-portal/files
          platforms: linux/amd64
          provenance: false
          push: true
          tags: |
            ghcr.io/viktorbarzin/k8s-portal:latest
            ghcr.io/viktorbarzin/k8s-portal:${{ github.sha }}


@ -3,10 +3,6 @@
"ha": {
"type": "http",
"url": "${HA_MCP_URL}"
},
"paperless": {
"type": "http",
"url": "http://paperless-mcp.paperless-mcp.svc.cluster.local/mcp"
}
}
}


@ -1,49 +0,0 @@
when:
  event: push
  branch: master
  path:
    include:
      - "stacks/platform/modules/k8s-portal/files/**"
clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
      attempts: 5
      backoff: 10s
steps:
  - name: build-and-push
    image: woodpeckerci/plugin-docker-buildx
    settings:
      username: "viktorbarzin"
      password:
        from_secret: dockerhub-pat
      repo: viktorbarzin/k8s-portal
      dockerfile: stacks/platform/modules/k8s-portal/files/Dockerfile
      context: stacks/platform/modules/k8s-portal/files
      platforms:
        - linux/amd64
      tag: ["${CI_PIPELINE_NUMBER}", "latest"]
      cache_from: "viktorbarzin/k8s-portal:latest"
      cache_to: "type=inline"
  - name: deploy
    image: bitnami/kubectl:latest
    commands:
      - "kubectl set image deployment/k8s-portal portal=viktorbarzin/k8s-portal:${CI_PIPELINE_NUMBER} -n k8s-portal"
      - "kubectl rollout status deployment/k8s-portal -n k8s-portal --timeout=120s"
      - "echo 'k8s-portal deployed successfully (build ${CI_PIPELINE_NUMBER})'"
  - name: slack
    image: curlimages/curl
    commands:
      - |
        curl -s -X POST -H 'Content-type: application/json' \
          --data "{\"text\":\"K8s Portal: build #${CI_PIPELINE_NUMBER} ${CI_PIPELINE_STATUS}\"}" \
          "$SLACK_WEBHOOK" || true
    environment:
      SLACK_WEBHOOK:
        from_secret: slack_webhook
    when:
      status: [success, failure]


@ -173,13 +173,17 @@ The split where every owned image is built+pushed by GitHub Actions and Woodpeck
_Avoid_: bare "Woodpecker pipeline" — say "build" or "deploy"; "fallback build" (the in-cluster fallback path was removed by ADR-0002).
**Canonical repo**:
The Forgejo `viktor/<name>` repo — the only place commits land, workflow files included.
_Avoid_: "upstream" (ambiguous); committing anywhere else.
The Forgejo `viktor/<name>` repo — the only place commits land, workflow files included. Every first-party repo is Forgejo-canonical *except* an explicit set of **GitHub-first repos**. A clone keeps **only** the canonical remote (ADR-0003): the **GitHub mirror** is not a second push target.
_Avoid_: "upstream" (ambiguous); committing anywhere else; keeping both remotes on a clone and hand-pushing to each (the dual-push habit that caused the 2026-06 divergence — ADR-0003).
**GitHub mirror**:
The GitHub repo a **Canonical repo** push-mirrors to, one-way, so GitHub Actions can build from it; anything committed on the mirror is silently overwritten by the next sync.
The GitHub repo a **Canonical repo** push-mirrors to, one-way (Forgejo's `push_mirrors`, `sync_on_commit`), so GitHub Actions can build from it; anything committed on the mirror is silently overwritten by the next sync — and enabling the mirror **force-overwrites** the GitHub side, so a diverged GitHub-only commit must be merged back into Forgejo *before* the mirror is turned on or it is lost.
_Avoid_: treating it as a second writable remote; bare "the GitHub repo" without saying mirror.
**GitHub-first repo**:
The deliberate exception to the **Canonical repo** rule — a repo whose canonical home is GitHub, so it sits outside the mirror policy. Two kinds: third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`), and a first-party repo intentionally kept public on GitHub (`health`). Single GitHub remote, never dual-pushed.
_Avoid_: adding a Forgejo remote "for consistency"; treating one as a **Canonical repo**.
**Forgejo registry**:
Forgejo's built-in container registry — since ADR-0002 a frozen archive holding one last-known-good tag per **Service**, not a build target; owned images live on ghcr.io.
_Avoid_: "private registry" (collides with the registry VM's pull-through caches); pushing new images to it.


@ -0,0 +1,30 @@
# Keep Forgejo as the canonical forge; complete the one-way GitHub mirror instead of swapping to GitHub
Status: accepted (extends ADR-0002)
## Context
Repo trees kept diverging between the Forgejo **Canonical repo** (`viktor/<name>`) and its **GitHub mirror**. A 2026-06-15 audit found the cause: an *incomplete rollout* of the Forgejo→GitHub push-mirror, not anything inherent to Forgejo. 14 repos carry **both** remotes and are hand-pushed to each (`push_mirrors = 0` on Forgejo — e.g. `infra`, `finance`, `Website`), so a human forgets one side and the trees drift; the ADR-0002-onboarded repos have a working one-way mirror (`push_mirrors = 1` — e.g. `tripit`, `recruiter-responder`) and never diverge. `infra/CONTEXT.md` already says Forgejo is the only place commits land and the GitHub mirror must never be a second writable remote — practice had simply drifted from the documented model.
The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling reframed it: the pain (divergence) is a "two writable remotes" problem, and the stated preference is self-hosted-primary with the remote as backup.
## Decision
Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`:
- Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.**
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`.
- `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge.
- Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror."
## Considered options
- **Swap to GitHub (retire Forgejo).** Rejected: takes on a hard WAN dependency for *all* git ops — including `infra`, the repo you use to *recover* from outages — plus git-crypt secrets on GitHub as primary, a Woodpecker forge migration (WP authenticates against and watches Forgejo), and GitHub private-repo CI-minute/size limits. All to fix a problem that is actually an incomplete mirror, not Forgejo's existence. Contradicts the self-hosted-primary preference.
- **GitHub canonical, Forgejo demoted to a DR pull-mirror.** Rejected for the same WAN-dependency and forge-migration cost; unnecessary once the real cause is understood.
## Consequences
- Divergence becomes structurally impossible — one push target per repo.
- Forgejo stays load-bearing (canonical git + the Woodpecker forge), so every cost of the swap is avoided.
- The GitHub-limits worry is neutralized: private code lives on Forgejo (unlimited, self-hosted); GitHub holds mirrors for CI + backup. (GitHub Free has unlimited private repos anyway; the real limits are GHA minutes and ~1 GB repo size — `travel_blog` at 1.4 GB is why it never went to GHA.)
- One-time remediation is required and carries a data-loss footgun: the Forgejo→GitHub mirror **force-overwrites** GitHub, so for each currently-diverged repo, any GitHub-only commits must be merged into Forgejo **before** the mirror is enabled, or they are lost. Scope: the 14 dual-push repos + the `infra` reconciliation; all other repos are already single-remote and non-diverging.
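
The pre-flight check implied by the last consequence can be sketched as follows. This is a minimal illustration, not part of the ADR: it assumes a clone that still has both remotes, named `forgejo` and `github` (the `github` remote name and the function names are hypothetical), and it relies on `git rev-list --left-right --count A...B`, which prints the number of commits unique to each side separated by whitespace.

```python
import subprocess


def parse_left_right_count(output: str) -> tuple[int, int]:
    """Parse the '<left>\t<right>' output of `git rev-list --left-right --count A...B`."""
    left, right = output.split()
    return int(left), int(right)


def mirror_enable_is_safe(output: str) -> bool:
    """Return True when the GitHub side holds no commits that Forgejo lacks.

    `output` is the rev-list output for `forgejo/<branch>...github/<branch>`:
    the left count is Forgejo-only commits, the right count is GitHub-only ones.
    """
    _forgejo_only, github_only = parse_left_right_count(output)
    return github_only == 0


def check_repo(branch: str = "master") -> bool:
    """Run the comparison in the current clone (remote names are assumptions)."""
    result = subprocess.run(
        ["git", "rev-list", "--left-right", "--count",
         f"forgejo/{branch}...github/{branch}"],
        check=True, capture_output=True, text=True,
    )
    return mirror_enable_is_safe(result.stdout)
```

A `False` result means GitHub-only commits exist and would be lost by the force-overwrite, so they need to be merged into Forgejo first.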


@ -108,6 +108,31 @@ All new users must use an invitation link to register. The invitation-enrollment
Group membership is auto-assigned from the invitation's `fixed_data` field. This prevents open registration while maintaining SSO convenience.
### TripIt External self-signup (open enrollment, fenced)
Unlike every other app, **TripIt allows open public self-signup** for people
outside the homelab (ADR-0020 in the tripit repo; runbook
`docs/runbooks/tripit-external-signup.md`). A dedicated public `tripit-enrollment`
flow (email + passkey, no password) creates the account and stamps it into the
parentless **`TripIt External`** group. Containment is two-layered:
- **Forward-auth apps**: a branch prepended to the `admin-services-restriction`
catch-all policy admits `TripIt External` to `tripit.viktorbarzin.me` only and
denies every other `auth="required"` host.
- **OIDC apps**: that branch does NOT cover OIDC (OIDC bypasses forward-auth).
External users are contained because every sensitive OIDC app already requires a
trusted group they do not hold — audited 2026-06-15:
Immich/Grafana/Linkwarden/Cloudflare Access → `Home Server Admins`, Forgejo →
`Task Submitters`/`Forgejo Users`, Headscale → `Headscale Users`, wrongmove →
`Wrongmove Users`. **Vault** was OPEN (any OIDC identity got a powerless
`default`-policy token) and is bound to **`Allow Login Users`** as part of this
change. The Kubernetes OIDC clients are OPEN but idle (apiserver rejects OIDC).
**Invariants**: keep `TripIt External` parentless (never under `Allow Login
Users`); keep the catch-all branch first; never co-assign `TripIt External` to a
trusted/internal user; the `tripit-enrollment` user_write "Create users group"
setting is the keystone that tags every signup.
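
The forward-auth branch and the co-assignment invariant above can be modelled in a few lines. This is only a sketch of the decision logic in plain Python, not the actual Authentik policy body; the function names and the `TRUSTED_GROUPS` set are illustrative, with group names taken from the audit list above.

```python
from typing import Optional

TRIPIT_EXTERNAL = "TripIt External"
TRIPIT_HOST = "tripit.viktorbarzin.me"

# Trusted groups named in the audit above (illustrative, not exhaustive).
TRUSTED_GROUPS = {
    "Allow Login Users", "Home Server Admins", "Task Submitters",
    "Forgejo Users", "Headscale Users", "Wrongmove Users",
}


def external_branch(groups: set[str], host: str) -> Optional[bool]:
    """First branch of the catch-all: decide for TripIt External users only.

    Returns True/False for an external user, or None to fall through to the
    rest of the catch-all policy for everyone else.
    """
    if TRIPIT_EXTERNAL not in groups:
        return None
    return host == TRIPIT_HOST


def violates_coassignment(groups: set[str]) -> bool:
    """True when an external user also holds a trusted group (invariant broken)."""
    return TRIPIT_EXTERNAL in groups and bool(groups & TRUSTED_GROUPS)
```

Returning `None` for non-external users mirrors why the branch has to stay first: it only short-circuits the external case and leaves every other decision to the existing policy.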
### OIDC Applications
Authentik provides OIDC for 10 applications:


@ -2,334 +2,378 @@
## Overview
The CI/CD pipeline uses a hybrid approach: GitHub Actions for building Docker images (providing free compute for public repos) and Woodpecker CI for deployments (leveraging cluster-internal access). Git pushes trigger GHA builds that produce Docker images with 8-character SHA tags, push to DockerHub, then POST to Woodpecker's API to trigger deployments that update Kubernetes workloads via `kubectl set image`.
**Doctrine (ADR-0002): all image builds and CI compute run OFF-infra.** Every
owned image is built, tested, and linted on **GitHub Actions** (free on public
repos; 2000 free min/mo on private) and pushed to **`ghcr.io/viktorbarzin/<name>`**.
Woodpecker is **deploy-only** — a GHA job POSTs its API with the freshly-built
image tag and Woodpecker runs `kubectl set image` from inside the cluster.
There are **no in-cluster image builds or CI test runs anywhere** — the
in-cluster Woodpecker buildkit and the fallback-build pattern were removed as a
clean cut (ADR-0002, 2026-06-13). The Forgejo container registry is **frozen
and emptied** — break-glass only.
This breaks the old circular dependency (images needed to repair the cluster
used to be built and stored *inside* it) and keeps build IO + registry pushes
off the homelab spindle.
## Architecture Diagram
```mermaid
graph LR
A[Git Push] --> B[GitHub Actions]
B --> C[Build Docker Image<br/>linux/amd64, 8-char SHA tag]
C --> D[Push to DockerHub]
D --> E[POST Woodpecker API]
E --> F[Woodpecker Pipeline]
F --> G[Vault K8s Auth<br/>SA JWT]
G --> H[kubectl set image]
H --> I[K8s Deployment]
I --> J[Pull from DockerHub<br/>or Pull-Through Cache]
A[git push Forgejo<br/>viktor/&lt;repo&gt; canonical] --> B[push-mirror sync_on_commit]
B --> C[GitHub mirror<br/>ViktorBarzin/&lt;repo&gt;]
C --> D[GitHub Actions<br/>.github/workflows/build.yml]
D --> E[lint / test]
E --> F[buildx linux/amd64<br/>provenance:false]
F --> G[push ghcr.io/viktorbarzin/&lt;name&gt;<br/>:sha8 + :latest]
G --> H[svu tag -> Forgejo canonical]
G --> I[POST Woodpecker deploy repo]
I --> J[.woodpecker/deploy.yml<br/>event: manual]
J --> K[kubectl set image<br/>in-cluster SA cluster-admin]
K --> L[K8s Deployment<br/>pulls from ghcr]
K[Pull-Through Cache<br/>10.0.20.10] -.-> J
L[forgejo.viktorbarzin.me<br/>Private Registry on Forgejo] -.-> J
style B fill:#2088ff
style F fill:#4c9e47
style K fill:#f39c12
style D fill:#2088ff
style J fill:#4c9e47
style G fill:#f39c12
```
## Components
| Component | Version | Location | Purpose |
|-----------|---------|----------|---------|
| GitHub Actions | Cloud | `.github/workflows/build-and-deploy.yml` | Build Docker images, push to DockerHub |
| Woodpecker CI | Self-hosted | `ci.viktorbarzin.me` | Deploy to Kubernetes cluster |
| DockerHub | Cloud | `viktorbarzin/*` | Public image registry |
| Private Registry | Forgejo Packages | `forgejo.viktorbarzin.me/viktor` | Private container images (PAT auth, retention CronJob) — migrated from registry.viktorbarzin.me 2026-05-07 |
| Pull-Through Cache | Custom | `10.0.20.10:5000` (docker.io)<br/>`10.0.20.10:5010` (ghcr.io) | LAN cache for remote registries |
| Kyverno | Cluster | `kyverno` namespace | Auto-sync registry credentials to all namespaces |
| Vault | Cluster | `vault.viktorbarzin.me` | K8s auth for Woodpecker pipelines |
| Component | Location | Purpose |
|-----------|----------|---------|
| GitHub Actions | `.github/workflows/build.yml` (per repo) | Build + lint + test + push image; trigger deploy; cut semver tag |
| ghcr.io | `ghcr.io/viktorbarzin/*` | Container registry for ALL owned images (public + private packages) |
| Woodpecker CI | `ci.viktorbarzin.me` | **Deploy-only**: `kubectl set image` in-cluster; plus infra applies + maintenance crons |
| Forgejo | `forgejo.viktorbarzin.me/viktor/<repo>` | **Canonical** git source (push-mirrors to GitHub). Container registry **FROZEN** (break-glass only) |
| Pull-Through Cache | `10.0.20.10:5000/5010/5020/5030/5040` | LAN cache for upstream registries (DockerHub, ghcr, Quay, k8s.gcr, Kyverno) |
| Kyverno | `kyverno` namespace | Syncs `ghcr-credentials` (private-ghcr allowlist) + `registry-credentials` to namespaces |
| Vault | `vault.viktorbarzin.me` | K8s auth for Woodpecker deploy pipelines; CI tokens in `secret/ci/global` + `secret/viktor` |
## How It Works
### Build Flow (GitHub Actions)
### The fleet pattern (every owned app)
1. **Trigger**: Git push to main/master branch
2. **Build**: GHA builds Docker image for `linux/amd64` platform only
3. **Tag**: Image tagged with 8-character commit SHA (e.g., `viktorbarzin/app:a1b2c3d4`)
- `:latest` tags are **never used** to prevent stale pull-through cache issues
4. **Push**: Image pushed to DockerHub public registry
5. **Trigger Deploy**: POST request to Woodpecker API with repo ID and commit SHA
1. **Canonical source = Forgejo** `viktor/<repo>`. A **push-mirror**
(`sync_on_commit`) pushes every commit to the GitHub mirror
`ViktorBarzin/<repo>`. The `.github/workflows/build.yml` is committed on
Forgejo and mirrors over.
2. **GHA `build` job** (triggers `on: push: branches: [master]` ONLY — feature
branches mirror but build/deploy nothing, the safety valve):
- lint + test
- `svu` computes the next `vX.Y.Z` from conventional commits and pushes the
tag back to **canonical Forgejo** (GHA secret `FORGEJO_GIT_TOKEN` =
write:repository PAT); `VERSION` is baked into the image
- `docker buildx` `linux/amd64`, **`provenance: false`** (single-manifest —
avoids the orphaned-index-children failure class), push
`ghcr.io/viktorbarzin/<name>:<sha8>` + `:latest`
- `delete-package-versions` keeps the newest ~10 ghcr versions
3. **GHA `deploy` job** POSTs `ci.viktorbarzin.me/api/repos/<id>/pipelines`
(the Woodpecker registration for the **GitHub mirror**, github-forge; GHA
secret `WOODPECKER_TOKEN`) with `IMAGE_TAG` + `IMAGE_NAME`.
4. **`.woodpecker/deploy.yml`** (event: **manual** only, so the raw
Forgejo→GitHub mirror pushes don't fire a tag-less deploy) runs `kubectl set
image deployment/<app> <container>=<image>` in-cluster. The `woodpecker-agent`
SA is `cluster-admin`, so the `bitnami/kubectl` step needs no
kubeconfig/RBAC. The Deployment image is in `lifecycle.ignore_changes`
(`KEEL_IGNORE_IMAGE`) so the SHA tag sticks and `terragrunt apply` doesn't
fight it. CronJobs in owned apps track `:latest` + `imagePullPolicy: Always`
instead of a deploy step.
### Deploy Flow (Woodpecker CI)
**Keel stays enrolled** as a redundant net (finds the deployed SHA already
running → no-op).
1. **Receive Webhook**: Woodpecker API receives deployment trigger from GHA
2. **Authenticate**: Pipeline uses Kubernetes ServiceAccount JWT to authenticate with Vault via K8s auth
3. **Deploy**: `kubectl set image deployment/<name> <container>=viktorbarzin/<app>:<sha>`
4. **Notify**: Slack notification on success/failure
**Tooling**: `infra/scripts/offinfra-onboard` + `infra/scripts/offinfra-templates/`
scaffold a repo onto this pattern (mirror, workflow, Woodpecker deploy repo,
old-pipeline removal, default-branch flip). Mirror + workflow commits go via
the Forgejo API over the internal Traefik LB
(`curl --resolve forgejo.viktorbarzin.me:443:10.0.20.203`) since the devvm
can't reach Forgejo's public hairpin.
### Project Migration Status
### ghcr package visibility
**Migrated to GHA (8 projects)**:
- Website
- k8s-portal
- claude-memory-mcp
- apple-health-data
- audiblez-web
- plotting-book
- insta2spotify
- book-search (audiobook-search)
| Visibility | Packages | Pull mechanism |
|------------|----------|----------------|
| **Public** | beadboard, nextcloud-todos, claude-agent-service, claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, chrome-service-novnc, android-emulator | Anonymous |
| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci | `ghcr-credentials` dockerconfigjson |
**Woodpecker-native owned-app builds** (build + push to the Forgejo private
registry + `kubectl set image` rollout, all in one `.woodpecker.yml`; Keel
stays enrolled as a redundant net): `tuya_bridge`, `job-hunter`, `f1-stream`.
`f1-stream` was extracted from this monorepo to `viktor/f1-stream` on
2026-06-05 (Woodpecker repo id 166); the old github source is archived and its
GHA-era Woodpecker repo (id 10) is deactivated.
Private-image pulls use the `ghcr-credentials` dockerconfigjson, cloned by the
kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit
**ALLOWLIST** of private-ghcr namespaces only (NOT cluster-wide; source
`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`). Cred = Vault
`secret/viktor/ghcr_pull_token` (a dedicated classic PAT scoped to
`read:packages`, UI-minted 2026-06-15 — no longer the admin `github_pat` alias.
GitHub has no token-mint API, so rotation is manual: re-mint the classic
`read:packages` PAT → `vault kv patch secret/viktor ghcr_pull_token=…` →
targeted apply `module.kyverno.kubernetes_secret.ghcr_credentials` (reads Vault;
avoids the git-crypt `tls-secret-sync` landmine on a locked clone), which
Kyverno then re-syncs to the allowlisted namespaces).
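
For orientation, the payload of a `dockerconfigjson` pull secret has the shape sketched below. This is a generic illustration of the standard Docker config format, not the Terraform in `ghcr-credentials.tf`; the function name, username and token values are placeholders.

```python
import base64
import json


def dockerconfigjson(registry: str, username: str, token: str) -> str:
    """Build the JSON document stored under `.dockerconfigjson` in a pull secret."""
    # The `auth` field is base64("username:token"), as in ~/.docker/config.json.
    auth = base64.b64encode(f"{username}:{token}".encode()).decode()
    config = {
        "auths": {
            registry: {"username": username, "password": token, "auth": auth}
        }
    }
    return json.dumps(config)


if __name__ == "__main__":
    # Placeholder credentials; the real token lives in Vault.
    print(dockerconfigjson("ghcr.io", "example-user", "example-token"))
```

Rotating the PAT changes only the `password` and `auth` fields, which is why a targeted apply of the single secret is enough before Kyverno re-syncs it.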
**Woodpecker-only (infra + large apps)**:
- `travel_blog`: 5.7GB content directory exceeds GHA limits
- Infra pipelines: require cluster access (terragrunt apply, certbot, build-cli)
### Migrated apps (issues #13–#27)
### Woodpecker Pipeline Files
f1-stream, job-hunter, tuya_bridge, beadboard, nextcloud-todos,
claude-agent-service, claude-memory-mcp, kms-website, Freedify,
instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`),
fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original
pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website,
k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify,
audiobook-search, council-complaints) now also land on ghcr.
Each project contains:
- `.woodpecker/deploy.yml`: kubectl set image + Slack notification
- `.woodpecker/build-fallback.yml`: Legacy full build pipeline (event: deployment, never auto-fires)
### Infra-owned images (issues #29 / #30)
### Woodpecker Repository IDs
Images owned by the infra repo build on GHA workflows **in the infra repo's own
`.github/workflows/`** (the github↔forgejo divergence was deliberately NOT
reconciled — the workflows were added to the GitHub lineage via PR):
Woodpecker API uses numeric IDs (not owner/name):
| Image | Workflow | Destination |
|-------|----------|-------------|
| chrome-service-novnc | `build-chrome-service-novnc.yml` | public `ghcr.io/viktorbarzin/chrome-service-novnc` |
| android-emulator | `build-android-emulator.yml` | public `ghcr.io/viktorbarzin/android-emulator` |
| infra CLI | `build-cli.yml` | DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli` |
| infra-ci | `build-infra-ci.yml` | private `ghcr.io/viktorbarzin/infra-ci` |
| Repo | ID |
|------|------|
| infra | 1 |
| Website | 2 |
| finance | 3 |
| health | 4 |
| travel_blog | 5 |
| webhook-handler | 6 |
| audiblez-web | 9 |
| plotting-book | 43 |
| claude-memory-mcp | 78 |
| infra-onboarding | 79 |
**`infra-ci`** is the image the `.woodpecker/default.yml` apply step and
`drift-detection.yml` run in (proven by pipelines 165/166). `chatterbox-tts` is
already built by tripit's GHA → ghcr.
### Image Registry Flow
The Woodpecker `build-ci-image.yml` and `build-cli.yml` pipelines were
**REMOVED**. Break-glass for infra-ci is now a manual
`.woodpecker/breakglass-infra-ci.yml` (ghcr pull-and-save to the registry VM).
1. **Containerd hosts.toml** redirects pulls from docker.io and ghcr.io to pull-through cache at `10.0.20.10`
2. **Pull-through cache** serves cached images from LAN, fetches from upstream on cache miss
3. **Kyverno ClusterPolicy** auto-syncs `registry-credentials` Secret to all namespaces for private registry access
4. **Private registry** has been Forgejo's built-in OCI registry at `forgejo.viktorbarzin.me/viktor/<image>` since 2026-05-07. Auth via PAT (Vault `secret/ci/global/forgejo_push_token` for push, `secret/viktor/forgejo_pull_token` for pull). The pre-migration `registry:2.8.3`-based private registry on `registry.viktorbarzin.me:5050` was the root cause of three orphan-index incidents in three weeks (2026-04-13, 2026-04-19, 2026-05-04 — see `docs/post-mortems/2026-04-19-registry-orphan-index.md` and the full migration writeup at `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md`). The five pull-through caches on `10.0.20.10` (ports 5000/5010/5020/5030/5040) stay in place for upstream registries.
5. **Integrity probe** (`registry-integrity-probe` CronJob in `monitoring` ns, every 15m) walks `/v2/_catalog` → tags → indexes → child manifests via HEAD and pushes `registry_manifest_integrity_failures` to Pushgateway; alerts `RegistryManifestIntegrityFailure` / `RegistryIntegrityProbeStale` / `RegistryCatalogInaccessible` page on broken state. Authoritative check (HTTP API, not filesystem).
### Forgejo container registry — FROZEN
### Infra Pipelines (Woodpecker-only)
Issue #32 wiped all `viktor/*` container packages (~19G reclaimed, `/data`
58%→20%). The registry is **break-glass-only** now; nothing pushes to it. The
`forgejo-cleanup` CronJob stays in `DRY_RUN` (nothing to clean). Pull-through
caches on the registry VM (`10.0.20.10`) are unchanged. See
`docs/runbooks/forgejo-registry-breakglass.md`.
### Image registry / pull path
1. **Containerd `hosts.toml`** redirects pulls from docker.io and ghcr.io to the
pull-through cache at `10.0.20.10` (5000 = docker.io, 5010 = ghcr.io).
2. **Pull-through cache** serves cached images from the LAN, fetches upstream on
a miss.
3. **Kyverno ClusterPolicies** sync `ghcr-credentials` (private-ghcr allowlist)
and `registry-credentials` to namespaces.
## Woodpecker — what it still runs
Woodpecker is **deploy + cluster-touching steps only**:
| Pipeline | File | Purpose |
|----------|------|---------|
| default | `.woodpecker/default.yml` | Terragrunt apply on push |
| renew-tls | `.woodpecker/renew-tls.yml` | Certbot renewal cron |
| build-cli | `.woodpecker/build-cli.yml` | Build and push to dual registries |
| build-ci-image | `.woodpecker/build-ci-image.yml` | Build `infra-ci` tooling image (triggered by `ci/Dockerfile` change or manual); post-push HEADs every blob via `verify-integrity` step to catch orphan-index pushes |
| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered build for k8s-portal subdirectory |
| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` to `/opt/registry/` on `10.0.20.10` when any managed file changes; bounces containers + nginx per `docs/runbooks/registry-vm.md` |
| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE host |
| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new `docs/post-mortems/*.md` via headless Claude agent |
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift detection |
| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues |
| per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) |
| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) |
| certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron |
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) |
| provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` → `10.0.20.10` on change |
| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE |
| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues |
| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new post-mortems |
| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered deploy for the portal |
| breakglass-infra-ci | `.woodpecker/breakglass-infra-ci.yml` | **Manual** ghcr pull-and-save of infra-ci to the registry VM |
**No build/test pipeline exists on any repo.** Do not (re)introduce one.
### Woodpecker API
Uses **numeric repo IDs** (`/api/repos/<id>/pipelines`), NOT owner/name paths
(those return HTML). The deploy registration for each app is the **GitHub
mirror** repo (registered github-forge). IDs are stable across renames and must
be looked up from the Woodpecker UI/DB.
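
As a rough illustration of the numeric-ID call, the deploy trigger can be assembled as below. The endpoint path, the `IMAGE_NAME`/`IMAGE_TAG` variables and the `branch`/`variables` body shape come from this document; the function names, the repo ID in the usage example and the choice of `urllib` are assumptions for the sketch, and the request is only built here, not sent.

```python
import json
import urllib.request

WOODPECKER_URL = "https://ci.viktorbarzin.me"


def sha8(commit_sha: str) -> str:
    """Shorten a full commit SHA to the 8-character image tag."""
    return commit_sha[:8]


def build_deploy_request(repo_id: int, token: str, image_name: str,
                         image_tag: str, branch: str = "master") -> urllib.request.Request:
    """Build the POST that triggers a manual deploy pipeline for one repo."""
    body = {
        "branch": branch,
        "variables": {"IMAGE_NAME": image_name, "IMAGE_TAG": image_tag},
    }
    return urllib.request.Request(
        # Numeric repo ID, not owner/name (owner/name paths return HTML).
        f"{WOODPECKER_URL}/api/repos/{repo_id}/pipelines",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical repo ID and token; pass the result to urllib.request.urlopen to send.
    req = build_deploy_request(123, "example-token",
                               "ghcr.io/viktorbarzin/example",
                               sha8("a1b2c3d4e5f60718293a4b5c6d7e8f9012345678"))
    print(req.get_method(), req.full_url)
```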
### Woodpecker YAML gotchas
- Commands with `${VAR}:${VAR}` must be **quoted** — an unquoted `:` triggers
YAML map parsing when the vars are empty.
- Use `bitnami/kubectl:latest` (not pinned versions — entrypoint compatibility).
- Global secrets must include `manual` in their events list for API-triggered
pipelines.
### GitHub repo secrets
Per repo: `WOODPECKER_TOKEN` (POST the deploy pipeline), `FORGEJO_GIT_TOKEN`
(write:repository PAT for the `svu` tag push). ghcr push uses the workflow's
built-in `GITHUB_TOKEN` (`packages: write`).
## Infra repo CI topology
The infra repo runs on Woodpecker via **two** forge registrations: the Forgejo
forge (repo id 82, registered 2026-06-08) and the legacy GitHub forge (repo id
1). Pushes to **Forgejo** `master` fire `.woodpecker/default.yml`
(changed-stacks terragrunt apply, in `infra-ci`) plus the `notify-nonadmin-push`
Slack audit step. Operational facts (2026-06-10):
- **Webhook URL is the IN-CLUSTER service**:
`http://woodpecker-server.woodpecker.svc.cluster.local/api/hook?...` (PATCHed
via the Forgejo API). The Woodpecker default (`https://ci.viktorbarzin.me/...`)
resolves to the non-proxied public A record from pods → NAT hairpin →
intermittent `context deadline exceeded`, silently dropping push events. If
Woodpecker "repairs" the repo it rewrites the hook back to `ci.viktorbarzin.me`
— re-apply the in-cluster URL.
- **Repo-scoped secrets must exist on BOTH repos**: pipelines reference
repo-level secrets (`registry_ssh_key`, `pve_ssh_key`, `CLOUDFLARE_TOKEN`, …).
When registering a new forge repo for infra, clone the secret set too.
- **Empty commits defeat path filters**: a commit with no changed files makes
Woodpecker include ALL workflow files (path conditions can't exclude), so every
repo secret must resolve. Normal commits with real files only compile the
matching workflows.
The Forgejo trigger is not fully dependable — land infra changes by pushing
Forgejo master (as viktor), use `[ci skip]` for docs/no-op commits, and verify
deploys via `scripts/tg` + live cluster state rather than trusting the CI
checkmark. The two remotes have **diverged** (parallel histories under
different SHAs); expect github pushes to reject non-fast-forward and leave them
— never force-push.
## Configuration
### GitHub Actions
**File**: `.github/workflows/build-and-deploy.yml`
### GitHub Actions (per-app `.github/workflows/build.yml`)
```yaml
name: Build and Deploy
name: build
on:
  push:
    branches: [main, master]
    branches: [master]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: write # svu tag push
      packages: write # ghcr push
    steps:
      - name: Build Docker image
        run: docker build --platform linux/amd64 -t viktorbarzin/app:${SHORT_SHA} .
      - name: Push to DockerHub
        run: docker push viktorbarzin/app:${SHORT_SHA}
      - name: Trigger Woodpecker Deploy
      - uses: actions/checkout@v4
      - name: lint + test
        run: make lint test
      - name: svu tag -> Forgejo
        run: |
          curl -X POST https://ci.viktorbarzin.me/api/repos/<REPO_ID>/pipelines \
            -H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}"
          VERSION=$(svu next)
          # ... push tag to canonical Forgejo with FORGEJO_GIT_TOKEN
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v6
        with:
          platforms: linux/amd64
          provenance: false
          push: true
          tags: |
            ghcr.io/viktorbarzin/<name>:${{ github.sha }}
            ghcr.io/viktorbarzin/<name>:latest
  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Trigger Woodpecker deploy
        run: |
          curl -X POST https://ci.viktorbarzin.me/api/repos/<DEPLOY_REPO_ID>/pipelines \
            -H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}" \
            -d '{"branch":"master","variables":{"IMAGE_TAG":"...","IMAGE_NAME":"..."}}'
```
**Required GitHub Secrets**:
- `DOCKERHUB_USERNAME`
- `DOCKERHUB_TOKEN`
- `WOODPECKER_TOKEN`
### Woodpecker Deploy Pipeline
**File**: `.woodpecker/deploy.yml`
### Woodpecker deploy pipeline (per-app `.woodpecker/deploy.yml`)
```yaml
when:
  event: [deployment]
  event: manual
steps:
  deploy:
    image: bitnami/kubectl:latest
    image: bitnami/kubectl:latest # uses the in-cluster woodpecker-agent SA (cluster-admin)
    commands:
      - kubectl set image deployment/app app=viktorbarzin/app:${CI_COMMIT_SHA:0:8}
    secrets: [k8s_token]
      - "kubectl set image deployment/app app=${IMAGE_NAME}:${IMAGE_TAG} -n <ns>"
      - "kubectl rollout status deployment/app -n <ns> --timeout=300s"
  notify:
    image: plugins/slack
    settings:
      webhook: ${SLACK_WEBHOOK}
    when:
      status: [success, failure]
```
**YAML Gotchas**:
- Commands with `${VAR}:${VAR}` syntax must be quoted to prevent YAML map parsing when vars are empty
- Use `bitnami/kubectl:latest` (not pinned versions)
- Global secrets must be manually added to `secrets:` list in pipeline
### CI/CD secrets sync
### Vault Configuration
**K8s Auth for Woodpecker**:
- Woodpecker pipelines authenticate using ServiceAccount JWT
- Vault K8s auth mount validates JWT and issues token
- Policies grant access to secrets and dynamic credentials
### CI/CD Secrets Sync
**CronJob**: Pushes `secret/ci/global` from Vault → Woodpecker API every 6 hours
- Keeps Woodpecker global secrets in sync with Vault
- Runs in `woodpecker` namespace
## Infra repo CI (Woodpecker repo 82 — Forgejo forge)
The infra repo itself runs on Woodpecker via the **Forgejo** forge (repo id 82,
registered 2026-06-08; the GitHub-side repo id 1 also remains registered).
Pushes to `master` fire `.woodpecker/default.yml` (changed-stacks terragrunt
apply) plus the `notify-nonadmin-push` Slack audit step (allow-then-audit
contribution model — see `multi-tenancy.md`). Operational facts (2026-06-10):
- **Webhook URL is the IN-CLUSTER service**: `http://woodpecker-server.woodpecker.svc.cluster.local/api/hook?...`
(PATCHed via the Forgejo API). The Woodpecker-generated default
(`https://ci.viktorbarzin.me/...`) resolves to the non-proxied public A
record from pods → NAT hairpin → intermittent `context deadline exceeded`,
silently dropping push events (found when a push produced no pipeline).
If Woodpecker ever "repairs" the repo it will rewrite the hook back to
`ci.viktorbarzin.me` — re-apply the in-cluster URL (or pin `ci.viktorbarzin.me`
in the CoreDNS pod carve-out alongside forgejo).
- **Repo-scoped secrets must exist on BOTH repos**: pipelines reference
repo-level secrets (`registry_ssh_key`, `pve_ssh_key`, `CLOUDFLARE_TOKEN`,
…). Repo 82 was registered without them and every all-workflow compile
errored with `secret "registry_ssh_key" not found`. Fixed by cloning repo-1
rows to repo 82 in the Woodpecker DB (`insert into secrets … select … where
repo_id=1`). When registering a new forge repo for infra, clone the secret
set too.
- **Empty commits defeat path filters**: a commit with no changed files makes
Woodpecker include ALL workflow files (path conditions can't exclude), so
every repo secret must resolve. Normal commits with real files only compile
the matching workflows.
A CronJob in the `woodpecker` namespace pushes `secret/ci/global` from Vault →
the Woodpecker API every 6h, keeping global secrets in sync. Woodpecker deploy
pipelines authenticate to the cluster via the in-cluster `woodpecker-agent` SA
(cluster-admin); Vault K8s auth backs any secret reads.
## Decisions & Rationale
### Why GitHub Actions + Woodpecker?
### Why all builds off-infra (ADR-0002)?
**Alternatives considered**:
1. **Woodpecker-only**: Simple, but wastes cluster resources on builds
2. **GHA-only**: No cluster access, requires kubectl from outside (security risk)
3. **Hybrid (chosen)**: GHA for compute-heavy builds (free), Woodpecker for privileged deployments (secure cluster access)
- **Breaks the circular dependency** — the images needed to repair the cluster
no longer live inside it (they're on ghcr, an external registry).
- **Removes build IO + registry push load** from the contended homelab spindle.
- GHA is free on public repos and generous on private ones; building with
buildx `provenance: false` sidesteps the orphaned-index-children failure class
that plagued the in-cluster registry.
- **Clean cut** — no in-cluster fallback builds anywhere; one pattern,
fleet-wide.
**Benefits**:
- Free compute for builds on public repos
- Cluster access stays internal (Woodpecker has direct K8s access)
- Separation of concerns: build vs deploy
### Why ghcr (not push back to Forgejo)?
### Why 8-Character SHA Tags (Not :latest)?
Forgejo's container registry repeatedly orphaned OCI index children
(2026-04-13/19, 2026-05-04, 2026-06-10) and its retention is not container-aware.
ghcr is external (DR-safe), free for this scale, and has native multi-arch
handling. The Forgejo registry was frozen + emptied (issue #32).
- Pull-through cache serves stale `:latest` tags indefinitely
- SHA tags ensure every deployment pulls the correct image
- 8 characters provide sufficient collision resistance (16^8 = 4.3 billion combinations)
### Why Woodpecker stays for deploy?
### Why Numeric Repo IDs for Woodpecker API?
`kubectl set image` needs in-cluster privileged access; doing it from GHA would
mean exposing kube-apiserver or a long-lived kubeconfig. Woodpecker's
`woodpecker-agent` SA is already cluster-admin in-cluster — the deploy step
needs no credentials.
- Woodpecker API requires numeric IDs (not owner/name slugs)
- IDs are stable across repo renames
- Must be manually looked up from Woodpecker UI or database
### Why `event: manual` on deploy.yml?
### Why linux/amd64 Only?
The Forgejo→GitHub push-mirror sends raw, tag-less pushes to the GitHub mirror.
If `deploy.yml` fired on `push`, every mirror sync would trigger a deploy with no
image tag. `manual` means only the GHA `deploy` job's explicit API POST (with
`IMAGE_TAG`) deploys.
- Cluster runs on x86_64 nodes only
- ARM builds would waste time and storage
- Multi-arch images add complexity without benefit
### Why linux/amd64 only?
The cluster runs on x86_64 nodes only; ARM builds waste time and storage.
## Troubleshooting
### GHA Build Fails: "denied: requested access to the resource is denied"
### GHA build fails: ghcr push "denied"
**Cause**: DockerHub credentials expired or incorrect
The workflow `GITHUB_TOKEN` needs `packages: write` permission and the package
must allow the repo to push. Check the workflow `permissions:` block and the
package's "Manage Actions access" settings.
### Image pull fails: "ErrImagePull" / "ImagePullBackOff"
**Fix**:
```bash
# Regenerate DockerHub token
# Update GitHub repo secrets: DOCKERHUB_USERNAME, DOCKERHUB_TOKEN
# Public image — check the pull-through cache is up
curl http://10.0.20.10:5010/v2/_catalog
# Private image — verify the ghcr-credentials Secret exists in the namespace
kubectl get secret ghcr-credentials -n <namespace>
# It's Kyverno-synced to an allowlist; if missing, the namespace isn't on the
# allowlist in stacks/kyverno/modules/kyverno/ghcr-credentials.tf
```
### Woodpecker Deploy Fails: "Unauthorized"
If the cause is the internal-DNS hairpin (fresh pulls timing out on the public
Forgejo path), see the CoreDNS `viktorbarzin.me` carve-out in
`docs/architecture/networking.md` and `docs/runbooks/registry-vm.md`.
**Cause**: Vault K8s auth token expired or invalid
### Deploy didn't happen after a push
**Fix**:
```bash
# Restart Woodpecker pipeline (token auto-renewed)
# Check Vault K8s auth role exists: vault read auth/kubernetes/role/woodpecker-deployer
```
Confirm the push was to **master** (feature branches build/deploy nothing).
Check the GHA run completed the `deploy` job, then check Woodpecker received the
manual pipeline (`ci.viktorbarzin.me`, the GitHub-mirror deploy repo). Verify
live with `kubectl rollout status` — not the CI checkmark.
### Image Pull Fails: "ErrImagePull"
### Woodpecker deploy fails: "YAML: did not find expected key"
**Cause**: Pull-through cache or registry credentials issue
**Fix**:
```bash
# Check pull-through cache is running
curl http://10.0.20.10:5000/v2/_catalog
# Verify registry-credentials Secret exists in namespace
kubectl get secret registry-credentials -n <namespace>
# Manually sync credentials if missing
kubectl get secret registry-credentials -n default -o yaml | \
sed 's/namespace: default/namespace: <namespace>/' | kubectl apply -f -
```
### Woodpecker Pipeline: "YAML: did not find expected key"
**Cause**: Unquoted command with `${VAR}:${VAR}` syntax when VAR is empty
**Fix**: Quote the command:
```yaml
commands:
- "kubectl set image deployment/app app=viktorbarzin/app:${SHORT_SHA}"
```
### travel_blog Build Times Out on GHA
**Cause**: 5.7GB content directory exceeds GHA disk/time limits
**Fix**: Keep on Woodpecker (no migration). Build uses cluster storage and resources.
### CI/CD Secrets Out of Sync
**Cause**: CronJob failed to sync Vault → Woodpecker
**Fix**:
```bash
# Check CronJob status
kubectl get cronjob -n woodpecker
# Manually trigger sync
kubectl create job --from=cronjob/sync-secrets manual-sync -n woodpecker
```
**Cause**: an unquoted command containing `${VAR}:${VAR}` is parsed as a YAML
map when a variable is empty. Quote the command (see the deploy.yml example above).
## Related
- [Databases Architecture](./databases.md) — Database credentials via Vault
- [Multi-Tenancy](./multi-tenancy.md) — Per-user Woodpecker access
- Runbook: `../runbooks/deploy-new-app.md` — How to set up CI/CD for a new app
- Runbook: `../runbooks/troubleshoot-image-pull.md` — Debug image pull issues
- Vault documentation: K8s auth configuration
- Woodpecker documentation: API reference
- ADR: `../adr/0002-all-image-builds-off-infra-gha-ghcr.md` — the decision
- [Databases Architecture](./databases.md) — database credentials via Vault
- [Multi-Tenancy](./multi-tenancy.md) — per-user Woodpecker access
- Runbook: `../runbooks/forgejo-registry-breakglass.md` — using the frozen registry
- Runbook: `../runbooks/registry-vm.md` — pull-through cache VM + image-pull debugging
- Onboarding tool: `../../scripts/offinfra-onboard` + `../../scripts/offinfra-templates/`

@ -543,6 +543,10 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1
**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.
**Onboarding state self-heals (2026-06-15):** `~/.claude.json` is a single file that ALL of a user's concurrent `claude` processes (the ttyd terminal + their `t3-serve` instance + agent/SDK sessions) read-modify-write, so a stale writer periodically drops top-level keys — including `hasCompletedOnboarding` — which bounces the next *interactive* session back to the first-run "Choose the text style" wizard even though the user is fully logged in (credentials live in the SEPARATE `~/.claude/.credentials.json`, untouched by the race; first observed for emo 2026-06-15). The launcher (`skel/start-claude.sh`) now idempotently re-asserts `hasCompletedOnboarding` (+ `lastOnboardingVersion`) in `~/.claude.json` right before it runs `claude` — merge-only, never clobbers other keys, no-op if jq is missing or the file is empty/corrupt. And since the launcher is a per-user copy that `/etc/skel` only seeds at account creation, the reconcile's new `deploy_user_launcher` step re-copies `skel/start-claude.sh` into every non-admin home (copy-if-changed) so launcher edits now reach EXISTING users within the hour — `.tmux.conf` is deliberately NOT re-copied (terminal-lobby appends its own managed section to it).
**Claude Code runtime (native, per-user; 2026-06-15):** `claude` is the **native** install (`~/.local/bin/claude` → `~/.local/share/claude/versions/<v>`, self-updating; `installMethod: native`), NOT npm-global or npx. It is the runtime for both the ttyd launcher and each `t3-serve` instance. `setup-devvm.sh` installs node ONLY for the `t3` CLI (not claude); per-user native claude is provisioned by the reconcile's `install_user_claude_native` (covers terminal + t3, idempotent, skip-if-present) and self-bootstrapped by `start-claude.sh` on first launch, both via the official `https://claude.ai/install.sh`. The legacy machine-wide `npm install -g @anthropic-ai/claude-code` bootstrap and the launcher's `npx` fallback were removed; existing users had already auto-migrated to native, and the npm-global dir was empty. **PATH (`~/.local/bin`, where the native binary lives):** ensured three ways: `/etc/profile.d/10-local-bin.sh` for login shells (machine-wide, fresh-user-safe), `start-claude.sh` itself (the launcher runs in tmux's non-login env that skips the user's shell rc), and `t3-serve@.service` (`Environment=PATH=…:/home/%i/.local/bin`).
**Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. Its location depends on the per-user `code_layout` in `roster.yaml`: `single` (default) puts the clone AT `~/code`; `workspace` makes `~/code` a plain directory of per-project clones — the infra clone at `~/code/infra` plus each roster `repos` entry cloned from Forgejo `viktor/<name>` **as the user** (their PAT authenticates, so private repos work; clone failures WARN and retry next hour). Flipping a user to `workspace` auto-migrates their existing `~/code` clone to `~/code/infra` (local branches/dirty state survive; running processes follow the moved inode). ancamilea = workspace + `tripit` since 2026-06-10. The provisioner clones infra anonymously from the public GitHub mirror; **contribute access is wired per-user on top** (see below). The apply boundary still holds (`scripts/tg apply` needs an admin Vault token + cluster RBAC), but **pushing `master` is NOT inert** — the Forgejo→Woodpecker webhook fires `.woodpecker/default.yml` (`event: push, branch: master`, `require_approval: forks` only), which terragrunt-applies changed stacks. `master` is **branch-protected on Forgejo** (force-push disabled for everyone — history is append-only; push + merge whitelists = `viktor` + explicitly granted users, deploy keys allowed). **Allow-then-audit (Viktor, 2026-06-10):** `ebarzin` (emo) is on the whitelist and pushes straight to `master` — no PR gate. 
The tracking burden moves to: (a) **commit messages that record what + why** (the agent instructions in AGENTS.md and the managed claudeMd require the body to paraphrase the user's request), (b) the **`notify-nonadmin-push` Slack audit step** in `.woodpecker/default.yml` — every master push by a non-admin author is posted to Slack (admin pushes are not), and (c) non-admins **never use `[ci skip]`** so every change fires the pipeline (and thus the audit feed). Users NOT on the whitelist fall back to `<user>/<topic>` branches + PRs. **Clones stay fresh automatically** (2026-06-10): the hourly `t3-provision-users` reconcile runs `refresh_user_clone` over every managed clone — the infra clone and any workspace repos (fetch all remotes + fast-forward `master`, ONLY when on master with a clean tree and an upstream — dirty trees and local commits are left alone with a WARN) — and also `wire_forgejo_remote`, which idempotently adds the documented `forgejo` remote + `forgejo/master` upstream to infra clones that predate that contract. `start-claude.sh` does the same freshen at session launch (10s fetch cap per repo so an offline remote never stalls the session; workspace layouts freshen each repo under `~/code`).
**Contribute access (per non-admin, manual — the anca/tripit PAT precedent):**

@ -0,0 +1,226 @@
# Runbook — TripIt external user self-signup (email + passkey)
Implements ADR-0020 (tripit repo): people outside the homelab self-register to
TripIt with **email + a passkey** (no password), are auto-tagged into the
**`TripIt External`** Authentik group, and are fenced to `tripit.viktorbarzin.me`
only. Audience: people Viktor knows; open public registration.
> **Safety model.** Containment is two-layered. (1) **Forward-auth apps** — the
> branch in `stacks/authentik/admin-services-restriction.tf` admits `TripIt
> External` to `tripit.viktorbarzin.me` and denies every other `auth="required"`
> host. (2) **OIDC apps** — the branch does NOT cover OIDC (it bypasses
> forward-auth); External users are contained because every sensitive OIDC app
> already requires a trusted group they do not hold (audit below). The no-lockout
> guarantee is that the group is created **empty**, so the new branch matches
> zero existing users on day one.
## OIDC app authorization audit (2026-06-15, read-only)
A parentless `TripIt External` user holds NONE of these groups, so:
| OIDC app | Requires | External user |
|---|---|---|
| Immich, Grafana, Linkwarden, Cloudflare Access | `Home Server Admins` | DENIED ✓ |
| Forgejo | `Task Submitters` / `Forgejo Users` | DENIED ✓ |
| Headscale | `Headscale Users` | DENIED ✓ |
| wrongmove | `Wrongmove Users` | DENIED ✓ |
| **Vault** | **was OPEN** → bound to `Allow Login Users` in Step 3 | DENIED after Step 3 |
| Kubernetes, Kubernetes Dashboard | OPEN | harmless — apiserver rejects OIDC tokens (idle) |
| TripIt App, Public | OPEN | by design (TripIt's own provider / guest) |
Vault's JWT `default` role grants only Vault's built-in `default` policy (token
self-management, cubbyhole — **no** secret access), so the pre-fix exposure was a
near-powerless token; Step 3 closes it anyway.
---
## Pre-flight gates (STOP if any fails)
1. **`TripIt External` is net-new / empty** (no-lockout precondition):
```
kubectl -n authentik exec -i deploy/goauthentik-server -- ak shell <<'PY'
from authentik.core.models import Group
g = Group.objects.filter(name="TripIt External").first()
print("exists:", bool(g), "members:", g.users.count() if g else 0)
PY
```
Expect `exists: False`. If it exists with members → STOP.
2. **Authentik image pin matches live (B5)** — the policy edit auto-applies the
whole `authentik` stack; a stale pin re-triggers the 2026-06-10 downgrade
boot-storm:
```
kubectl -n authentik get deploy -o custom-columns=N:.metadata.name,IMG:.spec.template.spec.containers[0].image
```
Every `goauthentik`/`ak-outpost` image tag MUST equal
`stacks/authentik/modules/authentik/values.yaml` `global.image.tag`
(currently `2026.2.4`). If they differ → refresh the pin first.
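The image-pin gate above could be automated roughly as follows. This is a sketch under assumptions: the function name `image_pins_match` is hypothetical, it expects the `kubectl ... custom-columns` output on stdin already filtered to the `goauthentik`/`ak-outpost` rows, and it assumes tag-pinned images (a digest-pinned or untagged image would be reported as a mismatch).

```bash
# Hypothetical gate (not part of the repo): read "NAME IMAGE" rows on stdin
# (custom-columns header row included) and fail if any image tag differs
# from the expected pin passed as $1.
image_pins_match() {
  local expected="$1" name image tag rc=0
  while read -r name image; do
    [ "$name" = "N" ] && continue      # skip the custom-columns header row
    [ -n "$image" ] || continue        # skip blank or malformed rows
    tag="${image##*:}"                 # text after the last ':'
    if [ "$tag" != "$expected" ]; then
      echo "MISMATCH $name: $image (want $expected)" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

Piping the `kubectl -n authentik get deploy -o custom-columns=...` output into `image_pins_match 2026.2.4` would then exit non-zero on any stale pin; the expected tag still has to be read from `values.yaml` by hand or by a separate step.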
---
## Step 1 — Terraform (group + fence branch)
Already written on this branch:
- `stacks/authentik/tripit-external.tf` — the empty, parentless group.
- `stacks/authentik/admin-services-restriction.tf` — the prepended fence branch.
**Local plan gate (B4 — CI auto-applies on push with `-auto-approve`, so there is
NO human plan review in the apply path; do it here):**
```
vault login -method=oidc
cd stacks/authentik && ../../scripts/tg plan
```
Confirm the plan is **exactly**:
- `+ authentik_group.tripit_external` (create)
- `~ authentik_policy_expression.admin_services_restriction` (update in place — the
`expression` body gains ONLY the new branch; every other line byte-identical)
- **`Plan: 1 to add, 1 to change, 0 to destroy.`**
ABORT if the plan shows any destroy/replace, any `authentik_provider_*` /
`authentik_outpost` / `authentik_flow*` / `helm_release`, or any other expression
change.
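The plan gate above can be backed by a coarse text check, sketched below. The function name `plan_is_expected` is hypothetical, and the check assumes uncoloured plan output (ANSI colour codes would break the literal match). It is a substring match only, so it supplements reading the plan rather than replacing it.

```bash
# Hypothetical gate (not part of the repo): read plan output on stdin and
# succeed only if the summary line is the expected one and none of the
# resource types named in the ABORT rule appear anywhere in the text.
plan_is_expected() {
  local plan
  plan="$(cat)"
  case "$plan" in
    *'Plan: 1 to add, 1 to change, 0 to destroy.'*) ;;
    *) return 1 ;;
  esac
  case "$plan" in
    *authentik_provider_*|*authentik_outpost*|*authentik_flow*|*helm_release*) return 1 ;;
  esac
  return 0
}
```

The check does not verify that the `expression` diff contains only the new branch; that part of the gate remains a manual read.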
**Apply** (presence-claim courtesy, then push = apply; land human-watched, B5):
```
~/code/scripts/presence claim stack:authentik --purpose "ADR-0020 TripIt External group + fence branch"
# push the branch to master (this triggers CI tg apply on the authentik stack)
```
Watch: GHA → Woodpecker `default.yml` apply → outpost stays healthy
(`kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost` = 2
IPs; an anonymous request to any `auth=required` host still 302s to Authentik).
The branch is inert (empty group) so no access changes yet.
---
## Step 2 — Authentik SMTP (B1, BLOCKER before any flow)
Email verification is the **entire identity boundary** (TripIt trusts the
Authentik email verbatim). Authentik currently has the **default/unconfigured**
transport (`email.host = localhost`), so verification/recovery mail cannot send.
Add to **both** `server.env` and `worker.env` in
`stacks/authentik/modules/authentik/values.yaml` (wire the password from a secret;
the cluster mailserver is what TripIt already relays through —
`mailserver.mailserver.svc`):
```yaml
- { name: AUTHENTIK_EMAIL__HOST, value: "mailserver.mailserver.svc" }
- { name: AUTHENTIK_EMAIL__PORT, value: "587" }
- { name: AUTHENTIK_EMAIL__USE_TLS, value: "true" }
- { name: AUTHENTIK_EMAIL__FROM, value: "noreply@viktorbarzin.me" }
- { name: AUTHENTIK_EMAIL__USERNAME, value: "<relay user>" } # confirm relay creds
- { name: AUTHENTIK_EMAIL__PASSWORD, valueFrom: { secretKeyRef: { name: <secret>, key: <key> } } }
```
**Gate:** after apply, Authentik UI → System → Settings (or an Email stage) →
**Send test email**; it must arrive. Then prove enrollment cannot complete for an
address you do NOT control.
---
## Step 3 — Bind Vault → `Allow Login Users` (close the one open OIDC gap)
Authentik UI → Applications → **Vault** → bind an authorization policy requiring
group **`Allow Login Users`** (the base group every real homelab user inherits;
parentless `TripIt External` is excluded). This changes nothing for existing
users and denies External users at the Vault consent step.
Verify: an External test account (Step 6) cannot complete Vault OIDC login.
---
## Step 4 — Build the flows (Authentik UI; UI-managed per ADR split)
All three flows: designation as noted, no password stage.
**Flow `tripit-enrollment`** (Enrollment):
| Order | Stage | Key settings |
|---|---|---|
| 5 | Captcha | reCAPTCHA **v2 checkbox** keys (v3/invisible fail — see `crowdsec-recaptcha-key-type`) |
| 10 | Identification | email only; **no** `password_stage`; `sources` optional |
| 20 | Email (verification) | activate, blocking — **before** user_write |
| 30 | WebAuthn authenticator setup | `user_verification = required`, `resident_key = required` |
| 40 | User Write | **`create_users_group` = `TripIt External`** (the keystone tag); `user_type = external` |
| 50 | User Login | session as default (`weeks=4`) |
**Flow `tripit-login`** (Authentication, passwordless):
Identification (sets `enrollment_flow`/`recovery_flow`) → Authenticator
Validation (`device_classes = [webauthn]`, `user_verification = required`) → User
Login. Prefer routing a passkey-less email to recovery over minting a credential.
**Flow `tripit-recovery`** (Recovery):
Identification (`pretend_user_exists = on`) → Email (recovery link) → WebAuthn
authenticator setup → User Login. Notify the account on recovery + new-passkey.
> Do **NOT** bind the `brute-force-protection` ReputationPolicy to these flows —
> it denies anonymous users (2026-04-06 regression). The Captcha is the bot gate.
---
## Step 5 — Surface "Sign up"
Recommended: a **TripIt-scoped** signup link / share-invite rather than a global
login-screen button (narrower bot surface). Enrollment URL:
`https://authentik.viktorbarzin.me/if/flow/tripit-enrollment/`.
---
## Step 6 — Verification (before/after — "all access keeps working")
Hosts for the matrix (must be real `auth="required"` default-allow hosts, NOT
`auth="app"` apps like immich/nextcloud which bypass the catch-all):
`tripit`, `family`, `hackmd`, `health` (default-allow) + `terminal` (admin-only).
**Before** (capture per user, no redirect-follow; 200=ALLOW, 302→authentik/403=DENY):
```
COOKIE='authentik_session=<paste for this user>'; for H in tripit family hackmd health terminal; do
printf '%-10s %s\n' "$H" "$(curl -s -o /dev/null -w '%{http_code}' --max-redirs 0 -H "Cookie: $COOKIE" https://$H.viktorbarzin.me/)"; done
```
Representative non-admin: `kadir.tugan@gmail.com` (Wrongmove-only) → tripit/family/hackmd/health ALLOW, terminal DENY. Admin `vbarzin@gmail.com` → all ALLOW.
**After Step 1 apply — regression:** re-run identically; both users' results MUST
be unchanged (diff empty).
**After flows — external smoke test (the security proof):** enrol a throwaway
account via the enrollment URL (email verify + passkey). Confirm it is tagged
`TripIt External`, then with its cookie:
```
for H in tripit family hackmd health terminal frigate; do printf '%-10s %s\n' "$H" \
"$(curl -s -o /dev/null -w '%{http_code}' --max-redirs 0 -H "Cookie: authentik_session=<external>" https://$H.viktorbarzin.me/)"; done
```
Expect **tripit=200, every other host DENY** (family/hackmd/health were ALLOW for
kadir — the contrast is the fence proof). Then:
- **OIDC containment:** with the external account, attempt OIDC login to Vault,
Immich, Forgejo, Grafana → each must be DENIED at the app's own login.
- **Auto-provision:** the TripIt `users` row exists (CNPG primary in ns `dbaas`:
`select id,email from tripit.users where email='<throwaway>'`).
- **Walling-off guard** `AuthentikWallingOffPublicPath` stays green.
**Any 200 on a non-tripit host, or any OIDC app admitting the external account →
ROLLBACK.**
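The pass/fail rule for the external smoke test can be written down as a small sketch. Both function names are hypothetical; the mapping follows the runbook's own legend (200 = ALLOW, 302 or 403 = DENY), and the sketch does not check that a 302 actually points at Authentik.

```bash
# Hypothetical helpers (not part of the repo) for reading the curl matrix.
# classify_code: 200 = ALLOW; 302 and 403 = DENY; anything else is UNKNOWN
# so an unexpected status is never silently treated as a pass.
classify_code() {
  case "$1" in
    200) echo ALLOW ;;
    302|403) echo DENY ;;
    *) echo UNKNOWN ;;
  esac
}

# fence_ok HOST CODE: for a TripIt External session, tripit must ALLOW and
# every other host must DENY.
fence_ok() {
  local verdict
  verdict="$(classify_code "$2")"
  if [ "$1" = "tripit" ]; then
    [ "$verdict" = "ALLOW" ]
  else
    [ "$verdict" = "DENY" ]
  fi
}
```

Feeding each `host code` pair from the loop output into `fence_ok` gives a single non-zero exit as the rollback signal; an UNKNOWN status fails the check for every host, which errs on the side of investigating.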
---
## Step 7 — Standing regression probe (recommended)
Add a permanent `TripIt External` identity to the `blackbox-exporter` guard
(`stacks/monitoring/.../authentik_walloff_probe.tf` pattern): assert 200 on
`tripit.viktorbarzin.me` AND DENY on `family.viktorbarzin.me`. This converts the
"branch stays first" and "user_write keeps the keystone tag" invariants into
automated `#security` alerts.
---
## Rollback
Revert the `admin-services-restriction.tf` expression (delete the branch) and push
(= apply); removing a prepended `if g: return …` is behaviour-preserving on
non-members, restoring prior authz. Disable/delete the throwaway external account
(with the branch gone, a tagged account falls into default-allow). The empty group
may stay (harmless). Plan-gate the revert too.
## Operational invariants
- `TripIt External` stays **parentless** (never under `Allow Login Users`).
- The fence branch stays **first** in `admin-services-restriction`.
- **Never** co-assign `TripIt External` to a trusted/internal user.
- The `tripit-enrollment` user_write **`create_users_group`** setting is the
keystone — re-verify after any flow edit (clearing it makes UNtagged accounts
that fall into default-allow).
- Authentik SMTP is a live dependency of enrollment + recovery.

@ -270,6 +270,43 @@ install_user_claude_token() {
log "shared Claude token -> $user (t3-serve env; restart needed to take effect)"
}
# Re-deploy the managed per-user Claude launcher to ~/start-claude.sh. /etc/skel only
# seeds it at account creation (setup-devvm.sh), so without this a launcher edit never
# reaches EXISTING users — they keep running a stale copy. Copy-if-changed from the repo's
# skel/, owned by the user, 0755. (We deliberately do NOT re-copy .tmux.conf: terminal-lobby
# appends a managed persistence section to each user's ~/.tmux.conf that a re-copy would clobber.)
deploy_user_launcher() {
local user="$1" home src dst
src="$WORKSTATION_DIR/skel/start-claude.sh"
home="$(getent passwd "$user" | cut -d: -f6)"
[[ -n "$home" && -d "$home" && -f "$src" ]] || return 0
dst="$home/start-claude.sh"
cmp -s "$src" "$dst" 2>/dev/null && return 0 # already current -> no churn
if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] deploy launcher -> $dst"; return 0; fi
install -m 0755 "$src" "$dst"
chown "$user:$user" "$dst"
log "deployed start-claude.sh -> $user"
}
# Ensure the per-user NATIVE claude install (the recommended runtime: ~user/.local/bin/claude,
# self-updating) — used by BOTH the terminal launcher AND the user's t3-serve instance. We do
# NOT npm-install claude system-wide (npm/npx isn't the recommended runtime); each user gets
# their own native install. Idempotent: skip if already present. Runs the official native
# installer AS the user (into their ~/.local). Best-effort: a failure WARNs and retries next
# reconcile (start-claude.sh also self-bootstraps the terminal path).
install_user_claude_native() {
local user="$1" home
home="$(getent passwd "$user" | cut -d: -f6)"
[[ -n "$home" && -d "$home" ]] || return 0
[[ -x "$home/.local/bin/claude" ]] && return 0 # already native -> done
if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] native claude install -> $user"; return 0; fi
if runuser -u "$user" -- bash -lc 'curl -fsSL https://claude.ai/install.sh | bash' >/dev/null 2>&1; then
log "installed native claude -> $user"
else
log "WARN: native claude install failed for $user (retries next reconcile)"
fi
}
[[ $EUID -eq 0 ]] || { echo "t3-provision-users: must run as root" >&2; exit 1; }
for bin in python3 jq; do command -v "$bin" >/dev/null || { echo "missing $bin" >&2; exit 1; }; done
[[ -f "$ROSTER" && -f "$ENGINE" ]] || { echo "roster/engine not under $WORKSTATION_DIR" >&2; exit 1; }
@ -346,8 +383,10 @@ while IFS=$'\t' read -r os_user tier shell groups_csv code_layout repos_csv; do
fi
install_user_kubeconfig "$os_user"
install_user_claude_token "$os_user"
deploy_user_launcher "$os_user" # keep ~/start-claude.sh current (skel only seeds new accounts)
fi
refresh_codex_mirror "$os_user" # all tiers — mirror of the managed claudeMd
install_user_claude_native "$os_user" # all tiers — per-user native claude (terminal + t3); no npm/npx
done < <(jq -r '.accounts[] | [.os_user, .tier, .shell, (if (.groups|length)==0 then "-" else (.groups|join(",")) end), .code_layout, (if (.repos|length)==0 then "-" else (.repos|join(",")) end)] | @tsv' "$desired_file")
# 5) per-user .env (sticky port) + enable t3-serve@

@ -21,7 +21,13 @@ export DEBIAN_FRONTEND=noninteractive
apt-get update -qq
apt-get install -y "${PKGS[@]}" >/dev/null
# 2) node >= 18 + claude-code (claude-code requires node >= 18)
# 2) node >= 18 — needed for the t3 CLI (npm-global, below). NOT for claude-code:
# claude-code is the per-user NATIVE install (the recommended, self-updating
# ~/.local/bin/claude), provisioned per user by t3-provision-users
# (install_user_claude_native) and self-bootstrapped by start-claude.sh on first launch.
# We deliberately do NOT `npm install -g @anthropic-ai/claude-code` — npm/npx is not the
# recommended runtime, and a system-wide npm copy just shadows/duplicates the per-user
# native installs everyone auto-migrates to anyway.
need_node=1
if command -v node >/dev/null; then
[[ "$(node -v | sed 's/^v\([0-9]*\).*/\1/')" -ge 18 ]] && need_node=0
@ -31,14 +37,23 @@ if [[ $need_node -eq 1 ]]; then
curl -fsSL https://deb.nodesource.com/setup_22.x | bash - >/dev/null
apt-get install -y nodejs >/dev/null
fi
# Detect the GLOBAL npm package, NOT whatever `claude` resolves to on PATH: the admin's
# personal ~/.local/bin/claude shadows it, so `command -v claude` silently skipped the
# system-wide install — leaving /usr/lib/node_modules/@anthropic-ai empty and fresh
# non-admins with no claude (they only worked because the admin's install was on PATH).
if ! npm ls -g --depth=0 @anthropic-ai/claude-code >/dev/null 2>&1; then
log "npm: installing @anthropic-ai/claude-code (system-wide)"
npm install -g @anthropic-ai/claude-code >/dev/null
fi
# 2a) ~/.local/bin on PATH for all LOGIN shells (machine-wide). The native claude install
# lives at ~/.local/bin; this guarantees login shells (SSH, etc.) find it regardless of
# whether the per-user native-installer rc edit ran. (The terminal launcher sets PATH
# itself, and t3-serve@.service hard-sets PATH in the unit.)
install -d -m 0755 /etc/profile.d
cat > /etc/profile.d/10-local-bin.sh <<'PROFILE_EOF'
# Native per-user installs (e.g. claude-code) live in ~/.local/bin — put it on PATH.
# Guarded so it never duplicates. Sourced by login shells (bash via /etc/profile; zsh
# login via /etc/zsh/zprofile -> /etc/profile).
case ":$PATH:" in
*":$HOME/.local/bin:"*) ;;
*) export PATH="$HOME/.local/bin:$PATH" ;;
esac
PROFILE_EOF
chmod 0644 /etc/profile.d/10-local-bin.sh
log "/etc/profile.d/10-local-bin.sh (~/.local/bin on PATH for login shells)"
# 2b) t3 (the per-user coding surface) — PINNED, never nightly/latest. t3 is pre-1.0 and
# ships breaking auth-schema + bootstrap-API changes our t3-dispatch can't follow blind
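Review note: the `case ":$PATH:"` guard written to /etc/profile.d/10-local-bin.sh (and repeated in start-claude.sh) prepends ~/.local/bin only when it is not already a PATH component. The property that matters is idempotence: sourcing the snippet twice must not grow PATH. A minimal Python sketch of that check, assuming a hypothetical helper name `prepend_once` and a made-up home directory (neither is part of the commit):

```python
def prepend_once(path: str, entry: str) -> str:
    """Prepend `entry` to a colon-separated PATH unless it is already a component.

    Mirrors the shell guard: wrap PATH in colons and look for ":entry:".
    `prepend_once` is a hypothetical name used only for this illustration.
    """
    if f":{entry}:" in f":{path}:":
        return path  # already present somewhere in PATH: leave it untouched
    return f"{entry}:{path}"


# Hypothetical user directory, for illustration only.
once = prepend_once("/usr/bin:/bin", "/home/alice/.local/bin")
twice = prepend_once(once, "/home/alice/.local/bin")
```

Wrapping both sides in colons is what keeps a partial match such as `/home/alice/.local/bin-old` from being mistaken for the real entry.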


@ -11,6 +11,14 @@ echo " Starting Claude Code in $HOME/code ..."
echo " (Right-click for tmux menu, or Ctrl+B then | or - to split)"
echo ""
# The native claude install lives in ~/.local/bin. This launcher runs in tmux's non-login
# env, which does NOT source the user's shell rc (where the native installer added it to
# PATH) — so `claude` would appear missing here. Put it on PATH ourselves; guarded/idempotent.
case ":$PATH:" in
*":$HOME/.local/bin:"*) ;;
*) export PATH="$HOME/.local/bin:$PATH" ;;
esac
name_args=()
if [ -n "${TMUX:-}" ]; then
sess="$(tmux display-message -p '#{session_name}' 2>/dev/null)"
@ -42,14 +50,48 @@ else
done
fi
# Prefer the system-wide `claude` (installed by setup-devvm.sh); fall back to npx.
# Run the NATIVE `claude` (the recommended install: ~/.local/bin/claude, self-updating).
# No npm/npx. If the native binary is missing (a fresh account before the hourly reconcile
# has provisioned it), bootstrap it with the official native installer, then run it.
launch() {
if command -v claude >/dev/null 2>&1; then
claude "$@"
if ! command -v claude >/dev/null 2>&1; then
echo " Installing Claude Code (native) for $(id -un)"
curl -fsSL https://claude.ai/install.sh | bash || return 127
export PATH="$HOME/.local/bin:$PATH"
fi
claude "$@"
}
# Re-assert Claude Code's first-run onboarding flag before launch. ~/.claude.json is a
# SINGLE file that ALL of a user's concurrent claude processes (this terminal, their
# t3-serve instance, agent/SDK sessions) read-modify-write; a stale writer periodically
# drops top-level keys — including hasCompletedOnboarding — which throws the next
# interactive session back to the "Choose the text style" wizard even though the user is
# fully logged in (credentials live in the SEPARATE ~/.claude/.credentials.json, which is
# never affected). Idempotent, runs as the user right before launch, never clobbers other
# keys. Best-effort: no-op if jq is missing or the file is empty/corrupt (claude self-heals).
ensure_onboarding() {
command -v jq >/dev/null 2>&1 || return 0
local cfg="$HOME/.claude.json" ver tmp
ver="$(claude --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)"
if [ -s "$cfg" ]; then
jq -e . "$cfg" >/dev/null 2>&1 || return 0 # corrupt -> leave for claude
[ "$(jq -r '.hasCompletedOnboarding // false' "$cfg")" = "true" ] && return 0 # already set -> no write
elif [ -e "$cfg" ]; then
return 0 # empty (mid-write?) -> leave it
fi
tmp="$(mktemp "${cfg}.XXXXXX")" || return 0
if [ -f "$cfg" ]; then
jq --arg v "$ver" '.hasCompletedOnboarding = true
| (if $v != "" then .lastOnboardingVersion = $v else . end)' "$cfg" > "$tmp" 2>/dev/null \
&& chmod 600 "$tmp" && mv "$tmp" "$cfg" || rm -f "$tmp"
else
npx @anthropic-ai/claude-code "$@"
jq -n --arg v "$ver" '{hasCompletedOnboarding: true}
+ (if $v != "" then {lastOnboardingVersion: $v} else {} end)' > "$tmp" 2>/dev/null \
&& chmod 600 "$tmp" && mv "$tmp" "$cfg" || rm -f "$tmp"
fi
}
ensure_onboarding
# Deliberately not `exec` so we can branch on the exit code: clean quit ends the
# pane (ttyd closes the terminal); a crash drops to a shell so the tmux session
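Review note: `ensure_onboarding` has four outcomes (file missing, file empty, file corrupt, flag already set) before it ever writes, and it writes through a temp file plus rename so concurrent readers never see a partial file. The sketch below restates that decision table in Python so it can be unit-tested in isolation. It is an illustration under assumptions, not the launcher's implementation: the function name is reused for readability, the boolean return value is an addition, and the extra "not a JSON object" guard has no direct counterpart in the shell version.

```python
import json
import os
import tempfile


def ensure_onboarding(cfg_path: str, version: str = "") -> bool:
    """Return True only if the config file was (re)written.

    Illustrative restatement of the shell function's branches; the return
    value exists purely to make the branches observable in a test.
    """
    data = {}
    if os.path.exists(cfg_path):
        if os.path.getsize(cfg_path) == 0:
            return False  # empty (possibly mid-write): leave it alone
        try:
            with open(cfg_path, encoding="utf-8") as fh:
                data = json.load(fh)
        except ValueError:
            return False  # corrupt JSON: leave it for claude to repair
        if not isinstance(data, dict):
            return False  # unexpected shape (assumption: treat as untouchable)
        if data.get("hasCompletedOnboarding") is True:
            return False  # already set: no write

    data["hasCompletedOnboarding"] = True
    if version:
        data["lastOnboardingVersion"] = version

    # Write to a temp file in the same directory, then rename over the target,
    # so other processes never observe a half-written config.
    fd, tmp = tempfile.mkstemp(
        prefix=os.path.basename(cfg_path) + ".",
        dir=os.path.dirname(cfg_path) or ".",
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh)
        os.chmod(tmp, 0o600)
        os.replace(tmp, cfg_path)
    except OSError:
        if os.path.exists(tmp):
            os.remove(tmp)
        return False
    return True
```

As in the shell version, existing top-level keys are preserved: the flag is merged into whatever object is already there rather than replacing it.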


@ -6,5 +6,5 @@ variable "tls_secret_name" {
variable "image_tag" {
type = string
default = "latest"
description = "android-emulator image tag at forgejo.viktorbarzin.me/viktor/android-emulator. Built by GHA (.github/workflows/build-android-emulator.yml) -> ghcr.io/viktorbarzin/android-emulator on changes to stacks/android-emulator/docker/ (ADR-0002). :latest tracks the newest build."
description = "android-emulator image tag at ghcr.io/viktorbarzin/android-emulator. Built by GHA (.github/workflows/build-android-emulator.yml) on changes to stacks/android-emulator/docker/ (ADR-0002). :latest tracks the newest build."
}

stacks/anisette/main.tf (Normal file, 189 lines added)

@ -0,0 +1,189 @@
# anisette: self-hosted Apple anisette-data server for SideStore/AltStore.
#
# Purpose (infra issue #40): the TripIt iOS Shell is sideloaded with SideStore
# using a free Apple ID. SideStore needs an "anisette" server to broker the
# Apple-ID auth dance; the public community anisette servers see every login,
# so we run our own. Stateless HTTP service on a stable INTERNAL endpoint
# (anisette.viktorbarzin.lan) that SideStore points at.
#
# Image: Dadoum/anisette-v3-server, the de-facto standard anisette-v3 server
# for SideStore/AltStore (the same project SideStore's own docs point at).
# Upstream publishes ONLY a mutable :latest tag (no GitHub releases, no semver,
# no date/sha tags; verified 2026-06-14), so we pin by MANIFEST DIGEST instead
# (immutable, honours the "never :latest" rule). DockerHub is pulled
# transparently via the registry-VM pull-through cache, same as echo/cyberchef.
# To bump: `docker buildx imagetools inspect dadoum/anisette-v3-server:latest`,
# then replace the digest below.
#
# Stateless: the container caches Apple provisioning libraries under
# /home/Alcoholic/.config/anisette-v3/lib (a regenerable download, re-fetched
# if absent, and per upstream issue #23 it does NOT preserve client auth across
# restarts anyway). So an emptyDir is the honest fit: keeps that path writable
# without taking on a backup-pipeline obligation. No PVC, no Vault secret.
variable "tls_secret_name" {
type = string
sensitive = true
}
resource "kubernetes_namespace" "anisette" {
metadata {
name = "anisette"
labels = {
"istio-injection" : "disabled"
tier = local.tiers.aux
}
}
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
}
module "tls_secret" {
source = "../../modules/kubernetes/setup_tls_secret"
namespace = kubernetes_namespace.anisette.metadata[0].name
tls_secret_name = var.tls_secret_name
}
resource "kubernetes_deployment" "anisette" {
metadata {
name = "anisette"
namespace = kubernetes_namespace.anisette.metadata[0].name
labels = {
app = "anisette"
tier = local.tiers.aux
}
}
# anisette downloads + initializes Apple's CoreADI provisioning library on
# first start (slow, memory-spiky). wait_for_rollout=false so the apply never
# blocks on (and never strands out of terraform state) a pod that is still
# warming up (mirrors tts/llama-cpp). Pod health is still gated by the
# readiness probe below, so the Service only routes once it's actually up.
wait_for_rollout = false
spec {
replicas = 1
selector {
match_labels = {
app = "anisette"
}
}
template {
metadata {
labels = {
app = "anisette"
}
annotations = {
# Diun notify-only watch. Upstream tags only :latest, so watch the
# digest of :latest rather than a semver pattern.
"diun.enable" = "true"
"diun.watch_repo" = "false"
"diun.include_tags" = "^latest$"
}
}
spec {
container {
# Pinned by digest: upstream ships only a mutable :latest (no versioned tags).
# The `docker.io/` prefix is REQUIRED, not cosmetic: the Kyverno
# require-trusted-registries policy allowlists `docker.io/*` but NOT a
# bare `dadoum/*` prefix (only enumerated DockerHub user repos like
# mendhak/*, mpepping/* are listed in
# stacks/kyverno/modules/kyverno/security-policies.tf). A bare
# `dadoum/anisette-v3-server@...` is denied at admission; the explicit
# docker.io/ registry matches the allowlist and still pulls via the
# 10.0.20.10 pull-through cache.
image = "docker.io/dadoum/anisette-v3-server@sha256:1e20384985d3c49965f444bef39d627768dacc39ea0dca91f2a535edb7591ba3"
name = "anisette"
port {
name = "http"
container_port = 6969
}
# The image runs as the non-root user "Alcoholic" and writes its
# provisioning-library cache here; back it with an emptyDir so the
# path is writable (stateless: wiped on restart, re-downloaded).
volume_mount {
name = "provisioning-cache"
mount_path = "/home/Alcoholic/.config/anisette-v3/lib"
}
resources {
requests = {
cpu = "10m"
memory = "256Mi"
}
limits = {
# anisette downloads + initializes Apple's CoreADI provisioning
# library at startup, which spikes past 128Mi -> OOMKilled (exit
# 137) before it can bind :6969. 512Mi gives headroom; steady
# state is much lower.
memory = "512Mi"
}
}
readiness_probe {
http_get {
path = "/"
port = 6969
}
period_seconds = 15
initial_delay_seconds = 5
}
liveness_probe {
http_get {
path = "/"
port = 6969
}
period_seconds = 30
failure_threshold = 6
}
}
volume {
name = "provisioning-cache"
empty_dir {}
}
}
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
]
}
}
resource "kubernetes_service" "anisette" {
metadata {
name = "anisette"
namespace = kubernetes_namespace.anisette.metadata[0].name
labels = {
"app" = "anisette"
}
}
spec {
selector = {
app = "anisette"
}
port {
name = "http"
port = "80"
target_port = "6969"
}
}
}
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
# auth = "none": SideStore is a native iOS client; it can't replay the
# Authentik forward-auth cookie dance, so Authentik would break it (same
# reasoning as android-emulator's adb). Internal-only: anisette.viktorbarzin.lan,
# allow_local_access_only locks it to the LAN, and it brokers no user data of
# ours (it just relays Apple-ID anisette data). Never publicly exposed.
auth = "none"
namespace = kubernetes_namespace.anisette.metadata[0].name
name = "anisette"
root_domain = "viktorbarzin.lan"
tls_secret_name = var.tls_secret_name
allow_local_access_only = true
ssl_redirect = false
extra_annotations = {
"gethomepage.dev/enabled" = "false"
}
}
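Review note: the container comment in this file explains why the `docker.io/` prefix is load-bearing: the registry allowlist matches `docker.io/*` and a few enumerated DockerHub user prefixes, but nothing that covers a bare `dadoum/...` reference. The sketch below illustrates that matching behaviour with Python's `fnmatchcase` as a stand-in for the policy's wildcard matching. That substitution is an assumption (Kyverno's own matcher may differ in detail), and the allowlist holds only the three patterns the comment names, not the real policy's full list.

```python
from fnmatch import fnmatchcase

# Only the patterns named in the anisette container comment; the real policy
# in security-policies.tf lists more entries (not reproduced here).
ALLOWLIST = ["docker.io/*", "mendhak/*", "mpepping/*"]


def admitted(image: str) -> bool:
    """True if the image reference matches any allowlisted wildcard pattern."""
    return any(fnmatchcase(image, pattern) for pattern in ALLOWLIST)


DIGEST = "sha256:1e20384985d3c49965f444bef39d627768dacc39ea0dca91f2a535edb7591ba3"
bare = f"dadoum/anisette-v3-server@{DIGEST}"
qualified = f"docker.io/dadoum/anisette-v3-server@{DIGEST}"
```

Under these assumptions `qualified` is admitted and `bare` is not, which is the behaviour the comment describes at admission time.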

stacks/anisette/secrets (Symbolic link, 1 line added)

@ -0,0 +1 @@
../../secrets


@ -0,0 +1,8 @@
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}


@ -49,6 +49,21 @@ resource "authentik_policy_expression" "admin_services_restriction" {
host = request.context.get("host", "")
# TripIt External containment fence (ADR-0020 in the tripit repo). Publicly
# self-enrolled TripIt users (group "TripIt External", assigned by the
# tripit-enrollment flow's user_write) may reach tripit.viktorbarzin.me and
# NOTHING else. MUST be the FIRST host-dispatch branch: it is a request.user
# predicate that must dominate every host branch below, ESPECIALLY the
# default-allow `if host not in ADMIN_ONLY_HOSTS: return True`; placed after
# it, a tagged user would slip into other hosts. Safe to add: the group is
# net-new and created EMPTY, so this matches zero existing principals (no
# lockout). The fence is forward-auth ONLY; OIDC apps (Vault, Immich, ...)
# contain External users via their own per-app group bindings; see
# docs/runbooks/tripit-external-signup.md. NEVER co-assign "TripIt External"
# to a trusted/internal user (this branch would fence them out of admin hosts).
if ak_is_group_member(request.user, name="TripIt External"):
return host == "tripit.viktorbarzin.me"
# t3 Workstation edge gate: only members of "T3 Users" may reach t3.
# Placed BEFORE the ADMIN_ONLY_HOSTS early-return (t3 is intentionally not in
# that set: it must not require Home-Server-Admins, just T3 Users membership).
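Review note: the comment insists the "TripIt External" branch must come before the default-allow branch. A small Python model makes the reason concrete. Everything here is a hedged sketch: `ADMIN_ONLY_HOSTS` is a placeholder set (the real one is not shown in this hunk), the example hostnames are invented, the final "Home Server Admins" check is an assumption about what the admin branch does, and group membership is modelled as a plain set rather than `ak_is_group_member`.

```python
# Hypothetical stand-ins: the real ADMIN_ONLY_HOSTS set and the rest of the
# policy body are not visible in this hunk.
ADMIN_ONLY_HOSTS = {"admin.example.internal"}
EXTERNAL_GROUP = "TripIt External"
TRIPIT_HOST = "tripit.viktorbarzin.me"


def fence_first(host: str, groups: set) -> bool:
    """User predicate evaluated before any host branch (the order in the commit)."""
    if EXTERNAL_GROUP in groups:
        return host == TRIPIT_HOST
    if host not in ADMIN_ONLY_HOSTS:
        return True  # default-allow for non-admin hosts
    return "Home Server Admins" in groups  # assumed admin-host rule


def fence_last(host: str, groups: set) -> bool:
    """Same branches in the wrong order: the default-allow returns first."""
    if host not in ADMIN_ONLY_HOSTS:
        return True
    if EXTERNAL_GROUP in groups:
        return host == TRIPIT_HOST
    return "Home Server Admins" in groups
```

With the wrong ordering, an External user asking for any non-admin host is allowed before the fence is ever consulted, which is exactly the leak the comment warns about.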


@ -0,0 +1,22 @@
# "TripIt External" group: containment anchor for publicly self-enrolled TripIt
# users (ADR-0020 in the tripit repo). Members are admitted to
# tripit.viktorbarzin.me ONLY and denied every other *.viktorbarzin.me
# forward-auth host by the prepended branch in admin-services-restriction.tf.
#
# Created EMPTY and PARENTLESS, on purpose:
# * EMPTY: the no-lockout guarantee. Zero members at apply time => the
# prepended policy branch matches zero existing principals => it cannot
# change anyone's authorization (contrast authentik_group "T3 Users", which
# is created WITH members atomically because THAT gate's safety property is
# the opposite). Membership is assigned at RUNTIME by the tripit-enrollment
# flow's user_write "Create users group" option (UI-managed per the ADR
# management split). Terraform owns only the group's EXISTENCE.
# * PARENTLESS: do NOT make this a child of "Allow Login Users". The sensitive
# OIDC apps gate on "Home Server Admins" / "Headscale Users" / "Wrongmove
# Users" (children of "Allow Login Users") or, for Vault, on "Allow Login
# Users" itself (bound as part of ADR-0020). Keeping External out of that
# tree is what stops these users reaching OIDC apps; mirrors guest.tf, which
# keeps the guest group out of "Allow Login Users" for the same reason.
resource "authentik_group" "tripit_external" {
name = "TripIt External"
}


@ -0,0 +1,27 @@
# Vault OIDC authorization fence (ADR-0020). The "Vault" Authentik application had
# NO authorization binding (audit 2026-06-15: any authenticated identity could
# complete Vault OIDC login and receive Vault's built-in `default`-policy token:
# token self-management/cubbyhole, no secret access, but still more than an
# outside user should hold). Bind it to "Allow Login Users" so only established
# homelab users can log in: they inherit that base group via its children
# (Home Server Admins / Headscale Users / Wrongmove Users; verified live that
# `User.all_groups()` includes the parent), while publicly self-enrolled
# "TripIt External" users (deliberately PARENTLESS, so NOT in Allow Login Users)
# are denied at the Vault consent step. Closes the one OIDC app the forward-auth
# fence cannot reach; the other sensitive OIDC apps already bind a trusted group.
#
# The Vault application itself stays UI-managed (like the other OIDC apps); this
# adds ONLY the authorization binding. policy_engine_mode on the app is "any", so
# one group binding == membership in that group is required to authorize.
#
# UUIDs are PINNED as literals: this provider version has NO
# `data "authentik_application"` data source (CI pipeline 198 failed on it), and
# both objects are UI-managed and stable. To re-fetch if either is recreated, run
# `ak shell` in the goauthentik-server pod and read
# `Application.objects.get(name="Vault").pbm_uuid` and
# `Group.objects.get(name="Allow Login Users").group_uuid`.
resource "authentik_policy_binding" "vault_allow_login_users" {
target = "fe5698e3-b6b1-4475-98fa-ce2bae22f4dd" # Authentik application "Vault" (pbm_uuid)
group = "b4823cd7-8ed8-4d2f-8f94-bc285138f853" # group "Allow Login Users" (group_uuid)
order = 0
}


@ -427,7 +427,7 @@ resource "kubernetes_cron_job_v1" "mysql-backup" {
failed_jobs_history_limit = 5
schedule = "30 0 * * *"
# schedule = "* * * * *"
starting_deadline_seconds = 10
starting_deadline_seconds = 600
successful_jobs_history_limit = 10
job_template {
metadata {}
@ -519,7 +519,7 @@ resource "kubernetes_cron_job_v1" "mysql-backup-per-db" {
concurrency_policy = "Replace"
failed_jobs_history_limit = 3
schedule = "45 0 * * *"
starting_deadline_seconds = 10
starting_deadline_seconds = 600
successful_jobs_history_limit = 3
job_template {
metadata {}
@ -1607,7 +1607,12 @@ resource "kubernetes_cron_job_v1" "postgresql-backup" {
failed_jobs_history_limit = 5
schedule = "0 0 * * *"
# schedule = "* * * * *"
starting_deadline_seconds = 10
# 600s (was 10s): a 10s deadline silently DROPPED the 2026-06-13 00:00 run
# when the CronJob controller was late at the midnight backup/IO-storm tick,
# leaving the last full dump 37h old (fired PostgreSQLBackupStale). 600s lets
# a brief controller lag still launch the job. Same fix on the other three
# dbaas backup crons (they share the midnight window).
starting_deadline_seconds = 600
successful_jobs_history_limit = 10
job_template {
metadata {}
@ -1695,7 +1700,7 @@ resource "kubernetes_cron_job_v1" "postgresql-backup-per-db" {
concurrency_policy = "Replace"
failed_jobs_history_limit = 3
schedule = "15 0 * * *"
starting_deadline_seconds = 10
starting_deadline_seconds = 600
successful_jobs_history_limit = 3
job_template {
metadata {}
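Review note: the 10s -> 600s change across the four backup CronJobs is easiest to reason about as a simple lateness check: a run that the controller gets to later than `starting_deadline_seconds` after its scheduled time is counted as missed rather than started. The Python sketch below models only that comparison. The 45-second lag is a made-up figure for illustration (the commit does not state how late the controller actually was), and `still_starts` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def still_starts(scheduled: datetime, noticed: datetime, deadline_seconds: int) -> bool:
    """True if a run noticed late is still launched under the given deadline."""
    return noticed - scheduled <= timedelta(seconds=deadline_seconds)


# The 00:00 run from the incident; the lag below is hypothetical.
scheduled = datetime(2026, 6, 13, 0, 0, 0)
noticed = scheduled + timedelta(seconds=45)

dropped_with_old_value = not still_starts(scheduled, noticed, 10)
launched_with_new_value = still_starts(scheduled, noticed, 600)
```

The trade-off is that a larger deadline lets a delayed backup start up to ten minutes late; the staggered schedules in this file (00:00, 00:15, 00:30, 00:45) leave room for that.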


@ -128,7 +128,7 @@ resource "kubernetes_deployment" "f1-stream" {
}
spec {
container {
image = "forgejo.viktorbarzin.me/viktor/f1-stream:${var.image_tag}"
image = "ghcr.io/viktorbarzin/f1-stream:${var.image_tag}"
image_pull_policy = "Always"
name = "f1-stream"
# Right-sized 2026-06-05: was 1Gi (bundled-Chromium era). The image is


@ -11,6 +11,12 @@ resource "kubernetes_namespace" "forgejo" {
"istio-injection" : "disabled"
tier = local.tiers.edge
"keel.sh/enrolled" = "true"
# Opt out of the auto-generated tier-3-edge ResourceQuota (caps
# requests.memory at 4Gi). Forgejo's own pod requests 4Gi (the
# git + OCI-registry backbone, Guaranteed QoS), which pegged that
# tier quota at 100% and fired KubeQuotaAlmostFull. The
# forgejo-specific quota below gives headroom. Same pattern as dbaas.
"resource-governance/custom-quota" = "true"
}
}
lifecycle {
@ -19,6 +25,26 @@ resource "kubernetes_namespace" "forgejo" {
}
}
# Custom ResourceQuota replaces the tier-3-edge auto quota (opted out via the
# resource-governance/custom-quota label above). requests.memory is 8Gi so the
# 4Gi Forgejo pod sits at ~50% (clears KubeQuotaAlmostFull + the healthcheck
# resourcequota check) with room for a transient migration/sidecar pod. To
# raise Forgejo's memory limit past 4Gi later, bump requests.memory here too.
resource "kubernetes_resource_quota" "forgejo" {
metadata {
name = "forgejo-quota"
namespace = kubernetes_namespace.forgejo.metadata[0].name
}
spec {
hard = {
"requests.cpu" = "4"
"requests.memory" = "8Gi"
"limits.memory" = "32Gi"
pods = "30"
}
}
}
module "tls_secret" {
source = "../../modules/kubernetes/setup_tls_secret"
namespace = kubernetes_namespace.forgejo.metadata[0].name
@ -168,19 +194,29 @@ resource "kubernetes_deployment" "forgejo" {
name = "data"
mount_path = "/data"
}
# Bumped 1Gi -> 3Gi 2026-06-09: Forgejo was OOMKilled (exit 137)
# under registry-push load from in-cluster CI builds (tripit
# buildkit pushes large layers into the OCI registry). VPA
# upperBound reads ~1.5Gi, but that's suppressed by the 1Gi cap it
# kept OOMing against; size for the push spike, not steady-state.
# Bumped 1Gi -> 3Gi 2026-06-09, then 3Gi -> 4Gi 2026-06-13.
# OOMKilled again (exit 137) at the 3Gi cap on 2026-06-13 (2
# restarts; briefly took the git remote + OCI registry down and
# spiked ingress TTFB/4xx). Steady-state ~2.2Gi but it spiked past
# the 3Gi cap. 4Gi is the CEILING here: the forgejo namespace
# tier-quota caps requests.memory at 4Gi and Guaranteed QoS means
# request == limit, so a pod can request at most 4Gi. A first
# attempt at 6Gi was REJECTED (FailedCreate: exceeded quota) and
# left forgejo with 0 pods until reverted -- do NOT raise memory
# past 4Gi without ALSO raising the tier-quota. The 6/9 OOM driver
# (tripit buildkit registry pushes) is gone now that the Forgejo
# registry was frozen + emptied 2026-06-13 (ADR-0002, ghcr), so the
# remaining spike is git ops / integrity-probe catalog walk / a
# possible leak; 4Gi should suffice. If it still OOMs, raise the
# tier-quota and this limit together.
# requests=limits (Guaranteed QoS) per the repo memory convention.
resources {
requests = {
cpu = "15m"
memory = "3Gi"
memory = "4Gi"
}
limits = {
memory = "3Gi"
memory = "4Gi"
}
}
port {
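Review note: the quota comments in this file rest on two bits of arithmetic: a 4Gi request against a 4Gi `requests.memory` cap is 100% utilisation (hence KubeQuotaAlmostFull), and a 6Gi request against that same cap is rejected outright, whereas the new 8Gi cap puts the pod at about 50%. A small Python sketch of those numbers, assuming hypothetical helper names and handling only the `Gi` suffix used here:

```python
def gi(value: str) -> int:
    """Parse a quantity such as '4Gi' into bytes (only the Gi suffix is handled)."""
    if not value.endswith("Gi"):
        raise ValueError("this sketch handles Gi quantities only")
    return int(value[:-2]) * 1024 ** 3


def utilisation(requested: str, hard: str) -> float:
    """Fraction of the quota's hard limit consumed by the request."""
    return gi(requested) / gi(hard)


def admits(requested: str, hard: str) -> bool:
    """True if a single pod with this request fits under the hard limit."""
    return gi(requested) <= gi(hard)
```

For example, `utilisation("4Gi", "8Gi")` is 0.5 and `admits("6Gi", "4Gi")` is false, matching the FailedCreate described in the resources comment.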


@ -9,7 +9,7 @@ resource "kubernetes_namespace" "health" {
metadata {
name = "health"
labels = {
tier = local.tiers.aux
tier = local.tiers.aux
"keel.sh/enrolled" = "true"
}
}
@ -128,6 +128,15 @@ resource "kubernetes_deployment" "health" {
name = "COOKIE_SECURE"
value = "true"
}
env {
# ADR-0008 (health repo): identity for the internal LAN test host.
# Only reached when no X-authentik-email header is present, i.e. via
# the auth="none" test ingress below. The public host's forward-auth
# fails closed, so requests arriving there always carry the real
# header and never fall back to this value.
name = "DEV_AUTH_EMAIL"
value = "vbarzin@gmail.com"
}
volume_mount {
name = "uploads"
@ -197,6 +206,15 @@ module "ingress" {
name = "health"
tls_secret_name = var.tls_secret_name
max_body_size = "100m"
# The redesigned SPA bursts well past the default 10/50 limiter on each page
# load (shell + fonts + a 5-8 call API burst). Swap the shared limiter for a
# health-specific one (100/1000), mirroring tripit/immich/actualbudget.
# The ref MUST carry the middleware's namespace prefix: the CRD lives in the
# `traefik` ns, so it's `traefik-health-rate-limit@kubernetescrd` (same form as
# traefik-tripit-rate-limit). Without the prefix Traefik can't resolve it and
# 404s the whole router.
skip_default_rate_limit = true
extra_middlewares = ["traefik-health-rate-limit@kubernetescrd"]
extra_annotations = {
"gethomepage.dev/enabled" = "true"
"gethomepage.dev/name" = "Health"
@ -207,6 +225,30 @@ module "ingress" {
}
}
# https://health-test.viktorbarzin.lan: internal LAN-only test host for
# automated/E2E testing + manual screenshots without the Authentik SSO dance
# (ADR-0008). Same `health` deployment; acts as DEV_AUTH_EMAIL=vbarzin@gmail.com.
module "ingress_test" {
source = "../../modules/kubernetes/ingress_factory"
# auth = "none": LAN-only (allow_local_access_only) test host, no public
# exposure; the public health.viktorbarzin.me ingress above stays
# auth="required". No user data gate here by design: it serves the real app
# as DEV_AUTH_EMAIL since no X-authentik-email is injected (ADR-0008).
auth = "none"
namespace = kubernetes_namespace.health.metadata[0].name
name = "health-test"
root_domain = "viktorbarzin.lan"
service_name = kubernetes_service.health.metadata[0].name
tls_secret_name = var.tls_secret_name
allow_local_access_only = true
ssl_redirect = false
max_body_size = "100m"
anti_ai_scraping = false
extra_annotations = {
"gethomepage.dev/enabled" = "false"
}
}
resource "kubernetes_manifest" "external_secret_db" {
manifest = {
apiVersion = "external-secrets.io/v1beta1"


@ -1,7 +1,7 @@
FROM node:22-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
RUN npm install --no-audit --no-fund
COPY . .
RUN npm run build

File diff suppressed because it is too large


@ -0,0 +1,24 @@
{
"name": "k8s-portal",
"private": true,
"version": "0.0.1",
"type": "module",
"scripts": {
"dev": "vite dev",
"build": "vite build",
"preview": "vite preview",
"prepare": "svelte-kit sync || echo ''",
"check": "svelte-kit sync && svelte-check --tsconfig ./tsconfig.json",
"check:watch": "svelte-kit sync && svelte-check --tsconfig ./tsconfig.json --watch"
},
"devDependencies": {
"@sveltejs/adapter-auto": "^7.0.0",
"@sveltejs/adapter-node": "^5.5.3",
"@sveltejs/kit": "^2.50.2",
"@sveltejs/vite-plugin-svelte": "^6.2.4",
"svelte": "^5.49.2",
"svelte-check": "^4.3.6",
"typescript": "^5.9.3",
"vite": "^7.3.1"
}
}


@ -9,7 +9,7 @@ resource "kubernetes_namespace" "k8s_portal" {
metadata {
name = "k8s-portal"
labels = {
tier = var.tier
tier = var.tier
"keel.sh/enrolled" = "true"
}
}
@ -40,6 +40,15 @@ resource "kubernetes_deployment" "k8s_portal" {
metadata {
name = "k8s-portal"
namespace = kubernetes_namespace.k8s_portal.metadata[0].name
# ADR-0002 / no-local-builds: image now GHA-built -> ghcr:latest
# (.github/workflows/build-k8s-portal.yml). Keel polls ghcr:latest and rolls
# this deployment (replaces the removed Woodpecker in-cluster build+deploy).
annotations = {
"keel.sh/policy" = "force"
"keel.sh/trigger" = "poll"
"keel.sh/pollSchedule" = "@every 5m"
"keel.sh/match-tag" = "true"
}
labels = {
app = "k8s-portal"
tier = var.tier
@ -66,9 +75,16 @@ resource "kubernetes_deployment" "k8s_portal" {
}
spec {
# GHCR pull secret: the ghcr-credentials Secret in this namespace is
# cloned in by the kyverno stack's sync-ghcr-credentials ClusterPolicy
# (allowlisted private-ghcr namespaces only; ADR-0002). Source of
# truth: stacks/kyverno/modules/kyverno/ghcr-credentials.tf.
image_pull_secrets {
name = "ghcr-credentials"
}
container {
name = "portal"
image = "viktorbarzin/k8s-portal:latest"
image = "ghcr.io/viktorbarzin/k8s-portal:latest"
port {
container_port = 3000
}
@ -121,7 +137,8 @@ resource "kubernetes_deployment" "k8s_portal" {
# DRIFT_WORKAROUND: CI pipeline owns image tag (kubectl set image from Woodpecker/GHA); Kyverno mutates dns_config for ndots. Reviewed 2026-04-18.
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
spec[0].template[0].spec[0].container[0].image, # CI updates image tag
spec[0].template[0].spec[0].container[0].image, # Keel manages ghcr:latest digest
metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1 (Keel stamps on roll)
]
}
}
@ -172,5 +189,5 @@ module "ingress_setup_script" {
ingress_path = ["/setup/script", "/agent"]
tls_secret_name = var.tls_secret_name
# auth = "none": Setup script + agent endpoint must be curl-able without auth (no cookies preserved in automation).
auth = "none"
auth = "none"
}


@ -27,6 +27,10 @@ locals {
# openclaw's install-recruiter-plugin init container pulls the PRIVATE
# ghcr.io/viktorbarzin/recruiter-responder:latest image (infra#27).
"openclaw",
# k8s-portal: last in-cluster image build, migrated to GHA->ghcr (ADR-0002,
# "no local builds"). ghcr.io/viktorbarzin/k8s-portal:latest is PRIVATE
# (infra repo default); the deployment references the cloned secret.
"k8s-portal",
]
}


@ -553,7 +553,7 @@ resource "kubernetes_deployment" "openclaw" {
# IfNotPresent: a cached stale :latest meant the plugin manifest
# (configSchema fix) never got pulled. An uncached SHA forces the
# pull. Bump this when the openclaw plugin in nextcloud-todos changes.
image = "forgejo.viktorbarzin.me/viktor/nextcloud-todos:f85c6de1"
image = "ghcr.io/viktorbarzin/nextcloud-todos:latest"
image_pull_policy = "Always"
command = ["sh", "-c", <<-EOT
set -eu


@ -0,0 +1,136 @@
// t3-afk auto-pair dispatcher
// ----------------------------------------------------------------------------
// Replicates the devvm t3-dispatch experience for the single in-cluster T3
// instance. The ingress is Authentik-gated (auth=required), so every request
// that reaches here is already authenticated. On a cookieless *document*
// navigation we mint a one-time pairing credential (`t3 auth pairing create`)
// and exchange it at the t3 server's /api/auth/browser-session endpoint for the
// `t3_session` cookie, then 302 back — so the user never sees the manual
// /pair#token screen. Everything else (incl. WebSocket upgrades for the cockpit
// live stream + terminals) is reverse-proxied straight through to t3 serve.
//
// Single upstream, same pod (localhost) — kept dependency-free (Node stdlib).
'use strict';
const http = require('http');
const net = require('net');
const { execFile } = require('child_process');
const UPSTREAM_HOST = '127.0.0.1';
const UPSTREAM_PORT = Number(process.env.T3_UPSTREAM_PORT || 3773);
const LISTEN_PORT = Number(process.env.DISPATCHER_PORT || 8080);
const T3_BIN = process.env.T3_BIN || '/data/npm-global/bin/t3';
const BASE_DIR = process.env.T3CODE_HOME || '/data/t3';
const COOKIE = 't3_session';
const childEnv = { ...process.env, PATH: '/data/npm-global/bin:' + (process.env.PATH || ''), HOME: '/home/node' };
const hasSession = (req) =>
(req.headers.cookie || '').split(/;\s*/).some((c) => c.startsWith(COOKIE + '='));
const isDocNav = (req) => {
if (req.method !== 'GET') return false;
const dest = req.headers['sec-fetch-dest'];
if (dest) return dest === 'document';
return (req.headers['accept'] || '').includes('text/html');
};
const mintCredential = () =>
new Promise((resolve, reject) => {
execFile(
T3_BIN,
['auth', 'pairing', 'create', '--base-dir', BASE_DIR, '--ttl', '5m', '--json'],
{ env: childEnv, timeout: 15000 },
(err, stdout) => {
if (err) return reject(err);
try {
const cred = JSON.parse(stdout).credential;
cred ? resolve(cred) : reject(new Error('no credential in pairing output'));
} catch (e) {
reject(e);
}
},
);
});
const exchange = (credential) =>
new Promise((resolve, reject) => {
const body = JSON.stringify({ credential });
const r = http.request(
{
host: UPSTREAM_HOST,
port: UPSTREAM_PORT,
path: '/api/auth/browser-session',
method: 'POST',
headers: { 'content-type': 'application/json', 'content-length': Buffer.byteLength(body) },
},
(resp) => {
const setCookie = resp.headers['set-cookie'] || [];
resp.resume();
resp.on('end', () =>
resp.statusCode === 200 && setCookie.length
? resolve(setCookie)
: reject(new Error('browser-session exchange returned ' + resp.statusCode)),
);
},
);
r.on('error', reject);
r.write(body);
r.end();
});
const proxyHttp = (req, res) => {
const up = http.request(
{ host: UPSTREAM_HOST, port: UPSTREAM_PORT, path: req.url, method: req.method, headers: req.headers },
(r) => {
res.writeHead(r.statusCode, r.headers);
r.pipe(res);
},
);
up.on('error', () => {
if (!res.headersSent) res.writeHead(502);
res.end('bad gateway');
});
req.pipe(up);
};
const server = http.createServer(async (req, res) => {
if (req.url === '/healthz') {
res.writeHead(200);
return res.end('ok');
}
if (!hasSession(req) && isDocNav(req)) {
try {
const cred = await mintCredential();
const setCookie = await exchange(cred);
res.writeHead(302, { location: req.url || '/', 'set-cookie': setCookie, 'cache-control': 'no-store' });
return res.end();
} catch (err) {
// Fall through to a plain proxy; the cockpit's own /pair screen is the
// fallback if auto-pair ever fails, so we never hard-fail the request.
console.error('auto-pair failed, proxying through:', err.message);
}
}
proxyHttp(req, res);
});
// WebSocket / Upgrade passthrough — the cockpit's live orchestration stream and
// terminals need this. Reconstruct the upgrade request and splice the sockets.
server.on('upgrade', (req, socket, head) => {
const up = net.connect(UPSTREAM_PORT, UPSTREAM_HOST, () => {
up.write(
`${req.method} ${req.url} HTTP/1.1\r\n` +
Object.entries(req.headers)
.map(([k, v]) => `${k}: ${v}`)
.join('\r\n') +
'\r\n\r\n',
);
if (head && head.length) up.write(head);
socket.pipe(up);
up.pipe(socket);
});
up.on('error', () => socket.destroy());
socket.on('error', () => up.destroy());
});
server.listen(LISTEN_PORT, '0.0.0.0', () =>
console.log(`t3-afk dispatcher listening on :${LISTEN_PORT} -> ${UPSTREAM_HOST}:${UPSTREAM_PORT}`),
);
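Review note: the dispatcher only mints a pairing credential when two conditions hold: no `t3_session` cookie is present, and the request is a document navigation (GET with `Sec-Fetch-Dest: document`, or an HTML `Accept` header when `Sec-Fetch-Dest` is absent). Since that gate decides who gets a session, it is worth testing in isolation. The sketch below restates `hasSession` and `isDocNav` as pure Python functions. It is a model for reasoning and tests, not a replacement for the Node code; it assumes header names arrive lower-cased (as Node's `req.headers` provides them), and `should_auto_pair` is a hypothetical name for the combined condition.

```python
import re

COOKIE = "t3_session"


def has_session(headers: dict) -> bool:
    """True if the Cookie header carries a t3_session cookie."""
    parts = re.split(r";\s*", headers.get("cookie", ""))
    return any(part.startswith(COOKIE + "=") for part in parts)


def is_doc_nav(method: str, headers: dict) -> bool:
    """True for a top-level document navigation, per the dispatcher's rules."""
    if method != "GET":
        return False
    dest = headers.get("sec-fetch-dest")
    if dest:
        return dest == "document"
    # No Sec-Fetch-Dest header: fall back to the Accept header.
    return "text/html" in headers.get("accept", "")


def should_auto_pair(method: str, headers: dict) -> bool:
    """Combined gate (hypothetical name): cookieless document navigation."""
    return not has_session(headers) and is_doc_nav(method, headers)
```

Note that an explicit `Sec-Fetch-Dest` always wins over `Accept`, so a fetch or XHR that happens to accept `text/html` does not trigger pairing.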


@ -0,0 +1,59 @@
# issue-implementer — autonomous AFK coding agent
You are **issue-implementer**, an autonomous agent that implements ONE GitHub
issue end-to-end and lands it, with no human at the keyboard. This file is your
standing behaviour; the specific task arrives as your prompt. You run inside a
T3 Code thread in `full-access` mode (skip-permissions) — there is no one to
answer questions mid-run.
## Autonomy — non-negotiable (you will hang otherwise)
- **Never enter plan mode and never call `ExitPlanMode`.** It is intercepted and
will stall this thread forever.
- **Never ask clarifying questions / never call `AskUserQuestion`.** No human is
watching. Make the most reasonable assumption, state it in a commit message or
your final message, and proceed.
- If you hit something you genuinely cannot resolve safely, **stop and write a
precise blocker report as your final message** (what you tried, what's
unresolved, what you'd need). Do not thrash. The orchestrator escalates it to a
human — that is the only "ask for help" channel you have.
## What to do
1. **Understand the task.** Your prompt contains the issue (number, what to
build, acceptance criteria). Read the issue's AGENT-BRIEF if present.
2. **Work in the prepared worktree.** You are already in a git worktree on a
branch off `master`. Read the repo's own `CLAUDE.md`, `CONTEXT.md`, and any
`docs/adr/` in the area you touch — use its domain vocabulary and respect its
decisions.
3. **Test-first (TDD).** Write a failing test that captures the desired
behaviour, make it pass, then refactor. Prefer property/parameterized tests.
Run the repo's actual test suite and get it green before you commit. Do not
test implementation details — test external behaviour.
4. **Commit.** Subject = what changed; body = why, paraphrasing the issue in
plain words. Include `Closes #<issue-number>` and the trailer
`Implemented-by: issue-implementer (AFK)`. Stage files by name; never use
`git add -A` or `git add .`. Never skip hooks.
5. **Land it.** Push your branch to `master` (`git push origin HEAD:master`). If
the push is rejected non-fast-forward, fetch, merge `origin/master`, re-run
the tests, and push again. Pushing to `master` is the intended behaviour —
CI builds and deploys from there.
6. **Report.** Your final message is a concise summary: what you built, the
commit, and anything a reviewer should know. (CI/deploy watching and any
fix-forward/freeze handling are done by the control plane, not by you — once
you've pushed green code, your job is done.)
## Guardrails (hard limits)
- **Never force-push** to `master`.
- **Never delete PVCs/PVs**, drop database tables, or run destructive data ops.
- **Never edit Vault directly**, and never commit secrets.
- **Infrastructure changes go through Terraform/Terragrunt only** — never
`kubectl apply/edit/patch` as the final state.
- **Never use `[ci skip]`** — it hides the change from the audit feed.
- Stay within the issue's scope. Don't refactor adjacent code beyond what the
task needs.
## Done means
Tests green **and** pushed to `master`. Not "code written" — landed.
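
A minimal sketch of step 5's landing loop, under stated assumptions: the `run` callable, the `max_attempts` bound, and the `make test` command are hypothetical placeholders (the repo's real test command goes there), not part of this file's contract.

```python
import subprocess


def _default_run(*cmd):
    """Run a command and report success as a bool (assumed helper)."""
    return subprocess.run(list(cmd), check=False).returncode == 0


def land(run=_default_run, max_attempts=3):
    """Push HEAD to master; on rejection fetch, merge, re-test, then retry.

    Never force-pushes. Returns True once a push succeeds, False if any
    step fails or the (assumed) attempt budget runs out.
    """
    for _ in range(max_attempts):
        if run("git", "push", "origin", "HEAD:master"):
            return True
        # Push rejected (assumed non-fast-forward): bring master in by merge.
        if not run("git", "fetch", "origin"):
            return False
        if not run("git", "merge", "origin/master"):
            return False
        # Hypothetical test command; substitute the repo's actual suite.
        if not run("make", "test"):
            return False
    return False
```

A `False` result corresponds to the blocker-report path: stop and describe what failed rather than retrying indefinitely.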

408
stacks/t3-afk/main.tf Normal file

@ -0,0 +1,408 @@
# =============================================================================
# t3-afk: a dedicated, in-cluster T3 Code instance that is the EXECUTOR + COCKPIT for the
# AFK implementation pipeline (slice #2 of claude-agent-service PRD #1).
#
# claude-agent-service (control plane) dispatches issues INTO this T3 instance
# over its orchestration HTTP API; T3 runs the issue-implementer agent in a git
# worktree and shows every worker in its cockpit. See:
# claude-agent-service/docs/2026-06-14-afk-implementation-pipeline-design.md
# claude-agent-service/docs/adr/0003-t3-thin-executor-and-cockpit.md
#
# PILOT SHORTCUT (chosen 2026-06-14): no custom-built image. We run stock
# `node:24` (the full image ships git + python3/make/g++ for node-pty) and an
# init container installs npm packages (t3 pinned at 0.0.27; the Claude CLI
# currently tracks `latest`) onto the SSD PVC, cached across restarts. Formalize
# a digest-pinned built image post-GO. T3 is version-pinned (npm) and NOT
# Keel-enrolled.
# =============================================================================
# No plan-time Vault reads: every secret flows through the ExternalSecret below
# (CLAUDE_CODE_OAUTH_TOKEN / GITHUB_TOKEN / FORGEJO_TOKEN), injected as env at
# runtime. Nothing here needs a secret value at plan time.
# Wildcard TLS secret name; the value comes from config.tfvars and is consumed by
# the ingress factory (every stack that uses the factory declares this).
variable "tls_secret_name" {}
locals {
namespace = "t3-afk"
# Stock node base: the FULL node:24 (not -slim) is buildpack-deps-based, so it
# ships git + build-essential (python3/make/g++) that node-pty + the agent need.
# Fully-qualified (docker.io/library/...) to satisfy the Kyverno
# require-trusted-registries allowlist via `docker.io/*`; bare `node*` is NOT
# on the bare-DockerHub-library list (alpine*/busybox*/python* are).
image = "docker.io/library/node:24"
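# npm versions installed at startup (the reproducibility anchor for the pilot
# until a digest-pinned image exists). Only t3 is pinned; the Claude CLI floats
# on `latest` and is resolved once, when the init container first installs.
t3_version = "0.0.27"
claude_cli_version = "latest" # @anthropic-ai/claude-code (not pinned)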
labels = {
app = "t3-afk"
}
}
# --- Namespace ---
resource "kubernetes_namespace" "t3_afk" {
metadata {
name = local.namespace
labels = {
tier = local.tiers.aux
}
}
}
# --- Secrets ---
# The Claude provider authenticates with CLAUDE_CODE_OAUTH_TOKEN (T3 passes the
# environment straight through to the embedded claude-agent-sdk + claude CLI).
# GITHUB_TOKEN / FORGEJO_TOKEN authenticate the agent's `git push` from worktrees
# (wired into ~/.gitconfig insteadOf rewrites in the container command).
resource "kubernetes_manifest" "external_secret" {
manifest = {
apiVersion = "external-secrets.io/v1beta1"
kind = "ExternalSecret"
metadata = {
name = "t3-afk-secrets"
namespace = local.namespace
}
spec = {
refreshInterval = "15m"
secretStoreRef = {
name = "vault-kv"
kind = "ClusterSecretStore"
}
target = { name = "t3-afk-secrets" }
data = [
{
secretKey = "CLAUDE_CODE_OAUTH_TOKEN"
remoteRef = { key = "claude-agent-service", property = "claude_oauth_token" }
},
{
secretKey = "GITHUB_TOKEN"
remoteRef = { key = "viktor", property = "github_pat" }
},
{
# Shared viktor-scoped admin PAT (also used by Woodpecker + the
# claude-agent pod). Lets the agent git push / open PRs on Forgejo.
secretKey = "FORGEJO_TOKEN"
remoteRef = { key = "ci/global", property = "forgejo_push_token" }
},
]
}
}
depends_on = [kubernetes_namespace.t3_afk]
}
# issue-implementer behaviour is intentionally NOT mounted as ~/.claude/CLAUDE.md:
# T3's SDK invocation doesn't honor it, and mounting a subPath into ~/.claude
# makes that dir root-owned and breaks the agent's Bash session-env. The control
# plane injects the behaviour as a dispatch message preamble instead;
# files/issue-implementer-CLAUDE.md is kept as the canonical source for that text.
# Auto-pair dispatcher script (run by the sidecar container below). Mirrors the
# devvm t3-dispatch: on a cookieless, Authentik-gated page load it mints a
# pairing credential and exchanges it for the t3_session cookie, so the user
# never sees the manual /pair screen. Reverse-proxies everything else (incl.
# WebSockets) to t3 serve.
resource "kubernetes_config_map" "dispatcher" {
metadata {
name = "t3-afk-dispatcher"
namespace = kubernetes_namespace.t3_afk.metadata[0].name
}
data = {
"dispatcher.js" = file("${path.module}/files/dispatcher.js")
}
}
# --- Storage ---
# SSD-NFS (small-file friendly) for the T3 base dir: state.sqlite + the
# server-signing-key (losing it invalidates every issued bearer), per-thread git
# worktrees, the npm global install, and caches. ADR 0004.
module "data" {
source = "../../modules/kubernetes/nfs_volume"
name = "t3-afk-data"
namespace = kubernetes_namespace.t3_afk.metadata[0].name
nfs_server = "192.168.1.127"
nfs_path = "/srv/nfs-ssd/t3-afk-data"
storage = "30Gi"
}
# --- Deployment ---
resource "kubernetes_deployment" "t3_afk" {
# Slow first start (image pull + npm install init + ESO secret sync) can
# exceed the default rollout-wait timeout; verify pod readiness out-of-band.
wait_for_rollout = false
metadata {
name = "t3-afk"
namespace = kubernetes_namespace.t3_afk.metadata[0].name
labels = local.labels
# keel.sh/policy=never must be a DEPLOYMENT-level annotation; that's where
# Keel reads it. (A pod-template label is ignored by Keel, which is why the
# earlier attempt failed.) The cluster's Kyverno inject-keel-annotations
# policy is opt-OUT: it stamps policy=patch on any workload that doesn't
# carry its own keel.sh/policy, and Keel then "patch"-downgraded
# node:24 -> node:24.0.2 (below t3@0.0.27's required node >=24.10), which
# crash-looped `t3 serve`. ADR 0003 (Keel-excluded).
annotations = {
"keel.sh/policy" = "never"
}
}
spec {
replicas = 1
# Single-writer state.sqlite: never run two pods against the same base dir.
strategy {
type = "Recreate"
}
selector {
match_labels = local.labels
}
template {
metadata {
labels = local.labels
}
spec {
security_context {
run_as_user = 1000 # node
run_as_group = 1000
fs_group = 1000
}
# NFS mounts land root-owned; make /data writable by uid 1000.
init_container {
name = "fix-perms"
image = "busybox:1.37"
command = ["sh", "-c", "mkdir -p /data && chown -R 1000:1000 /data && chmod 0775 /data"]
security_context {
run_as_user = 0
}
volume_mount {
name = "data"
mount_path = "/data"
}
resources {
requests = { memory = "32Mi" }
limits = { memory = "64Mi" }
}
}
# Install pinned t3 + Claude CLI onto the PVC (cached; skipped whenever the t3
# binary already exists, whatever its version). Runs as uid 1000 so the install
# is owned by the runtime user.
init_container {
name = "install-t3"
image = local.image
command = ["bash", "-c", <<-EOF
set -e
export npm_config_cache=/data/npm-cache
export npm_config_prefix=/data/npm-global
mkdir -p /data/npm-global /data/npm-cache
if [ ! -x /data/npm-global/bin/t3 ]; then
echo "installing t3@${local.t3_version} + claude CLI ..."
npm install -g "t3@${local.t3_version}" "@anthropic-ai/claude-code@${local.claude_cli_version}"
else
echo "t3 already installed: $(/data/npm-global/bin/t3 --version 2>/dev/null || echo unknown)"
fi
EOF
]
volume_mount {
name = "data"
mount_path = "/data"
}
resources {
requests = { cpu = "200m", memory = "512Mi" }
limits = { memory = "1Gi" }
}
}
container {
name = "t3"
image = local.image
# Configure git auth for the agent's pushes, then run T3 headless.
# $${...} escapes Terraform interpolation so the shell expands the env vars.
# The escape only applies before `{`; a bare `$$` reaches bash as its PID.
command = ["bash", "-c", <<-EOF
set -e
export PATH="/data/npm-global/bin:$${PATH}"
export npm_config_cache=/data/npm-cache
# git identity + token rewrites so the agent can push from worktrees.
git config --global user.name "issue-implementer (AFK)"
git config --global user.email "afk-agent@viktorbarzin.me"
# Two insteadOf values share one url.<base> key, so the second needs --add;
# a plain set would replace the https rewrite with the ssh one.
git config --global url."https://$${GITHUB_TOKEN}@github.com/".insteadOf "https://github.com/"
git config --global --add url."https://$${GITHUB_TOKEN}@github.com/".insteadOf "git@github.com:"
if [ -n "$${FORGEJO_TOKEN}" ]; then
git config --global url."https://$${FORGEJO_TOKEN}@forgejo.viktorbarzin.me/".insteadOf "https://forgejo.viktorbarzin.me/"
fi
exec t3 serve --mode web --host 0.0.0.0 --port 3773 --base-dir /data/t3
EOF
]
port {
container_port = 3773
}
env_from {
secret_ref {
name = "t3-afk-secrets"
}
}
env {
name = "HOME"
value = "/home/node"
}
env {
name = "T3CODE_HOME"
value = "/data/t3"
}
# T3's API needs auth even for liveness; use a TCP probe on the port.
liveness_probe {
tcp_socket {
port = 3773
}
initial_delay_seconds = 30
period_seconds = 30
}
readiness_probe {
tcp_socket {
port = 3773
}
initial_delay_seconds = 15
period_seconds = 10
}
volume_mount {
name = "data"
mount_path = "/data"
}
# NOTE: do NOT mount anything into /home/node/.claude. A subPath
# mount makes that dir root-owned, which blocks the agent (uid 1000)
# from creating its Bash session-env there and breaks ALL Bash/git for
# the agent (root cause of the 2026-06-15 "agent never commits"). T3's
# SDK invocation doesn't honor ~/.claude/CLAUDE.md anyway, so the
# issue-implementer behaviour is injected via the dispatch message
# preamble by the control plane instead.
# Burstable (tier-aux). A live agent thread (node + claude) is
# memory-heavy; size for a small number of concurrent threads on this
# pilot instance. No CPU limit per cluster policy.
resources {
requests = {
cpu = "1"
memory = "2Gi"
}
# Capped at the tier-aux LimitRange max (4Gi/container). If real
# workloads OOM, opt the namespace out via the
# resource-governance/custom-limitrange label (as claude-agent-service
# does) and raise this.
limits = {
memory = "4Gi"
}
}
}
# Auto-pair dispatcher (sidecar). The Service points at this (:8080); it
# reverse-proxies to t3 serve (:3773) and injects the session cookie so
# the browser experience matches t3.viktorbarzin.me. Shares /data so it
# can exec the t3 CLI to mint pairing credentials.
container {
name = "dispatcher"
image = local.image
command = ["node", "/scripts/dispatcher.js"]
port {
container_port = 8080
}
env {
name = "HOME"
value = "/home/node"
}
readiness_probe {
http_get {
path = "/healthz"
port = 8080
}
initial_delay_seconds = 10
period_seconds = 10
}
volume_mount {
name = "data"
mount_path = "/data"
}
volume_mount {
name = "dispatcher"
mount_path = "/scripts"
}
resources {
requests = { cpu = "50m", memory = "64Mi" }
limits = { memory = "256Mi" }
}
}
volume {
name = "data"
persistent_volume_claim {
claim_name = module.data.claim_name
}
}
volume {
name = "dispatcher"
config_map {
name = kubernetes_config_map.dispatcher.metadata[0].name
}
}
}
}
}
lifecycle {
ignore_changes = [
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
# Kyverno's inject-keel-annotations stamps pollSchedule/trigger alongside
# the policy; we own keel.sh/policy=never above, but ignore these two so
# they don't perpetually drift the plan.
metadata[0].annotations["keel.sh/pollSchedule"],
metadata[0].annotations["keel.sh/trigger"],
]
}
}
# --- Service ---
resource "kubernetes_service" "t3_afk" {
metadata {
name = "t3-afk"
namespace = kubernetes_namespace.t3_afk.metadata[0].name
labels = local.labels
}
spec {
selector = local.labels
# Route to the auto-pair dispatcher sidecar (:8080), which reverse-proxies
# to t3 serve (:3773) after injecting the t3_session cookie.
port {
port = 3773
target_port = 8080
}
type = "ClusterIP"
}
}
# --- Ingress ---
# The cockpit has no built-in user auth, so Authentik forward-auth is the gate.
module "ingress" {
source = "../../modules/kubernetes/ingress_factory"
auth = "required"
dns_type = "proxied"
namespace = kubernetes_namespace.t3_afk.metadata[0].name
name = "t3-afk"
service_name = kubernetes_service.t3_afk.metadata[0].name
port = 3773
tls_secret_name = var.tls_secret_name
}

@ -0,0 +1,18 @@
include "root" {
path = find_in_parent_folders()
}
dependency "platform" {
config_path = "../platform"
skip_outputs = true
}
dependency "vault" {
config_path = "../vault"
skip_outputs = true
}
dependency "external-secrets" {
config_path = "../external-secrets"
skip_outputs = true
}

@ -225,8 +225,11 @@ module "ingress_ro" {
# https://forgejo.viktorbarzin.me/viktor/terminal-lobby
#
# That repo's ./scripts/deploy.sh ships everything to wizard@10.0.10.10
# and restarts ttyd / ttyd-ro / tmux-api / clipboard-upload. This stack
# only owns the Kubernetes side: Services, Endpoints pointing at
# and restarts ttyd / ttyd-ro / tmux-api / clipboard-upload. Deploy is
# MANUAL via that script; there is no CI pipeline (the lobby's
# .woodpecker.yml was removed under ADR-0002, issue #31; it builds no
# image, so it is not part of the GHA->ghcr fleet). This stack only owns
# the Kubernetes side: Services, Endpoints pointing at
# 10.0.10.10:{7681,7682,7683,7684}, the IngressRoutes, and the Traefik
# middlewares that gate everything behind Authentik forward-auth.
#

@ -344,6 +344,31 @@ resource "kubernetes_manifest" "middleware_tripit_rate_limit" {
depends_on = [helm_release.traefik]
}
# Health-specific rate limit. The redesigned, data-dense SPA loads the shell
# (JS chunks + two self-hosted Geist woff2) plus a 5-8 call API burst per page,
# and fast tab-to-tab navigation from one client IP blows past the default
# 10/50 limiter, 429ing the tail so cards/pages render empty (fifth instance
# of the burst pattern, after ha-sofia, ActualBudget, noVNC and tripit). Burst
# absorbs a couple of full page loads back-to-back.
resource "kubernetes_manifest" "middleware_health_rate_limit" {
manifest = {
apiVersion = "traefik.io/v1alpha1"
kind = "Middleware"
metadata = {
name = "health-rate-limit"
namespace = kubernetes_namespace.traefik.metadata[0].name
}
spec = {
rateLimit = {
average = 100
burst = 1000
}
}
}
depends_on = [helm_release.traefik]
}
# Compress responses to clients at the entrypoint level (outermost).
# Applied at websecure entrypoint so all responses get compressed.
# Uses includedContentTypes (whitelist) instead of excludedContentTypes:

@ -125,6 +125,11 @@ locals {
# (older images crash-loop on the unknown enum) landed after that
# image rolled out, same hold-order as FARE/CALENDAR/RESEARCH above.
CITY_IMAGE_PROVIDER = "wikipedia"
# Re-applied 2026-06-14: a69847a0 (the commit that added this) was never
# terraform-applied: its CI run skipped the tripit stack (changed-stack
# diff race), so the env never landed in-cluster and the provider fell back
# to the fake 1x1-PNG, leaving every trip/stay cover blank. This touch forces
# the tripit stack to re-apply and reconcile the drift.
# Tour-guide content pipeline (tripit#24/#25): these three default to `fake`
# in tripit's config, which is what shipped dark on 2026-06-08; prod only
# ever showed the placeholder "Sight 1". Real providers: Wikipedia GeoSearch

@ -6,5 +6,5 @@ variable "tls_secret_name" {
variable "image_tag" {
type = string
default = "latest"
description = "tuya_bridge image tag pushed to forgejo.viktorbarzin.me/viktor/tuya_bridge. Each Woodpecker run does `kubectl set image` to the 8-char git SHA; this variable is only used on initial create / TF recreate (image is in lifecycle.ignore_changes)."
description = "tuya_bridge image tag at ghcr.io/viktorbarzin/tuya_bridge (built by GHA, ADR-0002). The GHA deploy job drives a Woodpecker `kubectl set image` to the 8-char git SHA; this variable is only used on initial create / TF recreate (image is in lifecycle.ignore_changes)."
}

@ -0,0 +1,29 @@
# Uptime Kuma — Context
Glossary for the uptime-kuma monitoring context. Terms only — no implementation
detail. Decisions live in `docs/adr/`.
## Glossary
**Active check (poll)** — Uptime Kuma actively probes a target on an interval
(HTTP / TCP / ping / DB). This is *polling*, not "scraping." Prometheus *scrapes*
exporters; Kuma *polls* targets. (Note: Prometheus does **not** scrape Kuma — a
separate monitoring lane.)
**Monitor** — one configured target plus its check definition.
**Internal monitor** — probes a service on its in-cluster address
(`*.svc.cluster.local`). Answers "is the service itself healthy?"
**`[External]` monitor** — probes a service via its full public path
(DNS → Cloudflare → cloudflared tunnel → Traefik). Answers "is the service
reachable the way users reach it?" Maintained one-per-externally-reachable-service
by deliberate choice (see ADR-0001).
**Heartbeat** — one recorded check result (up/down + latency), persisted to the
datastore.
**External-access divergence** — the condition where a service is healthy
*internally* but its `[External]` path is down — i.e. the shared
Cloudflare/tunnel/Traefik path is broken while the service itself is fine.
Surfaced by the `ExternalAccessDivergence` alert.

@ -0,0 +1,45 @@
# ADR-0001: Uptime Kuma is intentionally lean — sizing & placement
## Status
Accepted (2026-06-13)
## Context
A review was prompted by a suspicion that Kuma was "scraping too much / causing
unnecessary traffic," itself triggered by a socket.io login-timeout incident on
the monitor-sync CronJobs. Measured state at review time:
- **227 active monitors**; 209 of them at 300s intervals; **~1 check/sec** aggregate.
- Datastore: the **shared `mysql.dbaas`** (MariaDB), **~77 MB**, ~1 heartbeat
write/sec, 30-day retention.
- **122 `[External]` monitors** (full public path) + ~105 internal.
The data did **not** support a load problem — Kuma is already lean. The
login-timeout incident was a Kuma 2.x socket.io quirk (Kuma's single Node event
loop briefly stalling), fixed separately by wrapping login in a retry — not a
load issue.
## Decisions
1. **Keep Kuma as-is; do not reflexively cut monitors or intervals.** Poll rate
(~1/s) and DB footprint (77 MB) are modest.
2. **`[External]` monitors stay per-service** (one per externally-reachable
service), **not** a small canary set. Rejected cutting to ~6-10 canaries:
although the Cloudflare → tunnel → Traefik path is shared infra that fails as a
unit, per-service external probes also catch *single-service* external
misconfig (one service's DNS / auth carve-out / route), which canaries miss.
The ~35k Cloudflare requests/day this generates is accepted for that coverage.
3. **Datastore stays on the shared `mysql.dbaas`.** Rejected moving to
self-contained SQLite or a dedicated DB. The coupling — Kuma depends on the
single-instance MySQL it also helps monitor, including during that MySQL's
8.4.9 wipe-maintenance (bead code-963q) — is acknowledged but accepted as
low-impact for now.
## Consequences
- All three decisions are **cheap to reverse**; revisit if measured load on
`mysql.dbaas` or Cloudflare ever becomes a real (not gut-feel) problem. This
ADR exists mainly so that review isn't re-run from scratch.
- **Known gap:** the *internal* monitor-sync creates/updates monitors but does
**not** prune orphans (the external sync does). Internal monitors for deleted
services linger and need periodic manual cleanup — e.g. the stale
"Goldilocks (VPA)" monitor (target removed with VPA on 2026-06-12) was deleted
by hand on 2026-06-13. A *scoped* internal-prune (only deleting monitors the
sync owns, never hand-made ones) is a possible future improvement.
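
The request figure in decision 2 can be sanity-checked with a few lines of arithmetic. This is a sketch that assumes every `[External]` monitor polls at the 300s interval and that each poll is exactly one Cloudflare request; neither is stated outright above.

```python
SECONDS_PER_DAY = 86_400


def checks_per_day(monitors: int, interval_s: int) -> int:
    """Polls per day for `monitors` targets that all share one interval."""
    return monitors * SECONDS_PER_DAY // interval_s


def checks_per_second(monitors: int, interval_s: int) -> float:
    """Average poll rate for `monitors` targets that all share one interval."""
    return monitors / interval_s


# 122 [External] monitors, assumed to all poll every 300s.
external_per_day = checks_per_day(122, 300)    # 35136, i.e. the "~35k" figure

# The 209 monitors known to run at 300s contribute about 0.70 checks/sec.
rate_from_300s = checks_per_second(209, 300)

print(external_per_day, round(rate_from_300s, 2))
```

The 18 monitors on other intervals are not included, so the second figure is a lower bound on the aggregate rate.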

@ -503,8 +503,27 @@ except (urllib.error.URLError, OSError, KeyError, ValueError) as e:
print(f"Loaded {len(targets)} external monitor targets (source={source})")
api = UptimeKumaApi(UPTIME_KUMA_URL, timeout=120, wait_events=0.2)
api.login("admin", UPTIME_KUMA_PASS)
api = None
for _login_try in range(1, 6):
try:
api = UptimeKumaApi(UPTIME_KUMA_URL, timeout=120, wait_events=0.2)
api.login("admin", UPTIME_KUMA_PASS)
break
except Exception as _login_err:
# kuma 2.x's single Node event loop intermittently stalls under its
# ~300 monitors, so the socket.io login handshake times out. Retry a
# few times across a ~60s window to ride out the stall instead of
# failing the whole sync job (which fired JobFailed -> Slack noise).
print(f"WARN: Kuma login attempt {_login_try}/5 failed: {_login_err!r}")
if api is not None:
try:
api.disconnect()
except Exception:
pass
api = None
if _login_try == 5:
raise
time.sleep(15)
monitors = api.get_monitors()
existing_external = {}
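
Both sync scripts in this diff carry the same login retry loop. One possible way to factor it is sketched below, under assumptions: `make_client` and `login` are hypothetical callables that would wrap `UptimeKumaApi(...)` and `api.login(...)`, and the helper itself is not part of the commit.

```python
import time


def login_with_retry(make_client, login, attempts=5, delay=15.0, sleep=time.sleep):
    """Return a logged-in client, retrying connect + login on any failure.

    make_client: hypothetical zero-argument factory for a fresh client.
    login:       hypothetical callable that logs the given client in.
    The last failure is re-raised once `attempts` is exhausted.
    """
    for attempt in range(1, attempts + 1):
        client = None
        try:
            client = make_client()
            login(client)
            return client
        except Exception as err:
            print(f"WARN: login attempt {attempt}/{attempts} failed: {err!r}")
            # Drop the half-open connection before trying again.
            if client is not None:
                try:
                    client.disconnect()
                except Exception:
                    pass
            if attempt == attempts:
                raise
            sleep(delay)
```

With the defaults this sleeps four times between five attempts, matching the ~60s window described in the loop's comment (connection timeouts come on top of that).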
@ -818,8 +837,27 @@ UPTIME_KUMA_PASS = os.environ["UPTIME_KUMA_PASSWORD"]
with open("/config/targets.json") as f:
targets = json.load(f)
api = UptimeKumaApi(UPTIME_KUMA_URL, timeout=120, wait_events=0.2)
api.login("admin", UPTIME_KUMA_PASS)
api = None
for _login_try in range(1, 6):
try:
api = UptimeKumaApi(UPTIME_KUMA_URL, timeout=120, wait_events=0.2)
api.login("admin", UPTIME_KUMA_PASS)
break
except Exception as _login_err:
# kuma 2.x's single Node event loop intermittently stalls under its
# ~300 monitors, so the socket.io login handshake times out. Retry a
# few times across a ~60s window to ride out the stall instead of
# failing the whole sync job (which fired JobFailed -> Slack noise).
print(f"WARN: Kuma login attempt {_login_try}/5 failed: {_login_err!r}")
if api is not None:
try:
api.disconnect()
except Exception:
pass
api = None
if _login_try == 5:
raise
time.sleep(15)
existing = {m["name"]: m for m in api.get_monitors()}