Merge remote-tracking branch 'forgejo/master' into emo/fan-control-ha-actuator
All checks were successful
ci/woodpecker/push/default Pipeline was successful
commit 5bc3d27d1b
42 changed files with 3072 additions and 387 deletions
File diff suppressed because one or more lines are too long
@@ -42,12 +42,13 @@
| webhook_handler | Webhook processing | webhook_handler |
| tuya-bridge | Smart home bridge | tuya-bridge |
| android-emulator | Shared Android 16 test emulator (adb 10.0.20.200:5555, noVNC android-emulator.viktorbarzin.lan) | android-emulator |
| anisette | Self-hosted Apple anisette-data server (Dadoum/anisette-v3-server, digest-pinned) for sideloading the TripIt iOS Shell via SideStore; internal-only http://anisette.viktorbarzin.lan, auth=none, LAN-only, stateless | anisette |
| dawarich | Location history | dawarich |
| owntracks | Location tracking | owntracks |
| nextcloud | File sync/share | nextcloud |
| calibre | E-book management (may be merged into ebooks stack) | calibre |
| onlyoffice | Document editing | onlyoffice |
| f1-stream | F1 streaming (uses chrome-service for hmembeds verifier); source in own repo `viktor/f1-stream` (Forgejo, extracted 2026-06-05), Woodpecker-native build->deploy (repo id 166) | f1-stream |
| f1-stream | F1 streaming (uses chrome-service for hmembeds verifier); canonical source in own repo `viktor/f1-stream` (Forgejo, extracted 2026-06-05); GHA-built → `ghcr.io/viktorbarzin/f1-stream` (private), Woodpecker deploy-only (ADR-0002) | f1-stream |
| chrome-service | Headed Chromium over CDP (`http://chrome-service.chrome-service.svc:9222`, `connect_over_cdp`; legacy `:3000/<token>` WS pool removed 2026-06-04) for sibling services driving anti-bot pages — snapshot-harvester CronJob + tripit fare scrape | chrome-service |
| rybbit | Analytics | rybbit |
| isponsorblocktv | SponsorBlock for TV | isponsorblocktv |
36
.github/workflows/build-k8s-portal.yml
vendored
Normal file
@@ -0,0 +1,36 @@
name: Build k8s-portal

# ADR-0002 / no-local-builds: k8s-portal (infra-owned Go portal) builds off-infra
# on GHA → public ghcr; Keel polls ghcr:latest and rolls the deployment. Replaces
# the in-cluster .woodpecker/k8s-portal.yml build.
on:
  push:
    branches: [master]
    paths:
      - 'stacks/k8s-portal/modules/k8s-portal/files/**'
  workflow_dispatch: {}

permissions:
  contents: read
  packages: write

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: stacks/k8s-portal/modules/k8s-portal/files
          platforms: linux/amd64
          provenance: false
          push: true
          tags: |
            ghcr.io/viktorbarzin/k8s-portal:latest
            ghcr.io/viktorbarzin/k8s-portal:${{ github.sha }}
@@ -3,10 +3,6 @@
    "ha": {
      "type": "http",
      "url": "${HA_MCP_URL}"
    },
    "paperless": {
      "type": "http",
      "url": "http://paperless-mcp.paperless-mcp.svc.cluster.local/mcp"
    }
  }
}
@@ -1,49 +0,0 @@
when:
  event: push
  branch: master
  path:
    include:
      - "stacks/platform/modules/k8s-portal/files/**"

clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
      attempts: 5
      backoff: 10s

steps:
  - name: build-and-push
    image: woodpeckerci/plugin-docker-buildx
    settings:
      username: "viktorbarzin"
      password:
        from_secret: dockerhub-pat
      repo: viktorbarzin/k8s-portal
      dockerfile: stacks/platform/modules/k8s-portal/files/Dockerfile
      context: stacks/platform/modules/k8s-portal/files
      platforms:
        - linux/amd64
      tag: ["${CI_PIPELINE_NUMBER}", "latest"]
      cache_from: "viktorbarzin/k8s-portal:latest"
      cache_to: "type=inline"

  - name: deploy
    image: bitnami/kubectl:latest
    commands:
      - "kubectl set image deployment/k8s-portal portal=viktorbarzin/k8s-portal:${CI_PIPELINE_NUMBER} -n k8s-portal"
      - "kubectl rollout status deployment/k8s-portal -n k8s-portal --timeout=120s"
      - "echo 'k8s-portal deployed successfully (build ${CI_PIPELINE_NUMBER})'"

  - name: slack
    image: curlimages/curl
    commands:
      - |
        curl -s -X POST -H 'Content-type: application/json' \
          --data "{\"text\":\"K8s Portal: build #${CI_PIPELINE_NUMBER} ${CI_PIPELINE_STATUS}\"}" \
          "$SLACK_WEBHOOK" || true
    environment:
      SLACK_WEBHOOK:
        from_secret: slack_webhook
    when:
      status: [success, failure]
10
CONTEXT.md
|
|
@@ -173,13 +173,17 @@ The split where every owned image is built+pushed by GitHub Actions and Woodpeck
|
|||
_Avoid_: bare "Woodpecker pipeline" — say "build" or "deploy"; "fallback build" (the in-cluster fallback path was removed by ADR-0002).
|
||||
|
||||
**Canonical repo**:
|
||||
The Forgejo `viktor/<name>` repo — the only place commits land, workflow files included.
|
||||
_Avoid_: "upstream" (ambiguous); committing anywhere else.
|
||||
The Forgejo `viktor/<name>` repo — the only place commits land, workflow files included. Every first-party repo is Forgejo-canonical *except* an explicit set of **GitHub-first repos**. A clone keeps **only** the canonical remote (ADR-0003): the **GitHub mirror** is not a second push target.
|
||||
_Avoid_: "upstream" (ambiguous); committing anywhere else; keeping both remotes on a clone and hand-pushing to each (the dual-push habit that caused the 2026-06 divergence — ADR-0003).
|
||||
|
||||
**GitHub mirror**:
|
||||
The GitHub repo a **Canonical repo** push-mirrors to, one-way, so GitHub Actions can build from it; anything committed on the mirror is silently overwritten by the next sync.
|
||||
The GitHub repo a **Canonical repo** push-mirrors to, one-way (Forgejo's `push_mirrors`, `sync_on_commit`), so GitHub Actions can build from it; anything committed on the mirror is silently overwritten by the next sync — and enabling the mirror **force-overwrites** the GitHub side, so a diverged GitHub-only commit must be merged back into Forgejo *before* the mirror is turned on or it is lost.
|
||||
_Avoid_: treating it as a second writable remote; bare "the GitHub repo" without saying mirror.
|
||||
|
||||
**GitHub-first repo**:
|
||||
The deliberate exception to the **Canonical repo** rule — a repo whose canonical home is GitHub, so it sits outside the mirror policy. Two kinds: third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`), and a first-party repo intentionally kept public on GitHub (`health`). Single GitHub remote, never dual-pushed.
|
||||
_Avoid_: adding a Forgejo remote "for consistency"; treating one as a **Canonical repo**.
|
||||
|
||||
**Forgejo registry**:
|
||||
Forgejo's built-in container registry — since ADR-0002 a frozen archive holding one last-known-good tag per **Service**, not a build target; owned images live on ghcr.io.
|
||||
_Avoid_: "private registry" (collides with the registry VM's pull-through caches); pushing new images to it.
|
||||
|
|
|
|||
30
docs/adr/0003-keep-forgejo-canonical-complete-mirror.md
Normal file
|
|
@@ -0,0 +1,30 @@
|
|||
# Keep Forgejo as the canonical forge; complete the one-way GitHub mirror instead of swapping to GitHub
|
||||
|
||||
Status: accepted (extends ADR-0002)
|
||||
|
||||
## Context
|
||||
|
||||
Repo trees kept diverging between the Forgejo **Canonical repo** (`viktor/<name>`) and its **GitHub mirror**. A 2026-06-15 audit found the cause: an *incomplete rollout* of the Forgejo→GitHub push-mirror, not anything inherent to Forgejo. 14 repos carry **both** remotes and are hand-pushed to each (`push_mirrors = 0` on Forgejo — e.g. `infra`, `finance`, `Website`), so a human forgets one side and the trees drift; the ADR-0002-onboarded repos have a working one-way mirror (`push_mirrors = 1` — e.g. `tripit`, `recruiter-responder`) and never diverge. `infra/CONTEXT.md` already says Forgejo is the only place commits land and the GitHub mirror must never be a second writable remote — practice had simply drifted from the documented model.
|
||||
|
||||
The trigger was a proposal to swap Forgejo out for GitHub entirely. The grilling reframed it: the pain (divergence) is a "two writable remotes" problem, and the stated preference is self-hosted-primary with the remote as backup.
|
||||
|
||||
## Decision
|
||||
|
||||
Do **not** swap to GitHub. Reaffirm and *complete* the model already in `CONTEXT.md`:
|
||||
|
||||
- Every first-party repo has exactly **one** push target — its **Canonical repo** on Forgejo. GitHub is a one-way push-mirror (off-site backup + the source GitHub Actions builds from). **No repo is ever dual-pushed.**
|
||||
- A small, explicit set of **GitHub-first repos** are the exception (canonical lives on GitHub, outside the mirror policy): third-party clones/forks where GitHub is genuinely upstream (`jsoncrack.com`, `snmp_exporter`, `SparkyFitness`, `agent-rules-books`, `Plotting-Your-Dream-Book`) and the deliberately-public first-party `health`.
|
||||
- `infra` is reconciled into the standard model: its GitHub-only `.github/workflows/build-*.yml` are brought onto Forgejo-canonical (inert on Forgejo, active on the mirror), then the mirror is enabled — ending the deliberate divergence while keeping Woodpecker on the Forgejo forge.
|
||||
- Enforcement is **structural**: reconciled clones keep only the Forgejo remote, so there is no GitHub remote to habitually push to; the execution rule is "push to the canonical forge only, never the mirror."
|
||||
|
||||
## Considered options
|
||||
|
||||
- **Swap to GitHub (retire Forgejo).** Rejected: takes on a hard WAN dependency for *all* git ops — including `infra`, the repo you use to *recover* from outages — plus git-crypt secrets on GitHub as primary, a Woodpecker forge migration (WP authenticates against and watches Forgejo), and GitHub private-repo CI-minute/size limits. All to fix a problem that is actually an incomplete mirror, not Forgejo's existence. Contradicts the self-hosted-primary preference.
|
||||
- **GitHub canonical, Forgejo demoted to a DR pull-mirror.** Rejected for the same WAN-dependency and forge-migration cost; unnecessary once the real cause is understood.
|
||||
|
||||
## Consequences
|
||||
|
||||
- Divergence becomes structurally impossible — one push target per repo.
|
||||
- Forgejo stays load-bearing (canonical git + the Woodpecker forge), so every cost of the swap is avoided.
|
||||
- The GitHub-limits worry is neutralized: private code lives on Forgejo (unlimited, self-hosted); GitHub holds mirrors for CI + backup. (GitHub Free has unlimited private repos anyway; the real limits are GHA minutes and ~1 GB repo size — `travel_blog` at 1.4 GB is why it never went to GHA.)
|
||||
- One-time remediation is required and carries a data-loss footgun: the Forgejo→GitHub mirror **force-overwrites** GitHub, so for each currently-diverged repo, any GitHub-only commits must be merged into Forgejo **before** the mirror is enabled, or they are lost. Scope: the 14 dual-push repos + the `infra` reconciliation; all other repos are already single-remote and non-diverging.
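The pre-flight check described in the last consequence can be sketched as below. This is a minimal sketch under stated assumptions, not tooling named by the ADR: it assumes a clone that still carries both remotes under the names `forgejo` and `github`, a `master` branch on each, and already-fetched remote-tracking refs. The helper names are hypothetical.

```python
import subprocess


def github_only_commits(forgejo_shas, github_shas):
    """Return commits present on the GitHub side but missing from Forgejo.

    The order of the GitHub list is preserved so the result reads like `git log`.
    """
    known = set(forgejo_shas)
    return [sha for sha in github_shas if sha not in known]


def rev_list_argv(canonical="forgejo/master", mirror="github/master"):
    """Build the git command that lists commits reachable only from the mirror."""
    # `A..B` selects commits reachable from B but not from A.
    return ["git", "rev-list", f"{canonical}..{mirror}"]


def safe_to_enable_mirror(repo_dir):
    """True when the GitHub side holds nothing that Forgejo lacks."""
    out = subprocess.run(
        rev_list_argv(), cwd=repo_dir, check=True, capture_output=True, text=True
    ).stdout
    return out.strip() == ""
```

If `safe_to_enable_mirror` returns `False`, the listed commits would need merging into the Forgejo side first, since enabling the mirror force-overwrites GitHub.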
|
||||
|
|
@@ -108,6 +108,31 @@ All new users must use an invitation link to register. The invitation-enrollment
|
|||
|
||||
Group membership is auto-assigned from the invitation's `fixed_data` field. This prevents open registration while maintaining SSO convenience.
|
||||
|
||||
### TripIt External self-signup (open enrollment, fenced)
|
||||
|
||||
Unlike every other app, **TripIt allows open public self-signup** for people
|
||||
outside the homelab (ADR-0020 in the tripit repo; runbook
|
||||
`docs/runbooks/tripit-external-signup.md`). A dedicated public `tripit-enrollment`
|
||||
flow (email + passkey, no password) creates the account and stamps it into the
|
||||
parentless **`TripIt External`** group. Containment is two-layered:
|
||||
|
||||
- **Forward-auth apps**: a branch prepended to the `admin-services-restriction`
|
||||
catch-all policy admits `TripIt External` to `tripit.viktorbarzin.me` only and
|
||||
denies every other `auth="required"` host.
|
||||
- **OIDC apps**: that branch does NOT cover OIDC (OIDC bypasses forward-auth).
|
||||
External users are contained because every sensitive OIDC app already requires a
|
||||
trusted group they do not hold — audited 2026-06-15:
|
||||
Immich/Grafana/Linkwarden/Cloudflare Access → `Home Server Admins`, Forgejo →
|
||||
`Task Submitters`/`Forgejo Users`, Headscale → `Headscale Users`, wrongmove →
|
||||
`Wrongmove Users`. **Vault** was OPEN (any OIDC identity got a powerless
|
||||
`default`-policy token) and is bound to **`Allow Login Users`** as part of this
|
||||
change. The Kubernetes OIDC clients are OPEN but idle (apiserver rejects OIDC).
|
||||
|
||||
**Invariants**: keep `TripIt External` parentless (never under `Allow Login
|
||||
Users`); keep the catch-all branch first; never co-assign `TripIt External` to a
|
||||
trusted/internal user; the `tripit-enrollment` user_write "Create users group"
|
||||
setting is the keystone that tags every signup.
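The forward-auth containment and the co-assignment invariant can be modelled in a few lines. This is a plain-Python illustration of the decision logic only; it does not use Authentik's policy API, and the function names are hypothetical.

```python
TRIPIT_EXTERNAL = "TripIt External"
TRIPIT_HOST = "tripit.viktorbarzin.me"


def external_branch(user_groups, host):
    """Model of the branch prepended to the catch-all policy.

    Returns True (admit) or False (deny) for TripIt External users, and
    None when the branch does not apply and later rules should decide.
    """
    if TRIPIT_EXTERNAL not in user_groups:
        return None
    return host == TRIPIT_HOST


def violates_invariant(user_groups, trusted_groups):
    """Flag a user who holds TripIt External together with any trusted group."""
    groups = set(user_groups)
    return TRIPIT_EXTERNAL in groups and bool(groups & set(trusted_groups))
```

Returning `None` for everyone else mirrors the reason the branch must stay first: it only decides for external users and leaves all other requests to the existing rules.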
|
||||
|
||||
### OIDC Applications
|
||||
|
||||
Authentik provides OIDC for 10 applications:
|
||||
|
|
|
|||
|
|
@@ -2,334 +2,378 @@
|
|||
|
||||
## Overview
|
||||
|
||||
The CI/CD pipeline uses a hybrid approach: GitHub Actions for building Docker images (providing free compute for public repos) and Woodpecker CI for deployments (leveraging cluster-internal access). Git pushes trigger GHA builds that produce Docker images with 8-character SHA tags, push to DockerHub, then POST to Woodpecker's API to trigger deployments that update Kubernetes workloads via `kubectl set image`.
|
||||
**Doctrine (ADR-0002): all image builds and CI compute run OFF-infra.** Every
|
||||
owned image is built, tested, and linted on **GitHub Actions** (free on public
|
||||
repos; 2000 free min/mo on private) and pushed to **`ghcr.io/viktorbarzin/<name>`**.
|
||||
Woodpecker is **deploy-only** — a GHA job POSTs its API with the freshly-built
|
||||
image tag and Woodpecker runs `kubectl set image` from inside the cluster.
|
||||
There are **no in-cluster image builds or CI test runs anywhere** — the
|
||||
in-cluster Woodpecker buildkit and the fallback-build pattern were removed as a
|
||||
clean cut (ADR-0002, 2026-06-13). The Forgejo container registry is **frozen
|
||||
and emptied** — break-glass only.
|
||||
|
||||
This breaks the old circular dependency (images needed to repair the cluster
|
||||
used to be built and stored *inside* it) and keeps build IO + registry pushes
|
||||
off the homelab spindle.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
```mermaid
|
||||
graph LR
|
||||
A[Git Push] --> B[GitHub Actions]
|
||||
B --> C[Build Docker Image<br/>linux/amd64, 8-char SHA tag]
|
||||
C --> D[Push to DockerHub]
|
||||
D --> E[POST Woodpecker API]
|
||||
E --> F[Woodpecker Pipeline]
|
||||
F --> G[Vault K8s Auth<br/>SA JWT]
|
||||
G --> H[kubectl set image]
|
||||
H --> I[K8s Deployment]
|
||||
I --> J[Pull from DockerHub<br/>or Pull-Through Cache]
|
||||
A[git push Forgejo<br/>viktor/<repo> canonical] --> B[push-mirror sync_on_commit]
|
||||
B --> C[GitHub mirror<br/>ViktorBarzin/<repo>]
|
||||
C --> D[GitHub Actions<br/>.github/workflows/build.yml]
|
||||
D --> E[lint / test]
|
||||
E --> F[buildx linux/amd64<br/>provenance:false]
|
||||
F --> G[push ghcr.io/viktorbarzin/<name><br/>:sha8 + :latest]
|
||||
G --> H[svu tag -> Forgejo canonical]
|
||||
G --> I[POST Woodpecker deploy repo]
|
||||
I --> J[.woodpecker/deploy.yml<br/>event: manual]
|
||||
J --> K[kubectl set image<br/>in-cluster SA cluster-admin]
|
||||
K --> L[K8s Deployment<br/>pulls from ghcr]
|
||||
|
||||
K[Pull-Through Cache<br/>10.0.20.10] -.-> J
|
||||
L[forgejo.viktorbarzin.me<br/>Private Registry on Forgejo] -.-> J
|
||||
|
||||
style B fill:#2088ff
|
||||
style F fill:#4c9e47
|
||||
style K fill:#f39c12
|
||||
style D fill:#2088ff
|
||||
style J fill:#4c9e47
|
||||
style G fill:#f39c12
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
| Component | Version | Location | Purpose |
|
||||
|-----------|---------|----------|---------|
|
||||
| GitHub Actions | Cloud | `.github/workflows/build-and-deploy.yml` | Build Docker images, push to DockerHub |
|
||||
| Woodpecker CI | Self-hosted | `ci.viktorbarzin.me` | Deploy to Kubernetes cluster |
|
||||
| DockerHub | Cloud | `viktorbarzin/*` | Public image registry |
|
||||
| Private Registry | Forgejo Packages | `forgejo.viktorbarzin.me/viktor` | Private container images (PAT auth, retention CronJob) — migrated from registry.viktorbarzin.me 2026-05-07 |
|
||||
| Pull-Through Cache | Custom | `10.0.20.10:5000` (docker.io)<br/>`10.0.20.10:5010` (ghcr.io) | LAN cache for remote registries |
|
||||
| Kyverno | Cluster | `kyverno` namespace | Auto-sync registry credentials to all namespaces |
|
||||
| Vault | Cluster | `vault.viktorbarzin.me` | K8s auth for Woodpecker pipelines |
|
||||
| Component | Location | Purpose |
|
||||
|-----------|----------|---------|
|
||||
| GitHub Actions | `.github/workflows/build.yml` (per repo) | Build + lint + test + push image; trigger deploy; cut semver tag |
|
||||
| ghcr.io | `ghcr.io/viktorbarzin/*` | Container registry for ALL owned images (public + private packages) |
|
||||
| Woodpecker CI | `ci.viktorbarzin.me` | **Deploy-only** — `kubectl set image` in-cluster; plus infra applies + maintenance crons |
|
||||
| Forgejo | `forgejo.viktorbarzin.me/viktor/<repo>` | **Canonical** git source (push-mirrors to GitHub). Container registry **FROZEN** (break-glass only) |
|
||||
| Pull-Through Cache | `10.0.20.10:5000/5010/5020/5030/5040` | LAN cache for upstream registries (DockerHub, ghcr, Quay, k8s.gcr, Kyverno) |
|
||||
| Kyverno | `kyverno` namespace | Syncs `ghcr-credentials` (private-ghcr allowlist) + `registry-credentials` to namespaces |
|
||||
| Vault | `vault.viktorbarzin.me` | K8s auth for Woodpecker deploy pipelines; CI tokens in `secret/ci/global` + `secret/viktor` |
|
||||
|
||||
## How It Works
|
||||
|
||||
### Build Flow (GitHub Actions)
|
||||
### The fleet pattern (every owned app)
|
||||
|
||||
1. **Trigger**: Git push to main/master branch
|
||||
2. **Build**: GHA builds Docker image for `linux/amd64` platform only
|
||||
3. **Tag**: Image tagged with 8-character commit SHA (e.g., `viktorbarzin/app:a1b2c3d4`)
|
||||
- `:latest` tags are **never used** to prevent stale pull-through cache issues
|
||||
4. **Push**: Image pushed to DockerHub public registry
|
||||
5. **Trigger Deploy**: POST request to Woodpecker API with repo ID and commit SHA
|
||||
1. **Canonical source = Forgejo** `viktor/<repo>`. A **push-mirror**
|
||||
(`sync_on_commit`) pushes every commit to the GitHub mirror
|
||||
`ViktorBarzin/<repo>`. The `.github/workflows/build.yml` is committed on
|
||||
Forgejo and mirrors over.
|
||||
2. **GHA `build` job** (triggers `on: push: branches: [master]` ONLY — feature
|
||||
branches mirror but build/deploy nothing, the safety valve):
|
||||
- lint + test
|
||||
- `svu` computes the next `vX.Y.Z` from conventional commits and pushes the
|
||||
tag back to **canonical Forgejo** (GHA secret `FORGEJO_GIT_TOKEN` =
|
||||
write:repository PAT); `VERSION` is baked into the image
|
||||
- `docker buildx` `linux/amd64`, **`provenance: false`** (single-manifest —
|
||||
avoids the orphaned-index-children failure class), push
|
||||
`ghcr.io/viktorbarzin/<name>:<sha8>` + `:latest`
|
||||
- `delete-package-versions` keeps the newest ~10 ghcr versions
|
||||
3. **GHA `deploy` job** POSTs `ci.viktorbarzin.me/api/repos/<id>/pipelines`
|
||||
(the Woodpecker registration for the **GitHub mirror**, github-forge; GHA
|
||||
secret `WOODPECKER_TOKEN`) with `IMAGE_TAG` + `IMAGE_NAME`.
|
||||
4. **`.woodpecker/deploy.yml`** (event: **manual** only, so the raw
|
||||
Forgejo→GitHub mirror pushes don't fire a tag-less deploy) runs `kubectl set
|
||||
image deployment/<app> <container>=<image>` in-cluster. The `woodpecker-agent`
|
||||
SA is `cluster-admin`, so the `bitnami/kubectl` step needs no
|
||||
kubeconfig/RBAC. The Deployment image is in `lifecycle.ignore_changes`
|
||||
(`KEEL_IGNORE_IMAGE`) so the SHA tag sticks and `terragrunt apply` doesn't
|
||||
fight it. CronJobs in owned apps track `:latest` + `imagePullPolicy: Always`
|
||||
instead of a deploy step.
|
||||
|
||||
### Deploy Flow (Woodpecker CI)
|
||||
**Keel stays enrolled** as a redundant net (finds the deployed SHA already
|
||||
running → no-op).
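Steps 2 and 4 of the fleet pattern come down to one image reference and one command. The sketch below shows how the two fit together. It assumes `<sha8>` means the first eight characters of the commit SHA and that the deploy passes a namespace with `-n`; both are assumptions, and the helper names are hypothetical.

```python
def image_ref(name, commit_sha, registry="ghcr.io/viktorbarzin"):
    """Build the SHA-tagged image reference from the first 8 SHA characters."""
    if len(commit_sha) < 8:
        raise ValueError("expected a full commit SHA")
    return f"{registry}/{name}:{commit_sha[:8]}"


def set_image_argv(deployment, container, image, namespace):
    """Compose the `kubectl set image` command a deploy step would run."""
    return [
        "kubectl", "set", "image",
        f"deployment/{deployment}",
        f"{container}={image}",
        "-n", namespace,
    ]
```

For example, `set_image_argv("tripit", "tripit", image_ref("tripit", sha), "tripit")` yields the argument list for a rollout of that commit. The container and namespace names here are placeholders.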
|
||||
|
||||
1. **Receive Webhook**: Woodpecker API receives deployment trigger from GHA
|
||||
2. **Authenticate**: Pipeline uses Kubernetes ServiceAccount JWT to authenticate with Vault via K8s auth
|
||||
3. **Deploy**: `kubectl set image deployment/<name> <container>=viktorbarzin/<app>:<sha>`
|
||||
4. **Notify**: Slack notification on success/failure
|
||||
**Tooling**: `infra/scripts/offinfra-onboard` + `infra/scripts/offinfra-templates/`
|
||||
scaffold a repo onto this pattern (mirror, workflow, Woodpecker deploy repo,
|
||||
old-pipeline removal, default-branch flip). Mirror + workflow commits go via
|
||||
the Forgejo API over the internal Traefik LB
|
||||
(`curl --resolve forgejo.viktorbarzin.me:443:10.0.20.203`) since the devvm
|
||||
can't reach Forgejo's public hairpin.
|
||||
|
||||
### Project Migration Status
|
||||
### ghcr package visibility
|
||||
|
||||
**Migrated to GHA (8 projects)**:
|
||||
- Website
|
||||
- k8s-portal
|
||||
- claude-memory-mcp
|
||||
- apple-health-data
|
||||
- audiblez-web
|
||||
- plotting-book
|
||||
- insta2spotify
|
||||
- book-search (audiobook-search)
|
||||
| Visibility | Packages | Pull mechanism |
|
||||
|------------|----------|----------------|
|
||||
| **Public** | beadboard, nextcloud-todos, claude-agent-service, claude-memory-mcp, kms-website, freedify, tuya_bridge, x402-gateway, chrome-service-novnc, android-emulator | Anonymous |
|
||||
| **Private** | f1-stream, job-hunter, instagram-poster, payslip-ingest, wealthfolio-sync, fire-planner, recruiter-responder, tripit, infra-cli, infra-ci | `ghcr-credentials` dockerconfigjson |
|
||||
|
||||
**Woodpecker-native owned-app builds** (build + push to the Forgejo private
|
||||
registry + `kubectl set image` rollout, all in one `.woodpecker.yml`; Keel
|
||||
stays enrolled as a redundant net): `tuya_bridge`, `job-hunter`, `f1-stream`.
|
||||
`f1-stream` was extracted from this monorepo to `viktor/f1-stream` on
|
||||
2026-06-05 (Woodpecker repo id 166); the old github source is archived and its
|
||||
GHA-era Woodpecker repo (id 10) is deactivated.
|
||||
Private-image pulls use the `ghcr-credentials` dockerconfigjson, cloned by the
|
||||
kyverno stack's `sync-ghcr-credentials` ClusterPolicy to an explicit
|
||||
**ALLOWLIST** of private-ghcr namespaces only (NOT cluster-wide; source
|
||||
`stacks/kyverno/modules/kyverno/ghcr-credentials.tf`). Cred = Vault
|
||||
`secret/viktor/ghcr_pull_token` (a dedicated classic PAT scoped to
|
||||
`read:packages`, UI-minted 2026-06-15 — no longer the admin `github_pat` alias.
|
||||
GitHub has no token-mint API, so rotation is manual: re-mint the classic
|
||||
`read:packages` PAT → `vault kv patch secret/viktor ghcr_pull_token=…` →
|
||||
targeted apply `module.kyverno.kubernetes_secret.ghcr_credentials` (reads Vault;
|
||||
avoids the git-crypt `tls-secret-sync` landmine on a locked clone), which
|
||||
Kyverno then re-syncs to the allowlisted namespaces).
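For orientation, the payload of a `dockerconfigjson` Secret for a single registry looks roughly like the output of the sketch below. It follows the standard Docker config layout. The function name and the username argument are illustrative, and the real Secret is produced by the Terraform resource named above rather than by code like this.

```python
import base64
import json


def dockerconfigjson(username, token, registry="ghcr.io"):
    """Return the `.dockerconfigjson` payload for one registry."""
    # `auth` is base64("username:token"), as in a Docker config.json.
    auth = base64.b64encode(f"{username}:{token}".encode()).decode()
    return json.dumps(
        {"auths": {registry: {"username": username, "password": token, "auth": auth}}}
    )
```

Rotating the PAT changes only the `password` and `auth` fields, which is why a targeted apply of the one Secret is sufficient before Kyverno re-syncs it.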
|
||||
|
||||
**Woodpecker-only (infra + large apps)**:
|
||||
- `travel_blog`: 5.7GB content directory exceeds GHA limits
|
||||
- Infra pipelines: require cluster access (terragrunt apply, certbot, build-cli)
|
||||
### Migrated apps (issues #13–#27)
|
||||
|
||||
### Woodpecker Pipeline Files
|
||||
f1-stream, job-hunter, tuya_bridge, beadboard, nextcloud-todos,
|
||||
claude-agent-service, claude-memory-mcp, kms-website, Freedify,
|
||||
instagram-poster, payslip-ingest, broker-sync (image name `wealthfolio-sync`),
|
||||
fire-planner, recruiter-responder, x402-gateway — plus **tripit** (the original
|
||||
pilot, 2026-06-09). Earlier public-repo apps already on GHA (Website,
|
||||
k8s-portal, apple-health-data, audiblez-web, plotting-book, insta2spotify,
|
||||
audiobook-search, council-complaints) now also land on ghcr.
|
||||
|
||||
Each project contains:
|
||||
- `.woodpecker/deploy.yml`: kubectl set image + Slack notification
|
||||
- `.woodpecker/build-fallback.yml`: Legacy full build pipeline (event: deployment, never auto-fires)
|
||||
### Infra-owned images (issues #29 / #30)
|
||||
|
||||
### Woodpecker Repository IDs
|
||||
Images owned by the infra repo build on GHA workflows **in the infra repo's own
|
||||
`.github/workflows/`** (the github↔forgejo divergence was deliberately NOT
|
||||
reconciled — the workflows were added to the GitHub lineage via PR):
|
||||
|
||||
Woodpecker API uses numeric IDs (not owner/name):
|
||||
| Image | Workflow | Destination |
|
||||
|-------|----------|-------------|
|
||||
| chrome-service-novnc | `build-chrome-service-novnc.yml` | public `ghcr.io/viktorbarzin/chrome-service-novnc` |
|
||||
| android-emulator | `build-android-emulator.yml` | public `ghcr.io/viktorbarzin/android-emulator` |
|
||||
| infra CLI | `build-cli.yml` | DockerHub `viktorbarzin/infra` (kept) + `ghcr.io/viktorbarzin/infra-cli` |
|
||||
| infra-ci | `build-infra-ci.yml` | private `ghcr.io/viktorbarzin/infra-ci` |
|
||||
|
||||
| Repo | ID |
|
||||
|------|------|
|
||||
| infra | 1 |
|
||||
| Website | 2 |
|
||||
| finance | 3 |
|
||||
| health | 4 |
|
||||
| travel_blog | 5 |
|
||||
| webhook-handler | 6 |
|
||||
| audiblez-web | 9 |
|
||||
| plotting-book | 43 |
|
||||
| claude-memory-mcp | 78 |
|
||||
| infra-onboarding | 79 |
|
||||
**`infra-ci`** is the image the `.woodpecker/default.yml` apply step and
|
||||
`drift-detection.yml` run in (proven by pipelines 165/166). `chatterbox-tts` is
|
||||
already built by tripit's GHA → ghcr.
|
||||
|
||||
### Image Registry Flow
|
||||
The Woodpecker `build-ci-image.yml` and `build-cli.yml` pipelines were
|
||||
**REMOVED**. Break-glass for infra-ci is now a manual
|
||||
`.woodpecker/breakglass-infra-ci.yml` (ghcr pull-and-save to the registry VM).
|
||||
|
||||
1. **Containerd hosts.toml** redirects pulls from docker.io and ghcr.io to pull-through cache at `10.0.20.10`
|
||||
2. **Pull-through cache** serves cached images from LAN, fetches from upstream on cache miss
|
||||
3. **Kyverno ClusterPolicy** auto-syncs `registry-credentials` Secret to all namespaces for private registry access
|
||||
4. **Private registry** has been Forgejo's built-in OCI registry at `forgejo.viktorbarzin.me/viktor/<image>` since 2026-05-07. Auth via PAT (Vault `secret/ci/global/forgejo_push_token` for push, `secret/viktor/forgejo_pull_token` for pull). The pre-migration `registry:2.8.3`-based private registry on `registry.viktorbarzin.me:5050` was the root cause of three orphan-index incidents in three weeks (2026-04-13, 2026-04-19, 2026-05-04 — see `docs/post-mortems/2026-04-19-registry-orphan-index.md` and the full migration writeup at `docs/plans/2026-05-07-forgejo-registry-consolidation-{design,plan}.md`). The five pull-through caches on `10.0.20.10` (ports 5000/5010/5020/5030/5040) stay in place for upstream registries.
|
||||
5. **Integrity probe** (`registry-integrity-probe` CronJob in `monitoring` ns, every 15m) walks `/v2/_catalog` → tags → indexes → child manifests via HEAD and pushes `registry_manifest_integrity_failures` to Pushgateway; alerts `RegistryManifestIntegrityFailure` / `RegistryIntegrityProbeStale` / `RegistryCatalogInaccessible` page on broken state. Authoritative check (HTTP API, not filesystem).
|
||||
### Forgejo container registry — FROZEN
|
||||
|
||||
### Infra Pipelines (Woodpecker-only)
|
||||
Issue #32 wiped all `viktor/*` container packages (~19G reclaimed, `/data`
|
||||
58%→20%). The registry is **break-glass-only** now; nothing pushes to it. The
|
||||
`forgejo-cleanup` CronJob stays in `DRY_RUN` (nothing to clean). Pull-through
|
||||
caches on the registry VM (`10.0.20.10`) are unchanged. See
|
||||
`docs/runbooks/forgejo-registry-breakglass.md`.
|
||||
|
||||
### Image registry / pull path
|
||||
|
||||
1. **Containerd `hosts.toml`** redirects pulls from docker.io and ghcr.io to the
|
||||
pull-through cache at `10.0.20.10` (5000 = docker.io, 5010 = ghcr.io).
|
||||
2. **Pull-through cache** serves cached images from the LAN, fetches upstream on
|
||||
a miss.
|
||||
3. **Kyverno ClusterPolicies** sync `ghcr-credentials` (private-ghcr allowlist)
|
||||
and `registry-credentials` to namespaces.
|
||||
|
||||
## Woodpecker — what it still runs
|
||||
|
||||
Woodpecker is **deploy + cluster-touching steps only**:
|
||||
|
||||
| Pipeline | File | Purpose |
|
||||
|----------|------|---------|
|
||||
| default | `.woodpecker/default.yml` | Terragrunt apply on push |
|
||||
| renew-tls | `.woodpecker/renew-tls.yml` | Certbot renewal cron |
|
||||
| build-cli | `.woodpecker/build-cli.yml` | Build and push to dual registries |
|
||||
| build-ci-image | `.woodpecker/build-ci-image.yml` | Build `infra-ci` tooling image (triggered by `ci/Dockerfile` change or manual); post-push HEADs every blob via `verify-integrity` step to catch orphan-index pushes |
| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered build for k8s-portal subdirectory |
| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` to `/opt/registry/` on `10.0.20.10` when any managed file changes; bounces containers + nginx per `docs/runbooks/registry-vm.md` |
| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE host |
| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new `docs/post-mortems/*.md` via headless Claude agent |
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift detection |
| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues |
| per-app deploy | `.woodpecker/deploy.yml` (each repo) | `kubectl set image` + Slack notify (event: **manual**) |
| terragrunt apply | `.woodpecker/default.yml` | Changed-stacks apply on push to master (runs in `infra-ci`) |
| certbot | `.woodpecker/renew-tls.yml` | TLS renewal cron |
| drift-detection | `.woodpecker/drift-detection.yml` | Nightly Terraform drift (runs in `infra-ci`) |
| provision-user | `.woodpecker/provision-user.yml` | Add namespace-owner user from Vault spec |
| registry-config-sync | `.woodpecker/registry-config-sync.yml` | SCP `modules/docker-registry/*` → `10.0.20.10` on change |
| pve-nfs-exports-sync | `.woodpecker/pve-nfs-exports-sync.yml` | Sync `scripts/pve-nfs-exports` → `/etc/exports` on PVE |
| issue-automation | `.woodpecker/issue-automation.yml` | Triage + respond to `ViktorBarzin/infra` GitHub issues |
| postmortem-todos | `.woodpecker/postmortem-todos.yml` | Auto-resolve safe TODOs from new post-mortems |
| k8s-portal | `.woodpecker/k8s-portal.yml` | Path-filtered deploy for the portal |
| breakglass-infra-ci | `.woodpecker/breakglass-infra-ci.yml` | **Manual** ghcr pull-and-save of infra-ci to the registry VM |
|
||||
|
||||
**No build/test pipeline exists on any repo.** Do not (re)introduce one.
|
||||
|
||||
### Woodpecker API
|
||||
|
||||
Uses **numeric repo IDs** (`/api/repos/<id>/pipelines`), NOT owner/name paths
|
||||
(those return HTML). The deploy registration for each app is the **GitHub
|
||||
mirror** repo (registered github-forge). IDs are stable across renames and must
|
||||
be looked up from the Woodpecker UI/DB.
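
A minimal sketch of the numeric-ID rule, assuming plain POSIX shell. The helper name `pipeline_url` is hypothetical (not part of the repo); it only builds the endpoint path described above and refuses owner/name slugs early, instead of letting the API answer with HTML.

```bash
# Hypothetical helper: build the Woodpecker pipelines endpoint for a numeric repo id.
# Usage: pipeline_url <base-url> <numeric-repo-id>
pipeline_url() {
  base=${1%/}                      # tolerate a trailing slash on the base URL
  case "$2" in
    ''|*[!0-9]*)                   # empty, or contains a non-digit (e.g. owner/name)
      echo "pipeline_url: repo id must be numeric, got '$2'" >&2
      return 1
      ;;
  esac
  printf '%s/api/repos/%s/pipelines\n' "$base" "$2"
}
```

A caller would pass the result to the same authenticated `curl -X POST` this document already uses for deploy triggers, with the ID looked up from the Woodpecker UI or DB.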
|
||||
|
||||
### Woodpecker YAML gotchas
|
||||
|
||||
- Commands with `${VAR}:${VAR}` must be **quoted** — an unquoted `:` triggers
|
||||
YAML map parsing when the vars are empty.
|
||||
- Use `bitnami/kubectl:latest` (not pinned versions — entrypoint compatibility).
|
||||
- Global secrets must include `manual` in their events list for API-triggered
|
||||
pipelines.
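
Quoting fixes the YAML parse, but an empty variable would still produce a bare `:` image reference at runtime. The following is a sketch of a shell-side guard, under the assumption that `IMAGE_NAME` and `IMAGE_TAG` reach the step as ordinary environment variables; `image_ref` is a hypothetical helper name, not something the pipelines define.

```bash
# Hypothetical guard: emit "<name>:<tag>" only when both parts are non-empty.
image_ref() {
  if [ -z "${1:-}" ] || [ -z "${2:-}" ]; then
    echo "image_ref: IMAGE_NAME and IMAGE_TAG must both be non-empty" >&2
    return 1
  fi
  printf '%s:%s\n' "$1" "$2"
}
```

A deploy command could then fail fast with something like `ref=$(image_ref "$IMAGE_NAME" "$IMAGE_TAG") || exit 1` before calling `kubectl set image`.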
|
||||
|
||||
### GitHub repo secrets
|
||||
|
||||
Per repo: `WOODPECKER_TOKEN` (POST the deploy pipeline), `FORGEJO_GIT_TOKEN`
|
||||
(write:repository PAT for the `svu` tag push). ghcr push uses the workflow's
|
||||
built-in `GITHUB_TOKEN` (`packages: write`).
|
||||
|
||||
## Infra repo CI topology
|
||||
|
||||
The infra repo runs on Woodpecker via **two** forge registrations: the Forgejo
|
||||
forge (repo id 82, registered 2026-06-08) and the legacy GitHub forge (repo id
|
||||
1). Pushes to **Forgejo** `master` fire `.woodpecker/default.yml`
|
||||
(changed-stacks terragrunt apply, in `infra-ci`) plus the `notify-nonadmin-push`
|
||||
Slack audit step. Operational facts (2026-06-10):
|
||||
|
||||
- **Webhook URL is the IN-CLUSTER service**:
|
||||
`http://woodpecker-server.woodpecker.svc.cluster.local/api/hook?...` (PATCHed
|
||||
via the Forgejo API). The Woodpecker default (`https://ci.viktorbarzin.me/...`)
|
||||
resolves to the non-proxied public A record from pods → NAT hairpin →
|
||||
intermittent `context deadline exceeded`, silently dropping push events. If
|
||||
Woodpecker "repairs" the repo it rewrites the hook back to `ci.viktorbarzin.me`
|
||||
— re-apply the in-cluster URL.
|
||||
- **Repo-scoped secrets must exist on BOTH repos**: pipelines reference
|
||||
repo-level secrets (`registry_ssh_key`, `pve_ssh_key`, `CLOUDFLARE_TOKEN`, …).
|
||||
When registering a new forge repo for infra, clone the secret set too.
|
||||
- **Empty commits defeat path filters**: a commit with no changed files makes
|
||||
Woodpecker include ALL workflow files (path conditions can't exclude), so every
|
||||
repo secret must resolve. Normal commits with real files only compile the
|
||||
matching workflows.
|
||||
|
||||
The Forgejo trigger is not fully dependable — land infra changes by pushing
|
||||
Forgejo master (as viktor), use `[ci skip]` for docs/no-op commits, and verify
|
||||
deploys via `scripts/tg` + live cluster state rather than trusting the CI
|
||||
checkmark. The two remotes have **diverged** (parallel histories under
|
||||
different SHAs); expect github pushes to reject non-fast-forward and leave them
|
||||
— never force-push.
|
||||
|
||||
## Configuration
|
||||
|
||||
### GitHub Actions
|
||||
|
||||
**File**: `.github/workflows/build-and-deploy.yml`
|
||||
### GitHub Actions (per-app `.github/workflows/build.yml`)
|
||||
|
||||
```yaml
|
||||
name: Build and Deploy
|
||||
name: build
|
||||
on:
|
||||
push:
|
||||
branches: [main, master]
|
||||
branches: [master]
|
||||
jobs:
|
||||
build:
|
||||
runs-on: ubuntu-latest
|
||||
permissions:
|
||||
contents: write # svu tag push
|
||||
packages: write # ghcr push
|
||||
steps:
|
||||
- name: Build Docker image
|
||||
run: docker build --platform linux/amd64 -t viktorbarzin/app:${SHORT_SHA} .
|
||||
- name: Push to DockerHub
|
||||
run: docker push viktorbarzin/app:${SHORT_SHA}
|
||||
- name: Trigger Woodpecker Deploy
|
||||
- uses: actions/checkout@v4
|
||||
- name: lint + test
|
||||
run: make lint test
|
||||
- name: svu tag -> Forgejo
|
||||
run: |
|
||||
curl -X POST https://ci.viktorbarzin.me/api/repos/<REPO_ID>/pipelines \
|
||||
-H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}"
|
||||
VERSION=$(svu next)
|
||||
# ... push tag to canonical Forgejo with FORGEJO_GIT_TOKEN
|
||||
- uses: docker/setup-buildx-action@v3
|
||||
- uses: docker/build-push-action@v6
|
||||
with:
|
||||
platforms: linux/amd64
|
||||
provenance: false
|
||||
push: true
|
||||
tags: |
|
||||
ghcr.io/viktorbarzin/<name>:${{ github.sha }}
|
||||
ghcr.io/viktorbarzin/<name>:latest
|
||||
deploy:
|
||||
needs: build
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- name: Trigger Woodpecker deploy
|
||||
run: |
|
||||
curl -X POST https://ci.viktorbarzin.me/api/repos/<DEPLOY_REPO_ID>/pipelines \
|
||||
-H "Authorization: Bearer ${{ secrets.WOODPECKER_TOKEN }}" \
|
||||
-d '{"branch":"master","variables":{"IMAGE_TAG":"...","IMAGE_NAME":"..."}}'
|
||||
```
|
||||
|
||||
**Required GitHub Secrets**:
|
||||
- `DOCKERHUB_USERNAME`
|
||||
- `DOCKERHUB_TOKEN`
|
||||
- `WOODPECKER_TOKEN`
|
||||
|
||||
### Woodpecker Deploy Pipeline
|
||||
|
||||
**File**: `.woodpecker/deploy.yml`
|
||||
### Woodpecker deploy pipeline (per-app `.woodpecker/deploy.yml`)
|
||||
|
||||
```yaml
|
||||
when:
|
||||
event: [deployment]
|
||||
event: manual
|
||||
|
||||
steps:
|
||||
deploy:
|
||||
image: bitnami/kubectl:latest
|
||||
image: bitnami/kubectl:latest # uses the in-cluster woodpecker-agent SA (cluster-admin)
|
||||
commands:
|
||||
- kubectl set image deployment/app app=viktorbarzin/app:${CI_COMMIT_SHA:0:8}
|
||||
secrets: [k8s_token]
|
||||
|
||||
- "kubectl set image deployment/app app=${IMAGE_NAME}:${IMAGE_TAG} -n <ns>"
|
||||
- "kubectl rollout status deployment/app -n <ns> --timeout=300s"
|
||||
notify:
|
||||
image: plugins/slack
|
||||
settings:
|
||||
webhook: ${SLACK_WEBHOOK}
|
||||
when:
|
||||
status: [success, failure]
|
||||
```
|
||||
|
||||
**YAML Gotchas**:
|
||||
- Commands with `${VAR}:${VAR}` syntax must be quoted to prevent YAML map parsing when vars are empty
|
||||
- Use `bitnami/kubectl:latest` (not pinned versions)
|
||||
- Global secrets must be manually added to `secrets:` list in pipeline
|
||||
### CI/CD secrets sync
|
||||
|
||||
### Vault Configuration
|
||||
|
||||
**K8s Auth for Woodpecker**:
|
||||
- Woodpecker pipelines authenticate using ServiceAccount JWT
|
||||
- Vault K8s auth mount validates JWT and issues token
|
||||
- Policies grant access to secrets and dynamic credentials
|
||||
|
||||
### CI/CD Secrets Sync
|
||||
|
||||
**CronJob**: Pushes `secret/ci/global` from Vault → Woodpecker API every 6 hours
|
||||
- Keeps Woodpecker global secrets in sync with Vault
|
||||
- Runs in `woodpecker` namespace
|
||||
|
||||
## Infra repo CI (Woodpecker repo 82 — Forgejo forge)
|
||||
|
||||
The infra repo itself runs on Woodpecker via the **Forgejo** forge (repo id 82,
|
||||
registered 2026-06-08; the GitHub-side repo id 1 also remains registered).
|
||||
Pushes to `master` fire `.woodpecker/default.yml` (changed-stacks terragrunt
|
||||
apply) plus the `notify-nonadmin-push` Slack audit step (allow-then-audit
|
||||
contribution model — see `multi-tenancy.md`). Operational facts (2026-06-10):
|
||||
|
||||
- **Webhook URL is the IN-CLUSTER service**: `http://woodpecker-server.woodpecker.svc.cluster.local/api/hook?...`
|
||||
(PATCHed via the Forgejo API). The Woodpecker-generated default
|
||||
(`https://ci.viktorbarzin.me/...`) resolves to the non-proxied public A
|
||||
record from pods → NAT hairpin → intermittent `context deadline exceeded`,
|
||||
silently dropping push events (found when a push produced no pipeline).
|
||||
If Woodpecker ever "repairs" the repo it will rewrite the hook back to
|
||||
`ci.viktorbarzin.me` — re-apply the in-cluster URL (or pin `ci.viktorbarzin.me`
|
||||
in the CoreDNS pod carve-out alongside forgejo).
|
||||
- **Repo-scoped secrets must exist on BOTH repos**: pipelines reference
|
||||
repo-level secrets (`registry_ssh_key`, `pve_ssh_key`, `CLOUDFLARE_TOKEN`,
|
||||
…). Repo 82 was registered without them and every all-workflow compile
|
||||
errored with `secret "registry_ssh_key" not found`. Fixed by cloning repo-1
|
||||
rows to repo 82 in the Woodpecker DB (`insert into secrets … select … where
|
||||
repo_id=1`). When registering a new forge repo for infra, clone the secret
|
||||
set too.
|
||||
- **Empty commits defeat path filters**: a commit with no changed files makes
|
||||
Woodpecker include ALL workflow files (path conditions can't exclude), so
|
||||
every repo secret must resolve. Normal commits with real files only compile
|
||||
the matching workflows.
|
||||
A CronJob in the `woodpecker` namespace pushes `secret/ci/global` from Vault →
|
||||
the Woodpecker API every 6h, keeping global secrets in sync. Woodpecker deploy
|
||||
pipelines authenticate to the cluster via the in-cluster `woodpecker-agent` SA
|
||||
(cluster-admin); Vault K8s auth backs any secret reads.
|
||||
|
||||
## Decisions & Rationale
|
||||
|
||||
### Why GitHub Actions + Woodpecker?
|
||||
### Why all builds off-infra (ADR-0002)?
|
||||
|
||||
**Alternatives considered**:
|
||||
1. **Woodpecker-only**: Simple, but wastes cluster resources on builds
|
||||
2. **GHA-only**: No cluster access, requires kubectl from outside (security risk)
|
||||
3. **Hybrid (chosen)**: GHA for compute-heavy builds (free), Woodpecker for privileged deployments (secure cluster access)
|
||||
- **Breaks the circular dependency** — the images needed to repair the cluster
|
||||
no longer live inside it (they're on ghcr, an external registry).
|
||||
- **Removes build IO + registry push load** from the contended homelab spindle.
|
||||
- GHA is free on public repos and generous on private; buildx provenance:false
|
||||
sidesteps the orphaned-index-children failure class that plagued the
|
||||
in-cluster registry.
|
||||
- **Clean cut** — no in-cluster fallback builds anywhere; one pattern,
|
||||
fleet-wide.
|
||||
|
||||
**Benefits**:
|
||||
- Free compute for builds on public repos
|
||||
- Cluster access stays internal (Woodpecker has direct K8s access)
|
||||
- Separation of concerns: build vs deploy
|
||||
### Why ghcr (not push back to Forgejo)?
|
||||
|
||||
### Why 8-Character SHA Tags (Not :latest)?
|
||||
Forgejo's container registry repeatedly orphaned OCI index children
|
||||
(2026-04-13/19, 2026-05-04, 2026-06-10) and its retention is not container-aware.
|
||||
ghcr is external (DR-safe), free for this scale, and has native multi-arch
|
||||
handling. The Forgejo registry was frozen + emptied (issue #32).
|
||||
|
||||
- Pull-through cache serves stale `:latest` tags indefinitely
|
||||
- SHA tags ensure every deployment pulls the correct image
|
||||
- 8 characters provide sufficient collision resistance (16^8 = 4.3 billion combinations)
|
||||
### Why Woodpecker stays for deploy?
|
||||
|
||||
### Why Numeric Repo IDs for Woodpecker API?
|
||||
`kubectl set image` needs in-cluster privileged access; doing it from GHA would
|
||||
mean exposing kube-apiserver or a long-lived kubeconfig. Woodpecker's
|
||||
`woodpecker-agent` SA is already cluster-admin in-cluster — the deploy step
|
||||
needs no credentials.
|
||||
|
||||
- Woodpecker API requires numeric IDs (not owner/name slugs)
|
||||
- IDs are stable across repo renames
|
||||
- Must be manually looked up from Woodpecker UI or database
|
||||
### Why `event: manual` on deploy.yml?
|
||||
|
||||
### Why linux/amd64 Only?
|
||||
The Forgejo→GitHub push-mirror sends raw, tag-less pushes to the GitHub mirror.
|
||||
If `deploy.yml` fired on `push`, every mirror sync would trigger a deploy with no
|
||||
image tag. `manual` means only the GHA `deploy` job's explicit API POST (with
|
||||
`IMAGE_TAG`) deploys.
|
||||
|
||||
- Cluster runs on x86_64 nodes only
|
||||
- ARM builds would waste time and storage
|
||||
- Multi-arch images add complexity without benefit
|
||||
### Why linux/amd64 only?
|
||||
|
||||
The cluster runs on x86_64 nodes only; ARM builds waste time and storage.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### GHA Build Fails: "denied: requested access to the resource is denied"
|
||||
### GHA build fails: ghcr push "denied"
|
||||
|
||||
**Cause**: DockerHub credentials expired or incorrect
|
||||
The workflow `GITHUB_TOKEN` needs `packages: write` permission and the package
|
||||
must allow the repo to push. Check the workflow `permissions:` block and the
|
||||
package's "Manage Actions access" settings.
|
||||
|
||||
### Image pull fails: "ErrImagePull" / "ImagePullBackOff"
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Regenerate DockerHub token
|
||||
# Update GitHub repo secrets: DOCKERHUB_USERNAME, DOCKERHUB_TOKEN
|
||||
# Public image — check the pull-through cache is up
|
||||
curl http://10.0.20.10:5010/v2/_catalog
|
||||
|
||||
# Private image — verify the ghcr-credentials Secret exists in the namespace
|
||||
kubectl get secret ghcr-credentials -n <namespace>
|
||||
# It's Kyverno-synced to an allowlist; if missing, the namespace isn't on the
|
||||
# allowlist in stacks/kyverno/modules/kyverno/ghcr-credentials.tf
|
||||
```
|
||||
|
||||
### Woodpecker Deploy Fails: "Unauthorized"
|
||||
If the cause is the internal-DNS hairpin (fresh pulls timing out on the public
|
||||
Forgejo path), see the CoreDNS `viktorbarzin.me` carve-out in
|
||||
`docs/architecture/networking.md` and `docs/runbooks/registry-vm.md`.
|
||||
|
||||
**Cause**: Vault K8s auth token expired or invalid
|
||||
### Deploy didn't happen after a push
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Restart Woodpecker pipeline (token auto-renewed)
|
||||
# Check Vault K8s auth role exists: vault read auth/kubernetes/role/woodpecker-deployer
|
||||
```
|
||||
Confirm the push was to **master** (feature branches build/deploy nothing).
|
||||
Check the GHA run completed the `deploy` job, then check Woodpecker received the
|
||||
manual pipeline (`ci.viktorbarzin.me`, the GitHub-mirror deploy repo). Verify
|
||||
live with `kubectl rollout status` — not the CI checkmark.
|
||||
|
||||
### Image Pull Fails: "ErrImagePull"
|
||||
### Woodpecker deploy fails: "YAML: did not find expected key"
|
||||
|
||||
**Cause**: Pull-through cache or registry credentials issue
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check pull-through cache is running
|
||||
curl http://10.0.20.10:5000/v2/_catalog
|
||||
|
||||
# Verify registry-credentials Secret exists in namespace
|
||||
kubectl get secret registry-credentials -n <namespace>
|
||||
|
||||
# Manually sync credentials if missing
|
||||
kubectl get secret registry-credentials -n default -o yaml | \
|
||||
sed 's/namespace: default/namespace: <namespace>/' | kubectl apply -f -
|
||||
```
|
||||
|
||||
### Woodpecker Pipeline: "YAML: did not find expected key"
|
||||
|
||||
**Cause**: Unquoted command with `${VAR}:${VAR}` syntax when VAR is empty
|
||||
|
||||
**Fix**: Quote the command:
|
||||
```yaml
|
||||
commands:
|
||||
- "kubectl set image deployment/app app=viktorbarzin/app:${SHORT_SHA}"
|
||||
```
|
||||
|
||||
### travel_blog Build Times Out on GHA
|
||||
|
||||
**Cause**: 5.7GB content directory exceeds GHA disk/time limits
|
||||
|
||||
**Fix**: Keep on Woodpecker (no migration). Build uses cluster storage and resources.
|
||||
|
||||
### CI/CD Secrets Out of Sync
|
||||
|
||||
**Cause**: CronJob failed to sync Vault → Woodpecker
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check CronJob status
|
||||
kubectl get cronjob -n woodpecker
|
||||
|
||||
# Manually trigger sync
|
||||
kubectl create job --from=cronjob/sync-secrets manual-sync -n woodpecker
|
||||
```
|
||||
Unquoted command with `${VAR}:${VAR}` syntax when a VAR is empty. Quote the
|
||||
command (see the deploy.yml example above).
|
||||
|
||||
## Related
|
||||
|
||||
- [Databases Architecture](./databases.md) — Database credentials via Vault
|
||||
- [Multi-Tenancy](./multi-tenancy.md) — Per-user Woodpecker access
|
||||
- Runbook: `../runbooks/deploy-new-app.md` — How to set up CI/CD for a new app
|
||||
- Runbook: `../runbooks/troubleshoot-image-pull.md` — Debug image pull issues
|
||||
- Vault documentation: K8s auth configuration
|
||||
- Woodpecker documentation: API reference
|
||||
- ADR: `../adr/0002-all-image-builds-off-infra-gha-ghcr.md` — the decision
|
||||
- [Databases Architecture](./databases.md) — database credentials via Vault
|
||||
- [Multi-Tenancy](./multi-tenancy.md) — per-user Woodpecker access
|
||||
- Runbook: `../runbooks/forgejo-registry-breakglass.md` — using the frozen registry
|
||||
- Runbook: `../runbooks/registry-vm.md` — pull-through cache VM + image-pull debugging
|
||||
- Onboarding tool: `../../scripts/offinfra-onboard` + `../../scripts/offinfra-templates/`
|
||||
|
|
|
|||
|
|
@ -543,6 +543,10 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1
|
|||
|
||||
**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked. **The managed config self-deploys from the repo** (2026-06-10): the hourly reconcile's `sync_managed_config` installs `scripts/workstation/managed-settings.json` to `/etc/claude-code/` whenever the repo copy changes — so editing the claudeMd = edit + commit, no manual install — and `refresh_codex_mirror` regenerates each user's `~/.codex/AGENTS.md` (a static mirror of the claudeMd; only files carrying the mirror header are touched, user-customized ones are left alone). Repo-level guidance (`.claude/CLAUDE.md`, `AGENTS.md`, `CONTEXT.md` in the infra repo) reaches non-admins through their auto-freshened clones — commit + push and every user has it within the hour.
|
||||
|
||||
**Onboarding state self-heals (2026-06-15):** `~/.claude.json` is a single file that ALL of a user's concurrent `claude` processes (the ttyd terminal + their `t3-serve` instance + agent/SDK sessions) read-modify-write, so a stale writer periodically drops top-level keys — including `hasCompletedOnboarding` — which bounces the next *interactive* session back to the first-run "Choose the text style" wizard even though the user is fully logged in (credentials live in the SEPARATE `~/.claude/.credentials.json`, untouched by the race; first observed for emo 2026-06-15). The launcher (`skel/start-claude.sh`) now idempotently re-asserts `hasCompletedOnboarding` (+ `lastOnboardingVersion`) in `~/.claude.json` right before it runs `claude` — merge-only, never clobbers other keys, no-op if jq is missing or the file is empty/corrupt. And since the launcher is a per-user copy that `/etc/skel` only seeds at account creation, the reconcile's new `deploy_user_launcher` step re-copies `skel/start-claude.sh` into every non-admin home (copy-if-changed) so launcher edits now reach EXISTING users within the hour — `.tmux.conf` is deliberately NOT re-copied (terminal-lobby appends its own managed section to it).
|
||||
|
||||
**Claude Code runtime — native, per-user (2026-06-15):** `claude` is the **native** install (`~/.local/bin/claude` → `~/.local/share/claude/versions/<v>`, self-updating; `installMethod: native`) — NOT npm-global or npx. It is the runtime for both the ttyd launcher and each `t3-serve` instance. `setup-devvm.sh` installs node ONLY for the `t3` CLI (not claude); per-user native claude is provisioned by the reconcile's `install_user_claude_native` (covers terminal + t3, idempotent, skip-if-present) and self-bootstrapped by `start-claude.sh` on first launch — both via the official `https://claude.ai/install.sh`. The legacy machine-wide `npm install -g @anthropic-ai/claude-code` bootstrap and the launcher's `npx` fallback were removed; existing users had already auto-migrated to native, and the npm-global dir was empty. **PATH (`~/.local/bin`, where the native binary lives):** ensured three ways — `/etc/profile.d/10-local-bin.sh` for login shells (machine-wide, fresh-user-safe), `start-claude.sh` itself (the launcher runs in tmux's non-login env that skips the user's shell rc), and `t3-serve@.service` (`Environment=PATH=…:/home/%i/.local/bin`).
|
||||
|
||||
**Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. Its location depends on the per-user `code_layout` in `roster.yaml`: `single` (default) puts the clone AT `~/code`; `workspace` makes `~/code` a plain directory of per-project clones — the infra clone at `~/code/infra` plus each roster `repos` entry cloned from Forgejo `viktor/<name>` **as the user** (their PAT authenticates, so private repos work; clone failures WARN and retry next hour). Flipping a user to `workspace` auto-migrates their existing `~/code` clone to `~/code/infra` (local branches/dirty state survive; running processes follow the moved inode). ancamilea = workspace + `tripit` since 2026-06-10. The provisioner clones infra anonymously from the public GitHub mirror; **contribute access is wired per-user on top** (see below). The apply boundary still holds (`scripts/tg apply` needs an admin Vault token + cluster RBAC), but **pushing `master` is NOT inert** — the Forgejo→Woodpecker webhook fires `.woodpecker/default.yml` (`event: push, branch: master`, `require_approval: forks` only), which terragrunt-applies changed stacks. `master` is **branch-protected on Forgejo** (force-push disabled for everyone — history is append-only; push + merge whitelists = `viktor` + explicitly granted users, deploy keys allowed). **Allow-then-audit (Viktor, 2026-06-10):** `ebarzin` (emo) is on the whitelist and pushes straight to `master` — no PR gate. 
The tracking burden moves to: (a) **commit messages that record what + why** (the agent instructions in AGENTS.md and the managed claudeMd require the body to paraphrase the user's request), (b) the **`notify-nonadmin-push` Slack audit step** in `.woodpecker/default.yml` — every master push by a non-admin author is posted to Slack (admin pushes are not), and (c) non-admins **never use `[ci skip]`** so every change fires the pipeline (and thus the audit feed). Users NOT on the whitelist fall back to `<user>/<topic>` branches + PRs. **Clones stay fresh automatically** (2026-06-10): the hourly `t3-provision-users` reconcile runs `refresh_user_clone` over every managed clone — the infra clone and any workspace repos (fetch all remotes + fast-forward `master`, ONLY when on master with a clean tree and an upstream — dirty trees and local commits are left alone with a WARN) — and also `wire_forgejo_remote`, which idempotently adds the documented `forgejo` remote + `forgejo/master` upstream to infra clones that predate that contract. `start-claude.sh` does the same freshen at session launch (10s fetch cap per repo so an offline remote never stalls the session; workspace layouts freshen each repo under `~/code`).
|
||||
|
||||
**Contribute access (per non-admin, manual — the anca/tripit PAT precedent):**
|
||||
|
|
|
|||
226
docs/runbooks/tripit-external-signup.md
Normal file
|
|
@ -0,0 +1,226 @@
|
|||
# Runbook — TripIt external user self-signup (email + passkey)
|
||||
|
||||
Implements ADR-0020 (tripit repo): people outside the homelab self-register to
TripIt with **email + a passkey** (no password), are auto-tagged into the
**`TripIt External`** Authentik group, and are fenced to `tripit.viktorbarzin.me`
only. The intended audience is people Viktor knows, but registration itself is
open to the public.
|
||||

> **Safety model.** Containment is two-layered. (1) **Forward-auth apps**: the
> branch in `stacks/authentik/admin-services-restriction.tf` admits
> `TripIt External` to `tripit.viktorbarzin.me` and denies every other
> `auth="required"` host. (2) **OIDC apps**: the branch does NOT cover OIDC
> (OIDC logins bypass forward-auth); External users are contained because every
> sensitive OIDC app already requires a trusted group they do not hold (audit
> below). The no-lockout guarantee is that the group is created **empty**, so
> the new branch matches zero existing users on day one.
|
||||
|
||||
## OIDC app authorization audit (2026-06-15, read-only)
|
||||
|
||||
A parentless `TripIt External` user holds NONE of these groups, so:
|
||||
|
||||
| OIDC app | Requires | External user |
|---|---|---|
| Immich, Grafana, Linkwarden, Cloudflare Access | `Home Server Admins` | DENIED ✓ |
| Forgejo | `Task Submitters` / `Forgejo Users` | DENIED ✓ |
| Headscale | `Headscale Users` | DENIED ✓ |
| wrongmove | `Wrongmove Users` | DENIED ✓ |
| **Vault** | **was OPEN** → bound to `Allow Login Users` in Step 3 | DENIED after Step 3 |
| Kubernetes, Kubernetes Dashboard | OPEN | harmless: apiserver rejects OIDC tokens (idle) |
| TripIt App, Public | OPEN | by design (TripIt's own provider / guest) |
|
||||
|
||||
Vault's JWT `default` role grants only Vault's built-in `default` policy (token
|
||||
self-management, cubbyhole — **no** secret access), so the pre-fix exposure was a
|
||||
near-powerless token; Step 3 closes it anyway.
|
||||
|
||||
---
|
||||
|
||||
## Pre-flight gates (STOP if any fails)
|
||||
|
||||
1. **`TripIt External` is net-new / empty** (no-lockout precondition):
   ```
   kubectl -n authentik exec -i deploy/goauthentik-server -- ak shell <<'PY'
   from authentik.core.models import Group
   g = Group.objects.filter(name="TripIt External").first()
   print("exists:", bool(g), "members:", g.users.count() if g else 0)
   PY
   ```
   Expect `exists: False`. If it exists with members → STOP.
2. **Authentik image pin matches live (B5)**: the policy edit auto-applies the
   whole `authentik` stack; a stale pin re-triggers the 2026-06-10 downgrade
   boot-storm:
   ```
   kubectl -n authentik get deploy -o 'custom-columns=N:.metadata.name,IMG:.spec.template.spec.containers[0].image'
   ```
   Every `goauthentik`/`ak-outpost` image tag MUST equal
   `stacks/authentik/modules/authentik/values.yaml` `global.image.tag`
   (currently `2026.2.4`). If they differ → refresh the pin first.
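
The tag comparison in gate 2 can be sketched as a small shell check rather than eyeballed. This assumes plain `name:tag` or `name:tag@digest` image references; the helper names `image_tag` and `all_match_pin` are hypothetical, and any image names passed to them are up to the caller.

```bash
# Hypothetical helper: print the tag of an image reference ("latest" if none).
# Strips an optional @digest and ignores a registry port such as host:5000/...
image_tag() {
  ref=${1%@*}          # drop "@sha256:..." if present
  last=${ref##*/}      # keep only the final path component
  case "$last" in
    *:*) printf '%s\n' "${last##*:}" ;;
    *)   echo latest ;;
  esac
}

# Hypothetical helper: succeed only if every image after the pin carries that tag.
all_match_pin() {
  pin=$1
  shift
  for img in "$@"; do
    if [ "$(image_tag "$img")" != "$pin" ]; then
      echo "MISMATCH: $img (want tag $pin)" >&2
      return 1
    fi
  done
}
```

One way to use it is to feed the `IMG` column of the `kubectl get deploy` output above into `all_match_pin 2026.2.4 ...` and stop on a non-zero exit.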
|
||||
|
||||
---
|
||||
|
||||
## Step 1 — Terraform (group + fence branch)
|
||||
|
||||
Already written on this branch:
|
||||
- `stacks/authentik/tripit-external.tf` — the empty, parentless group.
|
||||
- `stacks/authentik/admin-services-restriction.tf` — the prepended fence branch.
|
||||
|
||||
**Local plan gate (B4: CI auto-applies on push with `-auto-approve`, so there is
NO human plan review in the apply path; do it here):**
```
vault login -method=oidc
cd stacks/authentik && ../../scripts/tg plan
```
|
||||
Confirm the plan is **exactly**:
- `+ authentik_group.tripit_external` (create)
- `~ authentik_policy_expression.admin_services_restriction` (update in place: the
  `expression` body gains ONLY the new branch; every other line byte-identical)
- **`Plan: 1 to add, 1 to change, 0 to destroy.`**

ABORT if the plan shows any destroy/replace, any `authentik_provider_*` /
`authentik_outpost` / `authentik_flow*` / `helm_release`, or any other expression
change.
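
Part of this gate can be checked mechanically. The sketch below assumes the plan text was saved without colour codes and that Terraform prints its usual `# <address> will be ...` / `must be replaced` header for each changed resource; `plan_ok` is a hypothetical name. It does not replace reading the `expression` diff by hand.

```bash
# Hypothetical check: read a plain-text plan on stdin and accept it only if the
# summary is exactly the expected one and no forbidden resource type is touched.
plan_ok() {
  plan=$(cat)
  # The summary line must match the runbook's expected counts exactly.
  printf '%s\n' "$plan" \
    | grep -Fq 'Plan: 1 to add, 1 to change, 0 to destroy.' || return 1
  # Reject any planned action on provider/outpost/flow/helm resources.
  if printf '%s\n' "$plan" \
    | grep -Eq '^ *# .*(authentik_provider_|authentik_outpost|authentik_flow|helm_release).* (will be|must be) '; then
    return 1
  fi
  return 0
}
```

Typical use would be `plan_ok < plan.txt || echo ABORT`, where `plan.txt` is whatever file the plan output was captured into.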
|
||||
|
||||
**Apply** (presence-claim courtesy, then push = apply; land human-watched, B5):
```
~/code/scripts/presence claim stack:authentik --purpose "ADR-0020 TripIt External group + fence branch"
# push the branch to master (this triggers CI tg apply on the authentik stack)
```
Watch: GHA → Woodpecker `default.yml` apply → outpost stays healthy
(`kubectl -n authentik get endpoints ak-outpost-authentik-embedded-outpost` = 2
IPs; an anonymous request to any `auth=required` host still 302s to Authentik).
The branch is inert (empty group) so no access changes yet.
|
||||
|
||||
---
|
||||
|
||||
## Step 2 — Authentik SMTP (B1, BLOCKER before any flow)
|
||||
|
||||
Email verification is the **entire identity boundary** (TripIt trusts the
|
||||
Authentik email verbatim). Authentik currently has the **default/unconfigured**
|
||||
transport (`email.host = localhost`), so verification/recovery mail cannot send.
|
||||
|
||||
Add to **both** `server.env` and `worker.env` in
|
||||
`stacks/authentik/modules/authentik/values.yaml` (wire the password from a secret;
|
||||
the cluster mailserver is what TripIt already relays through —
|
||||
`mailserver.mailserver.svc`):
|
||||
```yaml
- { name: AUTHENTIK_EMAIL__HOST, value: "mailserver.mailserver.svc" }
- { name: AUTHENTIK_EMAIL__PORT, value: "587" }
- { name: AUTHENTIK_EMAIL__USE_TLS, value: "true" }
- { name: AUTHENTIK_EMAIL__FROM, value: "noreply@viktorbarzin.me" }
- { name: AUTHENTIK_EMAIL__USERNAME, value: "<relay user>" }   # confirm relay creds
- { name: AUTHENTIK_EMAIL__PASSWORD, valueFrom: { secretKeyRef: { name: <secret>, key: <key> } } }
```
|
||||
**Gate:** after apply, Authentik UI → System → Settings (or an Email stage) →
|
||||
**Send test email**; it must arrive. Then prove enrollment cannot complete for an
|
||||
address you do NOT control.
|
||||
|
||||
---
|
||||
|
||||
## Step 3 — Bind Vault → `Allow Login Users` (close the one open OIDC gap)
|
||||
|
||||
Authentik UI → Applications → **Vault** → bind an authorization policy requiring
|
||||
group **`Allow Login Users`** (the base group every real homelab user inherits;
|
||||
parentless `TripIt External` is excluded). This changes nothing for existing
|
||||
users and denies External users at the Vault consent step.
|
||||
Verify: an External test account (Step 6) cannot complete Vault OIDC login.
|
||||
|
||||
---
|
||||
|
||||
## Step 4 — Build the flows (Authentik UI; UI-managed per ADR split)
|
||||
|
||||
All three flows: designation as noted, no password stage.
|
||||
|
||||
**Flow `tripit-enrollment`** (Enrollment):
|
||||
| Order | Stage | Key settings |
|---|---|---|
| 5 | Captcha | reCAPTCHA **v2 checkbox** keys (v3/invisible fail, see `crowdsec-recaptcha-key-type`) |
| 10 | Identification | email only; **no** `password_stage`; `sources` optional |
| 20 | Email (verification) | activate, blocking, **before** user_write |
| 30 | WebAuthn authenticator setup | `user_verification = required`, `resident_key = required` |
| 40 | User Write | **`create_users_group` = `TripIt External`** (the keystone tag); `user_type = external` |
| 50 | User Login | session as default (`weeks=4`) |
|
||||
|
||||
**Flow `tripit-login`** (Authentication, passwordless):
|
||||
Identification (sets `enrollment_flow`/`recovery_flow`) → Authenticator
|
||||
Validation (`device_classes = [webauthn]`, `user_verification = required`) → User
|
||||
Login. Prefer routing a passkey-less email to recovery over minting a credential.
|
||||
|
||||
**Flow `tripit-recovery`** (Recovery):
|
||||
Identification (`pretend_user_exists = on`) → Email (recovery link) → WebAuthn
|
||||
authenticator setup → User Login. Notify the account on recovery + new-passkey.
|
||||
|
||||
> Do **NOT** bind the `brute-force-protection` ReputationPolicy to these flows —
|
||||
> it denies anonymous users (2026-04-06 regression). The Captcha is the bot gate.
|
||||
|
||||
---
|
||||
|
||||
## Step 5 — Surface "Sign up"
|
||||
|
||||
Recommended: a **TripIt-scoped** signup link / share-invite rather than a global
|
||||
login-screen button (narrower bot surface). Enrollment URL:
|
||||
`https://authentik.viktorbarzin.me/if/flow/tripit-enrollment/`.
|
||||
|
||||
---
|
||||
|
||||
## Step 6 — Verification (before/after — "all access keeps working")
|
||||
|
||||
Hosts for the matrix (must be real `auth="required"` default-allow hosts, NOT
|
||||
`auth="app"` apps like immich/nextcloud which bypass the catch-all):
|
||||
`tripit`, `family`, `hackmd`, `health` (default-allow) + `terminal` (admin-only).
|
||||
|
||||
**Before** (capture per user, no redirect-follow; 200=ALLOW, 302→authentik/403=DENY):
|
||||
```
COOKIE='authentik_session=<paste for this user>'
for H in tripit family hackmd health terminal; do
  printf '%-10s %s\n' "$H" "$(curl -s -o /dev/null -w '%{http_code}' --max-redirs 0 -H "Cookie: $COOKIE" "https://$H.viktorbarzin.me/")"
done
```
|
||||
Representative non-admin: `kadir.tugan@gmail.com` (Wrongmove-only) → tripit/family/hackmd/health ALLOW, terminal DENY. Admin `vbarzin@gmail.com` → all ALLOW.
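
The status legend above (200=ALLOW, 302→authentik/403=DENY) can be turned into a small helper so each capture reads as ALLOW/DENY rather than raw codes. This is a minimal sketch, not part of the runbook: `classify` is a hypothetical name, and it assumes any 302 seen here is the redirect to authentik.

```shell
# Hypothetical helper: map an HTTP status code onto the ALLOW/DENY legend.
# Assumption: a 302 always means "redirected to authentik" for these hosts.
classify() {
  case "$1" in
    200)     echo ALLOW ;;
    302|403) echo DENY ;;
    *)       echo "UNKNOWN($1)" ;;   # anything else needs a manual look
  esac
}

# Demo with canned status codes (no network access needed).
for pair in tripit:200 terminal:302 family:403; do
  printf '%-10s %s\n' "${pair%%:*}" "$(classify "${pair##*:}")"
done
```

Piping the curl loop's second column through such a helper keeps the before and after captures comparable even if a host switches between 302 and 403.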
|
||||
|
||||
**After Step 1 apply — regression:** re-run identically; both users' results MUST
|
||||
be unchanged (diff empty).
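
One way to make "diff empty" mechanical is to save each run of the loop to a file and compare the two. The sketch below assumes the captures were written to plain-text files; `compare_matrices` and the file names are illustrative, not something the runbook defines.

```shell
# Hypothetical helper: succeed only when the before/after captures are identical.
compare_matrices() {
  if diff -u "$1" "$2"; then
    echo "no regression"
  else
    echo "REGRESSION: access matrix changed" >&2
    return 1
  fi
}

# Demo with canned captures (hypothetical values, no network access).
tmpdir="$(mktemp -d)"
printf 'tripit 200\nterminal 302\n' > "$tmpdir/before.txt"
printf 'tripit 200\nterminal 302\n' > "$tmpdir/after.txt"
compare_matrices "$tmpdir/before.txt" "$tmpdir/after.txt"
rm -rf "$tmpdir"
```

Run it once per user (the representative non-admin and the admin); a non-zero exit for either is the signal to stop before building the flows.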
|
||||
|
||||
**After flows — external smoke test (the security proof):** enrol a throwaway
|
||||
account via the enrollment URL (email verify + passkey). Confirm it is tagged
|
||||
`TripIt External`, then with its cookie:
|
||||
```
for H in tripit family hackmd health terminal frigate; do
  printf '%-10s %s\n' "$H" "$(curl -s -o /dev/null -w '%{http_code}' --max-redirs 0 -H "Cookie: authentik_session=<external>" "https://$H.viktorbarzin.me/")"
done
```
|
||||
Expect **tripit=200, every other host DENY** (family/hackmd/health were ALLOW for
|
||||
kadir — the contrast is the fence proof). Then:
|
||||
- **OIDC containment:** with the external account, attempt OIDC login to Vault,
|
||||
Immich, Forgejo, Grafana → each must be DENIED at the app's own login.
|
||||
- **Auto-provision:** the TripIt `users` row exists (CNPG primary in ns `dbaas`:
|
||||
`select id,email from tripit.users where email='<throwaway>'`).
|
||||
- **Walling-off guard** `AuthentikWallingOffPublicPath` stays green.
|
||||
|
||||
**Any 200 on a non-tripit host, or any OIDC app admitting the external account →
|
||||
ROLLBACK.**
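
The forward-auth half of the rollback rule can be checked mechanically against the external account's matrix. The following is a sketch under stated assumptions: `fence_holds` is a hypothetical name, it reads `host code` pairs on stdin in the format the loop above prints, and treating a non-200 on tripit as a failure is an added assumption (the runbook only says to expect 200 there). It does not cover the OIDC checks.

```shell
# Hypothetical check: reads "host code" lines on stdin.
# Fails if any non-tripit host returned 200, or if tripit did not return 200.
fence_holds() {
  local host code failed=0 seen_tripit=0
  while read -r host code; do
    [ -n "$host" ] || continue
    if [ "$host" = "tripit" ]; then
      seen_tripit=1
      if [ "$code" != "200" ]; then
        echo "tripit returned $code (expected 200)" >&2
        failed=1
      fi
    elif [ "$code" = "200" ]; then
      echo "ROLLBACK: $host admitted the external account" >&2
      failed=1
    fi
  done
  if [ "$seen_tripit" -ne 1 ]; then
    echo "no tripit row in input" >&2
    failed=1
  fi
  return "$failed"
}

# Demo with a canned matrix (hypothetical values, no network access).
if printf 'tripit 200\nfamily 302\nterminal 403\n' | fence_holds; then
  echo "fence holds"
fi
```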
|
||||
|
||||
---
|
||||
|
||||
## Step 7 — Standing regression probe (recommended)
|
||||
|
||||
Add a permanent `TripIt External` identity to the `blackbox-exporter` guard
|
||||
(`stacks/monitoring/.../authentik_walloff_probe.tf` pattern): assert 200 on
|
||||
`tripit.viktorbarzin.me` AND DENY on `family.viktorbarzin.me`. This converts the
|
||||
"branch stays first" and "user_write keeps the keystone tag" invariants into
|
||||
automated `#security` alerts.
|
||||
|
||||
---
|
||||
|
||||
## Rollback
|
||||
|
||||
Revert the `admin-services-restriction.tf` expression (delete the branch) and push
|
||||
(= apply); removing a prepended `if g: return …` is behaviour-preserving on
|
||||
non-members, restoring prior authz. Disable/delete the throwaway external account
|
||||
(with the branch gone, a tagged account falls into default-allow). The empty group
|
||||
may stay (harmless). Plan-gate the revert too.
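
The reasoning behind this rollback can be illustrated with a toy model of the policy's dispatch order. This is not the Authentik expression itself: `authorize` is a hypothetical function, `terminal.viktorbarzin.me` stands in for the whole `ADMIN_ONLY_HOSTS` set, and the assumption that admin-only hosts require `Home Server Admins` is inferred from the surrounding text rather than copied from the policy.

```shell
# Toy model of the dispatch order (hypothetical; not the real policy).
# authorize FENCE GROUPS HOST: FENCE=1 models the prepended branch,
# FENCE=0 the reverted policy. GROUPS is a comma-separated list.
authorize() {
  local fence="$1" member_of="$2" host="$3"
  if [ "$fence" = 1 ]; then
    case ",$member_of," in
      *",TripIt External,"*)
        if [ "$host" = "tripit.viktorbarzin.me" ]; then return 0; else return 1; fi ;;
    esac
  fi
  case "$host" in
    terminal.viktorbarzin.me) ;;   # stand-in for ADMIN_ONLY_HOSTS
    *) return 0 ;;                 # default-allow for every other host
  esac
  case ",$member_of," in
    *",Home Server Admins,"*) return 0 ;;   # assumed admin gate
  esac
  return 1
}

# Non-members see identical results with and without the fence branch.
for host in tripit.viktorbarzin.me family.viktorbarzin.me terminal.viktorbarzin.me; do
  with_fence=DENY; reverted=DENY
  authorize 1 "Wrongmove Users" "$host" && with_fence=ALLOW
  authorize 0 "Wrongmove Users" "$host" && reverted=ALLOW
  printf '%-28s with-fence=%s reverted=%s\n' "$host" "$with_fence" "$reverted"
done
```

In the same model, `authorize 0 "TripIt External" family.viktorbarzin.me` succeeds, which is why the throwaway account has to be disabled or deleted as part of the revert.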
|
||||
|
||||
## Operational invariants
|
||||
|
||||
- `TripIt External` stays **parentless** (never under `Allow Login Users`).
|
||||
- The fence branch stays **first** in `admin-services-restriction`.
|
||||
- **Never** co-assign `TripIt External` to a trusted/internal user.
|
||||
- The `tripit-enrollment` user_write **`create_users_group`** setting is the
|
||||
keystone; re-verify after any flow edit (clearing it yields UNtagged accounts
that fall into default-allow).
|
||||
- Authentik SMTP is a live dependency of enrollment + recovery.
|
||||
|
|
@ -270,6 +270,43 @@ install_user_claude_token() {
|
|||
log "shared Claude token -> $user (t3-serve env; restart needed to take effect)"
|
||||
}
|
||||
|
||||
# Re-deploy the managed per-user Claude launcher to ~/start-claude.sh. /etc/skel only
|
||||
# seeds it at account creation (setup-devvm.sh), so without this a launcher edit never
|
||||
# reaches EXISTING users — they keep running a stale copy. Copy-if-changed from the repo's
|
||||
# skel/, owned by the user, 0755. (We deliberately do NOT re-copy .tmux.conf: terminal-lobby
|
||||
# appends a managed persistence section to each user's ~/.tmux.conf that a re-copy would clobber.)
|
||||
deploy_user_launcher() {
|
||||
local user="$1" home src dst
|
||||
src="$WORKSTATION_DIR/skel/start-claude.sh"
|
||||
home="$(getent passwd "$user" | cut -d: -f6)"
|
||||
[[ -n "$home" && -d "$home" && -f "$src" ]] || return 0
|
||||
dst="$home/start-claude.sh"
|
||||
cmp -s "$src" "$dst" 2>/dev/null && return 0 # already current -> no churn
|
||||
if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] deploy launcher -> $dst"; return 0; fi
|
||||
install -m 0755 "$src" "$dst"
|
||||
chown "$user:$user" "$dst"
|
||||
log "deployed start-claude.sh -> $user"
|
||||
}
|
||||
|
||||
# Ensure the per-user NATIVE claude install (the recommended runtime: ~user/.local/bin/claude,
|
||||
# self-updating) — used by BOTH the terminal launcher AND the user's t3-serve instance. We do
|
||||
# NOT npm-install claude system-wide (npm/npx isn't the recommended runtime); each user gets
|
||||
# their own native install. Idempotent: skip if already present. Runs the official native
|
||||
# installer AS the user (into their ~/.local). Best-effort: a failure WARNs and retries next
|
||||
# reconcile (start-claude.sh also self-bootstraps the terminal path).
|
||||
install_user_claude_native() {
|
||||
local user="$1" home
|
||||
home="$(getent passwd "$user" | cut -d: -f6)"
|
||||
[[ -n "$home" && -d "$home" ]] || return 0
|
||||
[[ -x "$home/.local/bin/claude" ]] && return 0 # already native -> done
|
||||
if [[ "$DRY_RUN" == 1 ]]; then echo "[dry-run] native claude install -> $user"; return 0; fi
|
||||
if runuser -u "$user" -- bash -lc 'curl -fsSL https://claude.ai/install.sh | bash' >/dev/null 2>&1; then
|
||||
log "installed native claude -> $user"
|
||||
else
|
||||
log "WARN: native claude install failed for $user (retries next reconcile)"
|
||||
fi
|
||||
}
|
||||
|
||||
[[ $EUID -eq 0 ]] || { echo "t3-provision-users: must run as root" >&2; exit 1; }
|
||||
for bin in python3 jq; do command -v "$bin" >/dev/null || { echo "missing $bin" >&2; exit 1; }; done
|
||||
[[ -f "$ROSTER" && -f "$ENGINE" ]] || { echo "roster/engine not under $WORKSTATION_DIR" >&2; exit 1; }
|
||||
|
|
@ -346,8 +383,10 @@ while IFS=$'\t' read -r os_user tier shell groups_csv code_layout repos_csv; do
|
|||
fi
|
||||
install_user_kubeconfig "$os_user"
|
||||
install_user_claude_token "$os_user"
|
||||
deploy_user_launcher "$os_user" # keep ~/start-claude.sh current (skel only seeds new accounts)
|
||||
fi
|
||||
refresh_codex_mirror "$os_user" # all tiers — mirror of the managed claudeMd
|
||||
install_user_claude_native "$os_user" # all tiers — per-user native claude (terminal + t3); no npm/npx
|
||||
done < <(jq -r '.accounts[] | [.os_user, .tier, .shell, (if (.groups|length)==0 then "-" else (.groups|join(",")) end), .code_layout, (if (.repos|length)==0 then "-" else (.repos|join(",")) end)] | @tsv' "$desired_file")
|
||||
|
||||
# 5) per-user .env (sticky port) + enable t3-serve@
|
||||
|
|
|
|||
|
|
@ -21,7 +21,13 @@ export DEBIAN_FRONTEND=noninteractive
|
|||
apt-get update -qq
|
||||
apt-get install -y "${PKGS[@]}" >/dev/null
|
||||
|
||||
# 2) node >= 18 + claude-code (claude-code requires node >= 18)
|
||||
# 2) node >= 18 — needed for the t3 CLI (npm-global, below). NOT for claude-code:
|
||||
# claude-code is the per-user NATIVE install (the recommended, self-updating
|
||||
# ~/.local/bin/claude), provisioned per user by t3-provision-users
|
||||
# (install_user_claude_native) and self-bootstrapped by start-claude.sh on first launch.
|
||||
# We deliberately do NOT `npm install -g @anthropic-ai/claude-code` — npm/npx is not the
|
||||
# recommended runtime, and a system-wide npm copy just shadows/duplicates the per-user
|
||||
# native installs everyone auto-migrates to anyway.
|
||||
need_node=1
|
||||
if command -v node >/dev/null; then
|
||||
[[ "$(node -v | sed 's/^v\([0-9]*\).*/\1/')" -ge 18 ]] && need_node=0
|
||||
|
|
@ -31,14 +37,23 @@ if [[ $need_node -eq 1 ]]; then
|
|||
curl -fsSL https://deb.nodesource.com/setup_22.x | bash - >/dev/null
|
||||
apt-get install -y nodejs >/dev/null
|
||||
fi
|
||||
# Detect the GLOBAL npm package, NOT whatever `claude` resolves to on PATH: the admin's
|
||||
# personal ~/.local/bin/claude shadows it, so `command -v claude` silently skipped the
|
||||
# system-wide install — leaving /usr/lib/node_modules/@anthropic-ai empty and fresh
|
||||
# non-admins with no claude (they only worked because the admin's install was on PATH).
|
||||
if ! npm ls -g --depth=0 @anthropic-ai/claude-code >/dev/null 2>&1; then
|
||||
log "npm: installing @anthropic-ai/claude-code (system-wide)"
|
||||
npm install -g @anthropic-ai/claude-code >/dev/null
|
||||
fi
|
||||
|
||||
# 2a) ~/.local/bin on PATH for all LOGIN shells (machine-wide). The native claude install
|
||||
# lives at ~/.local/bin; this guarantees login shells (SSH, etc.) find it regardless of
|
||||
# whether the per-user native-installer rc edit ran. (The terminal launcher sets PATH
|
||||
# itself, and t3-serve@.service hard-sets PATH in the unit.)
|
||||
install -d -m 0755 /etc/profile.d
|
||||
cat > /etc/profile.d/10-local-bin.sh <<'PROFILE_EOF'
|
||||
# Native per-user installs (e.g. claude-code) live in ~/.local/bin — put it on PATH.
|
||||
# Guarded so it never duplicates. Sourced by login shells (bash via /etc/profile; zsh
|
||||
# login via /etc/zsh/zprofile -> /etc/profile).
|
||||
case ":$PATH:" in
|
||||
*":$HOME/.local/bin:"*) ;;
|
||||
*) export PATH="$HOME/.local/bin:$PATH" ;;
|
||||
esac
|
||||
PROFILE_EOF
|
||||
chmod 0644 /etc/profile.d/10-local-bin.sh
|
||||
log "/etc/profile.d/10-local-bin.sh (~/.local/bin on PATH for login shells)"
|
||||
|
||||
# 2b) t3 (the per-user coding surface) — PINNED, never nightly/latest. t3 is pre-1.0 and
|
||||
# ships breaking auth-schema + bootstrap-API changes our t3-dispatch can't follow blind
|
||||
|
|
|
|||
|
|
@ -11,6 +11,14 @@ echo " Starting Claude Code in $HOME/code ..."
|
|||
echo " (Right-click for tmux menu, or Ctrl+B then | or - to split)"
|
||||
echo ""
|
||||
|
||||
# The native claude install lives in ~/.local/bin. This launcher runs in tmux's non-login
|
||||
# env, which does NOT source the user's shell rc (where the native installer added it to
|
||||
# PATH) — so `claude` would appear missing here. Put it on PATH ourselves; guarded/idempotent.
|
||||
case ":$PATH:" in
|
||||
*":$HOME/.local/bin:"*) ;;
|
||||
*) export PATH="$HOME/.local/bin:$PATH" ;;
|
||||
esac
|
||||
|
||||
name_args=()
|
||||
if [ -n "${TMUX:-}" ]; then
|
||||
sess="$(tmux display-message -p '#{session_name}' 2>/dev/null)"
|
||||
|
|
@ -42,14 +50,48 @@ else
|
|||
done
|
||||
fi
|
||||
|
||||
# Prefer the system-wide `claude` (installed by setup-devvm.sh); fall back to npx.
|
||||
# Run the NATIVE `claude` (the recommended install: ~/.local/bin/claude, self-updating).
|
||||
# No npm/npx. If the native binary is missing (a fresh account before the hourly reconcile
|
||||
# has provisioned it), bootstrap it with the official native installer, then run it.
|
||||
launch() {
|
||||
if command -v claude >/dev/null 2>&1; then
|
||||
claude "$@"
|
||||
if ! command -v claude >/dev/null 2>&1; then
|
||||
echo " Installing Claude Code (native) for $(id -un) …"
|
||||
curl -fsSL https://claude.ai/install.sh | bash || return 127
|
||||
export PATH="$HOME/.local/bin:$PATH"
|
||||
fi
|
||||
claude "$@"
|
||||
}
|
||||
|
||||
# Re-assert Claude Code's first-run onboarding flag before launch. ~/.claude.json is a
|
||||
# SINGLE file that ALL of a user's concurrent claude processes (this terminal, their
|
||||
# t3-serve instance, agent/SDK sessions) read-modify-write; a stale writer periodically
|
||||
# drops top-level keys — including hasCompletedOnboarding — which throws the next
|
||||
# interactive session back to the "Choose the text style" wizard even though the user is
|
||||
# fully logged in (credentials live in the SEPARATE ~/.claude/.credentials.json, which is
|
||||
# never affected). Idempotent, runs as the user right before launch, never clobbers other
|
||||
# keys. Best-effort: no-op if jq is missing or the file is empty/corrupt (claude self-heals).
|
||||
ensure_onboarding() {
|
||||
command -v jq >/dev/null 2>&1 || return 0
|
||||
local cfg="$HOME/.claude.json" ver tmp
|
||||
ver="$(claude --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)"
|
||||
if [ -s "$cfg" ]; then
|
||||
jq -e . "$cfg" >/dev/null 2>&1 || return 0 # corrupt -> leave for claude
|
||||
[ "$(jq -r '.hasCompletedOnboarding // false' "$cfg")" = "true" ] && return 0 # already set -> no write
|
||||
elif [ -e "$cfg" ]; then
|
||||
return 0 # empty (mid-write?) -> leave it
|
||||
fi
|
||||
tmp="$(mktemp "${cfg}.XXXXXX")" || return 0
|
||||
if [ -f "$cfg" ]; then
|
||||
jq --arg v "$ver" '.hasCompletedOnboarding = true
|
||||
| (if $v != "" then .lastOnboardingVersion = $v else . end)' "$cfg" > "$tmp" 2>/dev/null \
|
||||
&& chmod 600 "$tmp" && mv "$tmp" "$cfg" || rm -f "$tmp"
|
||||
else
|
||||
npx @anthropic-ai/claude-code "$@"
|
||||
jq -n --arg v "$ver" '{hasCompletedOnboarding: true}
|
||||
+ (if $v != "" then {lastOnboardingVersion: $v} else {} end)' > "$tmp" 2>/dev/null \
|
||||
&& chmod 600 "$tmp" && mv "$tmp" "$cfg" || rm -f "$tmp"
|
||||
fi
|
||||
}
|
||||
ensure_onboarding
|
||||
|
||||
# Deliberately not `exec` so we can branch on the exit code: clean quit ends the
|
||||
# pane (ttyd closes the terminal); a crash drops to a shell so the tmux session
|
||||
|
|
|
|||
|
|
@ -6,5 +6,5 @@ variable "tls_secret_name" {
|
|||
variable "image_tag" {
|
||||
type = string
|
||||
default = "latest"
|
||||
description = "android-emulator image tag at forgejo.viktorbarzin.me/viktor/android-emulator. Built by GHA (.github/workflows/build-android-emulator.yml) -> ghcr.io/viktorbarzin/android-emulator on changes to stacks/android-emulator/docker/ (ADR-0002). :latest tracks the newest build."
|
||||
description = "android-emulator image tag at ghcr.io/viktorbarzin/android-emulator. Built by GHA (.github/workflows/build-android-emulator.yml) on changes to stacks/android-emulator/docker/ (ADR-0002). :latest tracks the newest build."
|
||||
}
|
||||
|
|
|
|||
189
stacks/anisette/main.tf
Normal file
|
|
@ -0,0 +1,189 @@
|
|||
# anisette — self-hosted Apple anisette-data server for SideStore/AltStore.
|
||||
#
|
||||
# Purpose (infra issue #40): the TripIt iOS Shell is sideloaded with SideStore
|
||||
# using a free Apple ID. SideStore needs an "anisette" server to broker the
|
||||
# Apple-ID auth dance; the public community anisette servers see every login,
|
||||
# so we run our own. Stateless HTTP service on a stable INTERNAL endpoint
|
||||
# (anisette.viktorbarzin.lan) that SideStore points at.
|
||||
#
|
||||
# Image: Dadoum/anisette-v3-server — the de-facto standard anisette-v3 server
|
||||
# for SideStore/AltStore (the same project SideStore's own docs point at).
|
||||
# Upstream publishes ONLY a mutable :latest tag (no GitHub releases, no semver,
|
||||
# no date/sha tags — verified 2026-06-14), so we pin by MANIFEST DIGEST instead
|
||||
# (immutable, honours the "never :latest" rule). DockerHub is pulled
|
||||
# transparently via the registry-VM pull-through cache, same as echo/cyberchef.
|
||||
# To bump: `docker buildx imagetools inspect dadoum/anisette-v3-server:latest`,
|
||||
# then replace the digest below.
|
||||
#
|
||||
# Stateless: the container caches Apple provisioning libraries under
|
||||
# /home/Alcoholic/.config/anisette-v3/lib (a regenerable download — re-fetched
|
||||
# if absent — and per upstream issue #23 it does NOT preserve client auth across
|
||||
# restarts anyway). So an emptyDir is the honest fit: keeps that path writable
|
||||
# without taking on a backup-pipeline obligation. No PVC, no Vault secret.
|
||||
|
||||
variable "tls_secret_name" {
|
||||
type = string
|
||||
sensitive = true
|
||||
}
|
||||
|
||||
resource "kubernetes_namespace" "anisette" {
|
||||
metadata {
|
||||
name = "anisette"
|
||||
labels = {
|
||||
"istio-injection" : "disabled"
|
||||
tier = local.tiers.aux
|
||||
}
|
||||
}
|
||||
lifecycle {
|
||||
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
|
||||
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
|
||||
}
|
||||
}
|
||||
|
||||
module "tls_secret" {
|
||||
source = "../../modules/kubernetes/setup_tls_secret"
|
||||
namespace = kubernetes_namespace.anisette.metadata[0].name
|
||||
tls_secret_name = var.tls_secret_name
|
||||
}
|
||||
|
||||
resource "kubernetes_deployment" "anisette" {
|
||||
metadata {
|
||||
name = "anisette"
|
||||
namespace = kubernetes_namespace.anisette.metadata[0].name
|
||||
labels = {
|
||||
app = "anisette"
|
||||
tier = local.tiers.aux
|
||||
}
|
||||
}
|
||||
# anisette downloads + initializes Apple's CoreADI provisioning library on
|
||||
# first start (slow, memory-spiky). wait_for_rollout=false so the apply never
|
||||
# blocks on — and never strands out of terraform state — a pod that is still
|
||||
# warming up (mirrors tts/llama-cpp). Pod health is still gated by the
|
||||
# readiness probe below, so the Service only routes once it's actually up.
|
||||
wait_for_rollout = false
|
||||
spec {
|
||||
replicas = 1
|
||||
selector {
|
||||
match_labels = {
|
||||
app = "anisette"
|
||||
}
|
||||
}
|
||||
template {
|
||||
metadata {
|
||||
labels = {
|
||||
app = "anisette"
|
||||
}
|
||||
annotations = {
|
||||
# Diun notify-only watch. Upstream tags only :latest, so watch the
|
||||
# digest of :latest rather than a semver pattern.
|
||||
"diun.enable" = "true"
|
||||
"diun.watch_repo" = "false"
|
||||
"diun.include_tags" = "^latest$"
|
||||
}
|
||||
}
|
||||
spec {
|
||||
container {
|
||||
# Pinned by digest — upstream ships only a mutable :latest (no tags).
|
||||
# The `docker.io/` prefix is REQUIRED, not cosmetic: the Kyverno
|
||||
# require-trusted-registries policy allowlists `docker.io/*` but NOT a
|
||||
# bare `dadoum/*` prefix (only enumerated DockerHub user repos like
|
||||
# mendhak/*, mpepping/* are listed in
|
||||
# stacks/kyverno/modules/kyverno/security-policies.tf). A bare
|
||||
# `dadoum/anisette-v3-server@...` is denied at admission; the explicit
|
||||
# docker.io/ registry matches the allowlist and still pulls via the
|
||||
# 10.0.20.10 pull-through cache.
|
||||
image = "docker.io/dadoum/anisette-v3-server@sha256:1e20384985d3c49965f444bef39d627768dacc39ea0dca91f2a535edb7591ba3"
|
||||
name = "anisette"
|
||||
port {
|
||||
name = "http"
|
||||
container_port = 6969
|
||||
}
|
||||
# The image runs as the non-root user "Alcoholic" and writes its
|
||||
# provisioning-library cache here; back it with an emptyDir so the
|
||||
# path is writable (stateless — wiped on restart, re-downloaded).
|
||||
volume_mount {
|
||||
name = "provisioning-cache"
|
||||
mount_path = "/home/Alcoholic/.config/anisette-v3/lib"
|
||||
}
|
||||
resources {
|
||||
requests = {
|
||||
cpu = "10m"
|
||||
memory = "256Mi"
|
||||
}
|
||||
limits = {
|
||||
# anisette downloads + initializes Apple's CoreADI provisioning
|
||||
# library at startup, which spikes past 128Mi → OOMKilled (exit
|
||||
# 137) before it can bind :6969. 512Mi gives headroom; steady
|
||||
# state is much lower.
|
||||
memory = "512Mi"
|
||||
}
|
||||
}
|
||||
readiness_probe {
|
||||
http_get {
|
||||
path = "/"
|
||||
port = 6969
|
||||
}
|
||||
period_seconds = 15
|
||||
initial_delay_seconds = 5
|
||||
}
|
||||
liveness_probe {
|
||||
http_get {
|
||||
path = "/"
|
||||
port = 6969
|
||||
}
|
||||
period_seconds = 30
|
||||
failure_threshold = 6
|
||||
}
|
||||
}
|
||||
volume {
|
||||
name = "provisioning-cache"
|
||||
empty_dir {}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [
|
||||
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
|
||||
]
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_service" "anisette" {
|
||||
metadata {
|
||||
name = "anisette"
|
||||
namespace = kubernetes_namespace.anisette.metadata[0].name
|
||||
labels = {
|
||||
"app" = "anisette"
|
||||
}
|
||||
}
|
||||
spec {
|
||||
selector = {
|
||||
app = "anisette"
|
||||
}
|
||||
port {
|
||||
name = "http"
|
||||
port = "80"
|
||||
target_port = "6969"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
module "ingress" {
|
||||
source = "../../modules/kubernetes/ingress_factory"
|
||||
# auth = "none": SideStore is a native iOS client — it can't replay the
|
||||
# Authentik forward-auth cookie dance, so Authentik would break it (same
|
||||
# reasoning as android-emulator's adb). Internal-only: anisette.viktorbarzin.lan,
|
||||
# allow_local_access_only locks it to the LAN, and it brokers no user data of
|
||||
# ours (it just relays Apple-ID anisette data). Never publicly exposed.
|
||||
auth = "none"
|
||||
namespace = kubernetes_namespace.anisette.metadata[0].name
|
||||
name = "anisette"
|
||||
root_domain = "viktorbarzin.lan"
|
||||
tls_secret_name = var.tls_secret_name
|
||||
allow_local_access_only = true
|
||||
ssl_redirect = false
|
||||
extra_annotations = {
|
||||
"gethomepage.dev/enabled" = "false"
|
||||
}
|
||||
}
|
||||
1
stacks/anisette/secrets
Symbolic link
|
|
@ -0,0 +1 @@
|
|||
../../secrets
|
||||
8
stacks/anisette/terragrunt.hcl
Normal file
|
|
@ -0,0 +1,8 @@
|
|||
include "root" {
|
||||
path = find_in_parent_folders()
|
||||
}
|
||||
|
||||
dependency "platform" {
|
||||
config_path = "../platform"
|
||||
skip_outputs = true
|
||||
}
|
||||
|
|
@ -49,6 +49,21 @@ resource "authentik_policy_expression" "admin_services_restriction" {
|
|||
|
||||
host = request.context.get("host", "")
|
||||
|
||||
# TripIt External containment fence (ADR-0020 in the tripit repo). Publicly
|
||||
# self-enrolled TripIt users (group "TripIt External", assigned by the
|
||||
# tripit-enrollment flow's user_write) may reach tripit.viktorbarzin.me and
|
||||
# NOTHING else. MUST be the FIRST host-dispatch branch: it is a request.user
|
||||
# predicate that must dominate every host branch below, ESPECIALLY the
|
||||
# default-allow `if host not in ADMIN_ONLY_HOSTS: return True` — placed after
|
||||
# it, a tagged user would slip into other hosts. Safe to add: the group is
|
||||
# net-new and created EMPTY, so this matches zero existing principals (no
|
||||
# lockout). The fence is forward-auth ONLY; OIDC apps (Vault, Immich, …)
|
||||
# contain External users via their own per-app group bindings — see
|
||||
# docs/runbooks/tripit-external-signup.md. NEVER co-assign "TripIt External"
|
||||
# to a trusted/internal user (this branch would fence them out of admin hosts).
|
||||
if ak_is_group_member(request.user, name="TripIt External"):
|
||||
return host == "tripit.viktorbarzin.me"
|
||||
|
||||
# t3 Workstation edge gate: only members of "T3 Users" may reach t3.
|
||||
# Placed BEFORE the ADMIN_ONLY_HOSTS early-return (t3 is intentionally not in
|
||||
# that set — it must not require Home-Server-Admins, just T3 Users membership).
|
||||
|
|
|
|||
22
stacks/authentik/tripit-external.tf
Normal file
|
|
@ -0,0 +1,22 @@
|
|||
# "TripIt External" group — containment anchor for publicly self-enrolled TripIt
|
||||
# users (ADR-0020 in the tripit repo). Members are admitted to
|
||||
# tripit.viktorbarzin.me ONLY and denied every other *.viktorbarzin.me
|
||||
# forward-auth host by the prepended branch in admin-services-restriction.tf.
|
||||
#
|
||||
# Created EMPTY and PARENTLESS, on purpose:
|
||||
# * EMPTY — the no-lockout guarantee. Zero members at apply time => the
|
||||
# prepended policy branch matches zero existing principals => it cannot
|
||||
# change anyone's authorization (contrast authentik_group "T3 Users", which
|
||||
# is created WITH members atomically because THAT gate's safety property is
|
||||
# the opposite). Membership is assigned at RUNTIME by the tripit-enrollment
|
||||
# flow's user_write "Create users group" option (UI-managed per the ADR
|
||||
# management split). Terraform owns only the group's EXISTENCE.
|
||||
# * PARENTLESS — do NOT make this a child of "Allow Login Users". The sensitive
|
||||
# OIDC apps gate on "Home Server Admins" / "Headscale Users" / "Wrongmove
|
||||
# Users" (children of "Allow Login Users") or, for Vault, on "Allow Login
|
||||
# Users" itself (bound as part of ADR-0020). Keeping External out of that
|
||||
# tree is what stops these users reaching OIDC apps — mirrors guest.tf, which
|
||||
# keeps the guest group out of "Allow Login Users" for the same reason.
|
||||
resource "authentik_group" "tripit_external" {
|
||||
name = "TripIt External"
|
||||
}
|
||||
27
stacks/authentik/vault-authz-binding.tf
Normal file
|
|
@ -0,0 +1,27 @@
|
|||
# Vault OIDC authorization fence (ADR-0020). The "Vault" Authentik application had
|
||||
# NO authorization binding (audit 2026-06-15: any authenticated identity could
|
||||
# complete Vault OIDC login and receive Vault's built-in `default`-policy token —
|
||||
# token self-management/cubbyhole, no secret access, but still more than an
|
||||
# outside user should hold). Bind it to "Allow Login Users" so only established
|
||||
# homelab users can log in: they inherit that base group via its children
|
||||
# (Home Server Admins / Headscale Users / Wrongmove Users — verified live that
|
||||
# `User.all_groups()` includes the parent), while publicly self-enrolled
|
||||
# "TripIt External" users (deliberately PARENTLESS, so NOT in Allow Login Users)
|
||||
# are denied at the Vault consent step. Closes the one OIDC app the forward-auth
|
||||
# fence cannot reach; the other sensitive OIDC apps already bind a trusted group.
|
||||
#
|
||||
# The Vault application itself stays UI-managed (like the other OIDC apps); this
|
||||
# adds ONLY the authorization binding. policy_engine_mode on the app is "any", so
|
||||
# one group binding == membership in that group is required to authorize.
|
||||
#
|
||||
# UUIDs are PINNED as literals: this provider version has NO
|
||||
# `data "authentik_application"` data source (CI pipeline 198 failed on it), and
|
||||
# both objects are UI-managed and stable. To re-fetch if either is recreated, run
|
||||
# `ak shell` in the goauthentik-server pod and read
|
||||
# `Application.objects.get(name="Vault").pbm_uuid` and
|
||||
# `Group.objects.get(name="Allow Login Users").group_uuid`.
|
||||
resource "authentik_policy_binding" "vault_allow_login_users" {
|
||||
target = "fe5698e3-b6b1-4475-98fa-ce2bae22f4dd" # Authentik application "Vault" (pbm_uuid)
|
||||
group = "b4823cd7-8ed8-4d2f-8f94-bc285138f853" # group "Allow Login Users" (group_uuid)
|
||||
order = 0
|
||||
}
|
||||
|
|
@ -427,7 +427,7 @@ resource "kubernetes_cron_job_v1" "mysql-backup" {
|
|||
failed_jobs_history_limit = 5
|
||||
schedule = "30 0 * * *"
|
||||
# schedule = "* * * * *"
|
||||
starting_deadline_seconds = 10
|
||||
starting_deadline_seconds = 600
|
||||
successful_jobs_history_limit = 10
|
||||
job_template {
|
||||
metadata {}
|
||||
|
|
@ -519,7 +519,7 @@ resource "kubernetes_cron_job_v1" "mysql-backup-per-db" {
|
|||
concurrency_policy = "Replace"
|
||||
failed_jobs_history_limit = 3
|
||||
schedule = "45 0 * * *"
|
||||
starting_deadline_seconds = 10
|
||||
starting_deadline_seconds = 600
|
||||
successful_jobs_history_limit = 3
|
||||
job_template {
|
||||
metadata {}
|
||||
|
|
@ -1607,7 +1607,12 @@ resource "kubernetes_cron_job_v1" "postgresql-backup" {
|
|||
failed_jobs_history_limit = 5
|
||||
schedule = "0 0 * * *"
|
||||
# schedule = "* * * * *"
|
||||
starting_deadline_seconds = 10
|
||||
# 600s (was 10s): a 10s deadline silently DROPPED the 2026-06-13 00:00 run
|
||||
# when the CronJob controller was late at the midnight backup/IO-storm tick,
|
||||
# leaving the last full dump 37h old (fired PostgreSQLBackupStale). 600s lets
|
||||
# a brief controller lag still launch the job. Same fix on the other three
|
||||
# dbaas backup crons (they share the midnight window).
|
||||
starting_deadline_seconds = 600
|
||||
successful_jobs_history_limit = 10
|
||||
job_template {
|
||||
metadata {}
|
||||
|
|
@ -1695,7 +1700,7 @@ resource "kubernetes_cron_job_v1" "postgresql-backup-per-db" {
|
|||
concurrency_policy = "Replace"
|
||||
failed_jobs_history_limit = 3
|
||||
schedule = "15 0 * * *"
|
||||
starting_deadline_seconds = 10
|
||||
starting_deadline_seconds = 600
|
||||
successful_jobs_history_limit = 3
|
||||
job_template {
|
||||
metadata {}
|
||||
|
|
|
|||
|
|
@ -128,7 +128,7 @@ resource "kubernetes_deployment" "f1-stream" {
|
|||
}
|
||||
spec {
|
||||
container {
|
||||
image = "forgejo.viktorbarzin.me/viktor/f1-stream:${var.image_tag}"
|
||||
image = "ghcr.io/viktorbarzin/f1-stream:${var.image_tag}"
|
||||
image_pull_policy = "Always"
|
||||
name = "f1-stream"
|
||||
# Right-sized 2026-06-05: was 1Gi (bundled-Chromium era). The image is
|
||||
|
|
|
|||
|
|
@ -11,6 +11,12 @@ resource "kubernetes_namespace" "forgejo" {
|
|||
"istio-injection" : "disabled"
|
||||
tier = local.tiers.edge
|
||||
"keel.sh/enrolled" = "true"
|
||||
# Opt out of the auto-generated tier-3-edge ResourceQuota (caps
|
||||
# requests.memory at 4Gi). Forgejo's own pod requests 4Gi (the
|
||||
# git + OCI-registry backbone, Guaranteed QoS), which pegged that
|
||||
# tier quota at 100% and fired KubeQuotaAlmostFull. The
|
||||
# forgejo-specific quota below gives headroom. Same pattern as dbaas.
|
||||
"resource-governance/custom-quota" = "true"
|
||||
}
|
||||
}
|
||||
lifecycle {
|
||||
|
|
@ -19,6 +25,26 @@ resource "kubernetes_namespace" "forgejo" {
|
|||
}
|
||||
}
|
||||
|
||||
# Custom ResourceQuota — replaces the tier-3-edge auto quota (opted out via the
|
||||
# resource-governance/custom-quota label above). requests.memory is 8Gi so the
|
||||
# 4Gi Forgejo pod sits at ~50% (clears KubeQuotaAlmostFull + the healthcheck
|
||||
# resourcequota check) with room for a transient migration/sidecar pod. To
|
||||
# raise Forgejo's memory limit past 4Gi later, bump requests.memory here too.
|
||||
resource "kubernetes_resource_quota" "forgejo" {
|
||||
metadata {
|
||||
name = "forgejo-quota"
|
||||
namespace = kubernetes_namespace.forgejo.metadata[0].name
|
||||
}
|
||||
spec {
|
||||
hard = {
|
||||
"requests.cpu" = "4"
|
||||
"requests.memory" = "8Gi"
|
||||
"limits.memory" = "32Gi"
|
||||
pods = "30"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
module "tls_secret" {
|
||||
source = "../../modules/kubernetes/setup_tls_secret"
|
||||
namespace = kubernetes_namespace.forgejo.metadata[0].name
|
||||
|
|
@ -168,19 +194,29 @@ resource "kubernetes_deployment" "forgejo" {
|
|||
name = "data"
|
||||
mount_path = "/data"
|
||||
}
|
||||
# Bumped 1Gi -> 3Gi 2026-06-09: Forgejo was OOMKilled (exit 137)
|
||||
# under registry-push load from in-cluster CI builds (tripit
|
||||
# buildkit pushes large layers into the OCI registry). VPA
|
||||
# upperBound reads ~1.5Gi, but that's suppressed by the 1Gi cap it
|
||||
# kept OOMing against — size for the push spike, not steady-state.
|
||||
# Bumped 1Gi -> 3Gi 2026-06-09, then 3Gi -> 4Gi 2026-06-13.
|
||||
# OOMKilled again (exit 137) at the 3Gi cap on 2026-06-13 (2
|
||||
# restarts; briefly took the git remote + OCI registry down and
|
||||
# spiked ingress TTFB/4xx). Steady-state ~2.2Gi but it spiked past
|
||||
# the 3Gi cap. 4Gi is the CEILING here: the forgejo namespace
|
||||
# tier-quota caps requests.memory at 4Gi and Guaranteed QoS means
|
||||
# request == limit, so a pod can request at most 4Gi. A first
|
||||
# attempt at 6Gi was REJECTED (FailedCreate: exceeded quota) and
|
||||
# left forgejo with 0 pods until reverted -- do NOT raise memory
|
||||
# past 4Gi without ALSO raising the tier-quota. The 6/9 OOM driver
|
||||
# (tripit buildkit registry pushes) is gone now that the Forgejo
|
||||
# registry was frozen + emptied 2026-06-13 (ADR-0002, ghcr), so the
|
||||
# remaining spike is git ops / integrity-probe catalog walk / a
|
||||
# possible leak; 4Gi should suffice. If it still OOMs, raise the
|
||||
# tier-quota and this limit together.
|
||||
# requests=limits (Guaranteed QoS) per the repo memory convention.
|
||||
resources {
|
||||
requests = {
|
||||
cpu = "15m"
|
||||
memory = "3Gi"
|
||||
memory = "4Gi"
|
||||
}
|
||||
limits = {
|
||||
memory = "3Gi"
|
||||
memory = "4Gi"
|
||||
}
|
||||
}
|
||||
port {
|
||||
|
|
|
|||
|
|
@ -9,7 +9,7 @@ resource "kubernetes_namespace" "health" {
|
|||
metadata {
|
||||
name = "health"
|
||||
labels = {
|
||||
tier = local.tiers.aux
|
||||
tier = local.tiers.aux
|
||||
"keel.sh/enrolled" = "true"
|
||||
}
|
||||
}
|
||||
|
|
@ -128,6 +128,15 @@ resource "kubernetes_deployment" "health" {
|
|||
name = "COOKIE_SECURE"
|
||||
value = "true"
|
||||
}
|
||||
env {
|
||||
# ADR-0008 (health repo): identity for the internal LAN test host.
|
||||
# Only reached when no X-authentik-email header is present — i.e. via
|
||||
# the auth="none" test ingress below. The public host's forward-auth
|
||||
# fails closed, so requests arriving there always carry the real
|
||||
# header and never fall back to this value.
|
||||
name = "DEV_AUTH_EMAIL"
|
||||
value = "vbarzin@gmail.com"
|
||||
}
|
||||
|
||||
volume_mount {
|
||||
name = "uploads"
|
||||
|
|
@ -197,6 +206,15 @@ module "ingress" {
|
|||
name = "health"
|
||||
tls_secret_name = var.tls_secret_name
|
||||
max_body_size = "100m"
|
||||
# The redesigned SPA bursts well past the default 10/50 limiter on each page
|
||||
# load (shell + fonts + a 5-8 call API burst). Swap the shared limiter for a
|
||||
# health-specific one (100/1000), mirroring tripit/immich/actualbudget.
|
||||
# The ref MUST carry the middleware's namespace prefix: the CRD lives in the
|
||||
# `traefik` ns, so it's `traefik-health-rate-limit@kubernetescrd` (same form as
|
||||
# traefik-tripit-rate-limit). Without the prefix Traefik can't resolve it and
|
||||
# 404s the whole router.
|
||||
skip_default_rate_limit = true
|
||||
extra_middlewares = ["traefik-health-rate-limit@kubernetescrd"]
|
||||
extra_annotations = {
|
||||
"gethomepage.dev/enabled" = "true"
|
||||
"gethomepage.dev/name" = "Health"
|
||||
|
|
@ -207,6 +225,30 @@ module "ingress" {
|
|||
}
|
||||
}
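The naming rule described in the rate-limit comment above can be sketched as follows. `middlewareRef` is a hypothetical helper, not something `ingress_factory` exposes; it only illustrates the `<namespace>-<name>@kubernetescrd` form that the comment says Traefik needs.

```javascript
// Hypothetical helper for illustration; the ingress_factory module does not provide it.
// Builds a reference to a Traefik Middleware CRD in the form
// "<namespace>-<name>@kubernetescrd".
function middlewareRef(namespace, name) {
  if (!namespace || !name) throw new Error('namespace and name are required');
  return `${namespace}-${name}@kubernetescrd`;
}

console.log(middlewareRef('traefik', 'health-rate-limit'));
// traefik-health-rate-limit@kubernetescrd
```

Dropping the namespace would give `health-rate-limit@kubernetescrd`, which is the unresolvable form the comment warns about.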
|
||||
|
||||
# https://health-test.viktorbarzin.lan — internal LAN-only test host for
|
||||
# automated/E2E testing + manual screenshots without the Authentik SSO dance
|
||||
# (ADR-0008). Same `health` deployment; acts as DEV_AUTH_EMAIL=vbarzin@gmail.com.
|
||||
module "ingress_test" {
|
||||
source = "../../modules/kubernetes/ingress_factory"
|
||||
# auth = "none": LAN-only (allow_local_access_only) test host — no public
|
||||
# exposure; the public health.viktorbarzin.me ingress above stays
|
||||
# auth="required". No user data gate here by design — it serves the real app
|
||||
# as DEV_AUTH_EMAIL since no X-authentik-email is injected (ADR-0008).
|
||||
auth = "none"
|
||||
namespace = kubernetes_namespace.health.metadata[0].name
|
||||
name = "health-test"
|
||||
root_domain = "viktorbarzin.lan"
|
||||
service_name = kubernetes_service.health.metadata[0].name
|
||||
tls_secret_name = var.tls_secret_name
|
||||
allow_local_access_only = true
|
||||
ssl_redirect = false
|
||||
max_body_size = "100m"
|
||||
anti_ai_scraping = false
|
||||
extra_annotations = {
|
||||
"gethomepage.dev/enabled" = "false"
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_manifest" "external_secret_db" {
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1beta1"
|
||||
|
|
|
|||
|
|
@ -1,7 +1,7 @@
|
|||
FROM node:22-alpine AS build
|
||||
WORKDIR /app
|
||||
COPY package*.json ./
|
||||
RUN npm ci
|
||||
RUN npm install --no-audit --no-fund
|
||||
COPY . .
|
||||
RUN npm run build
|
||||
|
||||
|
|
|
|||
1068
stacks/k8s-portal/modules/k8s-portal/files/package-lock.json
generated
Normal file
File diff suppressed because it is too large
24
stacks/k8s-portal/modules/k8s-portal/files/package.json
Normal file
|
|
@ -0,0 +1,24 @@
|
|||
{
|
||||
"name": "k8s-portal",
|
||||
"private": true,
|
||||
"version": "0.0.1",
|
||||
"type": "module",
|
||||
"scripts": {
|
||||
"dev": "vite dev",
|
||||
"build": "vite build",
|
||||
"preview": "vite preview",
|
||||
"prepare": "svelte-kit sync || echo ''",
|
||||
"check": "svelte-kit sync && svelte-check --tsconfig ./tsconfig.json",
|
||||
"check:watch": "svelte-kit sync && svelte-check --tsconfig ./tsconfig.json --watch"
|
||||
},
|
||||
"devDependencies": {
|
||||
"@sveltejs/adapter-auto": "^7.0.0",
|
||||
"@sveltejs/adapter-node": "^5.5.3",
|
||||
"@sveltejs/kit": "^2.50.2",
|
||||
"@sveltejs/vite-plugin-svelte": "^6.2.4",
|
||||
"svelte": "^5.49.2",
|
||||
"svelte-check": "^4.3.6",
|
||||
"typescript": "^5.9.3",
|
||||
"vite": "^7.3.1"
|
||||
}
|
||||
}
|
||||
|
|
@ -9,7 +9,7 @@ resource "kubernetes_namespace" "k8s_portal" {
|
|||
metadata {
|
||||
name = "k8s-portal"
|
||||
labels = {
|
||||
tier = var.tier
|
||||
tier = var.tier
|
||||
"keel.sh/enrolled" = "true"
|
||||
}
|
||||
}
|
||||
|
|
@ -40,6 +40,15 @@ resource "kubernetes_deployment" "k8s_portal" {
|
|||
metadata {
|
||||
name = "k8s-portal"
|
||||
namespace = kubernetes_namespace.k8s_portal.metadata[0].name
|
||||
# ADR-0002 / no-local-builds: image now GHA-built -> ghcr:latest
|
||||
# (.github/workflows/build-k8s-portal.yml). Keel polls ghcr:latest and rolls
|
||||
# this deployment (replaces the removed Woodpecker in-cluster build+deploy).
|
||||
annotations = {
|
||||
"keel.sh/policy" = "force"
|
||||
"keel.sh/trigger" = "poll"
|
||||
"keel.sh/pollSchedule" = "@every 5m"
|
||||
"keel.sh/match-tag" = "true"
|
||||
}
|
||||
labels = {
|
||||
app = "k8s-portal"
|
||||
tier = var.tier
|
||||
|
|
@ -66,9 +75,16 @@ resource "kubernetes_deployment" "k8s_portal" {
|
|||
}
|
||||
|
||||
spec {
|
||||
# GHCR pull secret: the ghcr-credentials Secret in this namespace is
|
||||
# cloned in by the kyverno stack's sync-ghcr-credentials ClusterPolicy
|
||||
# (allowlisted private-ghcr namespaces only — ADR-0002). Source of
|
||||
# truth: stacks/kyverno/modules/kyverno/ghcr-credentials.tf.
|
||||
image_pull_secrets {
|
||||
name = "ghcr-credentials"
|
||||
}
|
||||
container {
|
||||
name = "portal"
|
||||
image = "viktorbarzin/k8s-portal:latest"
|
||||
image = "ghcr.io/viktorbarzin/k8s-portal:latest"
|
||||
port {
|
||||
container_port = 3000
|
||||
}
|
||||
|
|
@ -121,7 +137,8 @@ resource "kubernetes_deployment" "k8s_portal" {
|
|||
# DRIFT_WORKAROUND: CI pipeline owns image tag (kubectl set image from Woodpecker/GHA); Kyverno mutates dns_config for ndots. Reviewed 2026-04-18.
|
||||
ignore_changes = [
|
||||
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
|
||||
spec[0].template[0].spec[0].container[0].image, # CI updates image tag
|
||||
spec[0].template[0].spec[0].container[0].image, # Keel manages ghcr:latest digest
|
||||
metadata[0].annotations["keel.sh/update-time"], # KEEL_LIFECYCLE_V1 (Keel stamps on roll)
|
||||
]
|
||||
}
|
||||
}
|
||||
|
|
@ -172,5 +189,5 @@ module "ingress_setup_script" {
|
|||
ingress_path = ["/setup/script", "/agent"]
|
||||
tls_secret_name = var.tls_secret_name
|
||||
# auth = "none": Setup script + agent endpoint must be curl-able without auth (no cookies preserved in automation).
|
||||
auth = "none"
|
||||
auth = "none"
|
||||
}
|
||||
|
|
|
|||
|
|
@ -27,6 +27,10 @@ locals {
|
|||
# openclaw's install-recruiter-plugin init container pulls the PRIVATE
|
||||
# ghcr.io/viktorbarzin/recruiter-responder:latest image (infra#27).
|
||||
"openclaw",
|
||||
# k8s-portal: last in-cluster image build, migrated to GHA→ghcr (ADR-0002,
|
||||
# "no local builds"). ghcr.io/viktorbarzin/k8s-portal:latest is PRIVATE
|
||||
# (infra repo default); the deployment references the cloned secret.
|
||||
"k8s-portal",
|
||||
]
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -553,7 +553,7 @@ resource "kubernetes_deployment" "openclaw" {
|
|||
# IfNotPresent: a cached stale :latest meant the plugin manifest
|
||||
# (configSchema fix) never got pulled. An uncached SHA forces the
|
||||
# pull. Bump this when the openclaw plugin in nextcloud-todos changes.
|
||||
image = "forgejo.viktorbarzin.me/viktor/nextcloud-todos:f85c6de1"
|
||||
image = "ghcr.io/viktorbarzin/nextcloud-todos:latest"
|
||||
image_pull_policy = "Always"
|
||||
command = ["sh", "-c", <<-EOT
|
||||
set -eu
|
||||
|
|
|
|||
136
stacks/t3-afk/files/dispatcher.js
Normal file
|
|
@ -0,0 +1,136 @@
|
|||
// t3-afk auto-pair dispatcher
|
||||
// ----------------------------------------------------------------------------
|
||||
// Replicates the devvm t3-dispatch experience for the single in-cluster T3
|
||||
// instance. The ingress is Authentik-gated (auth=required), so every request
|
||||
// that reaches here is already authenticated. On a cookieless *document*
|
||||
// navigation we mint a one-time pairing credential (`t3 auth pairing create`)
|
||||
// and exchange it at the t3 server's /api/auth/browser-session endpoint for the
|
||||
// `t3_session` cookie, then 302 back — so the user never sees the manual
|
||||
// /pair#token screen. Everything else (incl. WebSocket upgrades for the cockpit
|
||||
// live stream + terminals) is reverse-proxied straight through to t3 serve.
|
||||
//
|
||||
// Single upstream, same pod (localhost) — kept dependency-free (Node stdlib).
|
||||
'use strict';
|
||||
const http = require('http');
|
||||
const net = require('net');
|
||||
const { execFile } = require('child_process');
|
||||
|
||||
const UPSTREAM_HOST = '127.0.0.1';
|
||||
const UPSTREAM_PORT = Number(process.env.T3_UPSTREAM_PORT || 3773);
|
||||
const LISTEN_PORT = Number(process.env.DISPATCHER_PORT || 8080);
|
||||
const T3_BIN = process.env.T3_BIN || '/data/npm-global/bin/t3';
|
||||
const BASE_DIR = process.env.T3CODE_HOME || '/data/t3';
|
||||
const COOKIE = 't3_session';
|
||||
const childEnv = { ...process.env, PATH: '/data/npm-global/bin:' + (process.env.PATH || ''), HOME: '/home/node' };
|
||||
|
||||
const hasSession = (req) =>
|
||||
(req.headers.cookie || '').split(/;\s*/).some((c) => c.startsWith(COOKIE + '='));
|
||||
|
||||
const isDocNav = (req) => {
|
||||
if (req.method !== 'GET') return false;
|
||||
const dest = req.headers['sec-fetch-dest'];
|
||||
if (dest) return dest === 'document';
|
||||
return (req.headers['accept'] || '').includes('text/html');
|
||||
};
|
||||
|
||||
const mintCredential = () =>
|
||||
new Promise((resolve, reject) => {
|
||||
execFile(
|
||||
T3_BIN,
|
||||
['auth', 'pairing', 'create', '--base-dir', BASE_DIR, '--ttl', '5m', '--json'],
|
||||
{ env: childEnv, timeout: 15000 },
|
||||
(err, stdout) => {
|
||||
if (err) return reject(err);
|
||||
try {
|
||||
const cred = JSON.parse(stdout).credential;
|
||||
cred ? resolve(cred) : reject(new Error('no credential in pairing output'));
|
||||
} catch (e) {
|
||||
reject(e);
|
||||
}
|
||||
},
|
||||
);
|
||||
});
|
||||
|
||||
const exchange = (credential) =>
|
||||
new Promise((resolve, reject) => {
|
||||
const body = JSON.stringify({ credential });
|
||||
const r = http.request(
|
||||
{
|
||||
host: UPSTREAM_HOST,
|
||||
port: UPSTREAM_PORT,
|
||||
path: '/api/auth/browser-session',
|
||||
method: 'POST',
|
||||
headers: { 'content-type': 'application/json', 'content-length': Buffer.byteLength(body) },
|
||||
},
|
||||
(resp) => {
|
||||
const setCookie = resp.headers['set-cookie'] || [];
|
||||
resp.resume();
|
||||
resp.on('end', () =>
|
||||
resp.statusCode === 200 && setCookie.length
|
||||
? resolve(setCookie)
|
||||
: reject(new Error('browser-session exchange returned ' + resp.statusCode)),
|
||||
);
|
||||
},
|
||||
);
|
||||
r.on('error', reject);
|
||||
r.write(body);
|
||||
r.end();
|
||||
});
|
||||
|
||||
const proxyHttp = (req, res) => {
|
||||
const up = http.request(
|
||||
{ host: UPSTREAM_HOST, port: UPSTREAM_PORT, path: req.url, method: req.method, headers: req.headers },
|
||||
(r) => {
|
||||
res.writeHead(r.statusCode, r.headers);
|
||||
r.pipe(res);
|
||||
},
|
||||
);
|
||||
up.on('error', () => {
|
||||
if (!res.headersSent) res.writeHead(502);
|
||||
res.end('bad gateway');
|
||||
});
|
||||
req.pipe(up);
|
||||
};
|
||||
|
||||
const server = http.createServer(async (req, res) => {
|
||||
if (req.url === '/healthz') {
|
||||
res.writeHead(200);
|
||||
return res.end('ok');
|
||||
}
|
||||
if (!hasSession(req) && isDocNav(req)) {
|
||||
try {
|
||||
const cred = await mintCredential();
|
||||
const setCookie = await exchange(cred);
|
||||
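// Collapse leading slashes so a path such as //host cannot turn this into a protocol-relative redirect.
res.writeHead(302, { location: '/' + (req.url || '/').replace(/^\/+/, ''), 'set-cookie': setCookie, 'cache-control': 'no-store' });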
|
||||
return res.end();
|
||||
} catch (err) {
|
||||
// Fall through to a plain proxy; the cockpit's own /pair screen is the
|
||||
// fallback if auto-pair ever fails, so we never hard-fail the request.
|
||||
console.error('auto-pair failed, proxying through:', err.message);
|
||||
}
|
||||
}
|
||||
proxyHttp(req, res);
|
||||
});
|
||||
|
||||
// WebSocket / Upgrade passthrough — the cockpit's live orchestration stream and
|
||||
// terminals need this. Reconstruct the upgrade request and splice the sockets.
|
||||
server.on('upgrade', (req, socket, head) => {
|
||||
const up = net.connect(UPSTREAM_PORT, UPSTREAM_HOST, () => {
|
||||
up.write(
|
||||
`${req.method} ${req.url} HTTP/1.1\r\n` +
|
||||
Object.entries(req.headers)
|
||||
.map(([k, v]) => `${k}: ${v}`)
|
||||
.join('\r\n') +
|
||||
'\r\n\r\n',
|
||||
);
|
||||
if (head && head.length) up.write(head);
|
||||
socket.pipe(up);
|
||||
up.pipe(socket);
|
||||
});
|
||||
up.on('error', () => socket.destroy());
|
||||
socket.on('error', () => up.destroy());
|
||||
});
|
||||
|
||||
server.listen(LISTEN_PORT, '0.0.0.0', () =>
|
||||
console.log(`t3-afk dispatcher listening on :${LISTEN_PORT} -> ${UPSTREAM_HOST}:${UPSTREAM_PORT}`),
|
||||
);
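To make the routing rule concrete, the sketch below restates the dispatcher's two predicates as plain functions over a method and a headers object, then runs them over a few sample requests. `shouldAutoPair` and the sample requests are illustrative names introduced here; they are not part of `dispatcher.js`.

```javascript
'use strict';

const COOKIE = 't3_session';

// True when the Cookie header already carries a t3_session cookie.
const hasSession = (headers) =>
  (headers.cookie || '').split(/;\s*/).some((c) => c.startsWith(COOKIE + '='));

// True for top-level page loads: GET with Sec-Fetch-Dest: document, or,
// when that header is absent, an Accept header that includes text/html.
const isDocNav = (method, headers) => {
  if (method !== 'GET') return false;
  const dest = headers['sec-fetch-dest'];
  if (dest) return dest === 'document';
  return (headers.accept || '').includes('text/html');
};

// Illustrative wrapper (not in dispatcher.js): auto-pair only on a
// cookieless document navigation; everything else is proxied untouched.
const shouldAutoPair = (method, headers) =>
  !hasSession(headers) && isDocNav(method, headers);

const samples = [
  ['first page load', 'GET', { 'sec-fetch-dest': 'document' }],
  ['page load with cookie', 'GET', { 'sec-fetch-dest': 'document', cookie: 't3_session=abc' }],
  ['API call', 'GET', { 'sec-fetch-dest': 'empty', accept: 'text/html' }],
  ['client without Sec-Fetch-Dest', 'GET', { accept: 'text/html,application/xhtml+xml' }],
  ['form post', 'POST', { 'sec-fetch-dest': 'document' }],
];

for (const [label, method, headers] of samples) {
  console.log(`${label}: ${shouldAutoPair(method, headers) ? 'auto-pair' : 'proxy'}`);
}
```

Only the first and fourth samples take the auto-pair path. Note that `Sec-Fetch-Dest`, when present, wins over `Accept`, which is why the API call is proxied even though it accepts `text/html`.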
|
||||
59
stacks/t3-afk/files/issue-implementer-CLAUDE.md
Normal file
|
|
@ -0,0 +1,59 @@
|
|||
# issue-implementer: autonomous AFK coding agent

You are **issue-implementer**, an autonomous agent that implements ONE GitHub
issue end-to-end and lands it, with no human at the keyboard. This file is your
standing behaviour; the specific task arrives as your prompt. You run inside a
T3 Code thread in `full-access` mode (skip-permissions), so there is no one to
answer questions mid-run.

## Autonomy: non-negotiable (you will hang otherwise)

- **Never enter plan mode and never call `ExitPlanMode`.** It is intercepted and
  will stall this thread forever.
- **Never ask clarifying questions / never call `AskUserQuestion`.** No human is
  watching. Make the most reasonable assumption, state it in a commit/your final
  message, and proceed.
- If you hit something you genuinely cannot resolve safely, **stop and write a
  precise blocker report as your final message** (what you tried, what's
  unresolved, what you'd need). Do not thrash. The orchestrator escalates it to a
  human; that is the only "ask for help" channel you have.

## What to do

1. **Understand the task.** Your prompt contains the issue (number, what to
   build, acceptance criteria). Read the issue's AGENT-BRIEF if present.
2. **Work in the prepared worktree.** You are already in a git worktree on a
   branch off `master`. Read the repo's own `CLAUDE.md`, `CONTEXT.md`, and any
   `docs/adr/` in the area you touch; use its domain vocabulary and respect its
   decisions.
3. **Test-first (TDD).** Write a failing test that captures the desired
   behaviour, make it pass, then refactor. Prefer property/parameterized tests.
   Run the repo's actual test suite and get it green before you commit. Do not
   test implementation details; test external behaviour.
4. **Commit.** Subject = what changed; body = why, paraphrasing the issue in
   plain words. Include `Closes #<issue-number>` and the trailer
   `Implemented-by: issue-implementer (AFK)`. Stage files by name, never
   `git add -A`/`.`. Never skip hooks.
5. **Land it.** Push your branch to `master` (`git push origin HEAD:master`). If
   the push is rejected non-fast-forward, fetch, merge `origin/master`, re-run
   the tests, and push again. Pushing to `master` is the intended behaviour:
   CI builds and deploys from there.
6. **Report.** Your final message is a concise summary: what you built, the
   commit, and anything a reviewer should know. (CI/deploy watching and any
   fix-forward/freeze handling are done by the control plane, not by you. Once
   you've pushed green code, your job is done.)

## Guardrails (hard limits)

- **Never force-push** to `master`.
- **Never delete PVCs/PVs**, drop database tables, or run destructive data ops.
- **Never edit Vault directly**, and never commit secrets.
- **Infrastructure changes go through Terraform/Terragrunt only**; never
  `kubectl apply/edit/patch` as the final state.
- **Never use `[ci skip]`**: it hides the change from the audit feed.
- Stay within the issue's scope. Don't refactor adjacent code beyond what the
  task needs.

## Done means

Tests green **and** pushed to `master`. Not "code written" but landed.
|
||||
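The push-and-retry rule in the "Land it" step can be sketched as a small loop. This is an illustration only: `landOnMaster`, the injected `run` callback with its `{ ok, nonFastForward }` result shape, the `npm test` placeholder, and the retry cap are all assumptions introduced here, and a real runner would have to detect the rejection from git's actual output.

```javascript
// Illustrative sketch; `run` is a hypothetical injected command runner that
// returns { ok: boolean, nonFastForward?: boolean } for each command.
function landOnMaster(run, testCommand, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Plain push only: the guardrails forbid force-pushing to master.
    const push = run('git', ['push', 'origin', 'HEAD:master']);
    if (push.ok) return { landed: true, attempts: attempt };
    // Only a non-fast-forward rejection is retried; anything else is a blocker.
    if (!push.nonFastForward) return { landed: false, reason: 'push failed' };

    if (!run('git', ['fetch', 'origin']).ok) return { landed: false, reason: 'fetch failed' };
    if (!run('git', ['merge', 'origin/master']).ok) return { landed: false, reason: 'merge failed' };
    // Re-run the repo's own test suite on the merged result before pushing again.
    if (!run(testCommand[0], testCommand.slice(1)).ok) {
      return { landed: false, reason: 'tests failed after merge' };
    }
  }
  return { landed: false, reason: 'still rejected after retries' };
}

// Fake runner: the first push is rejected, everything afterwards succeeds.
const calls = [];
let pushes = 0;
const fakeRun = (cmd, args) => {
  calls.push([cmd, ...args].join(' '));
  if (cmd === 'git' && args[0] === 'push') {
    pushes += 1;
    return pushes === 1 ? { ok: false, nonFastForward: true } : { ok: true };
  }
  return { ok: true };
};

const result = landOnMaster(fakeRun, ['npm', 'test']);
console.log(result);
console.log(calls);
```

Any failure other than a non-fast-forward rejection ends the loop with a reason, which matches the instruction to stop and write a blocker report instead of thrashing.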
408
stacks/t3-afk/main.tf
Normal file
|
|
@ -0,0 +1,408 @@
|
|||
# =============================================================================
|
||||
# t3-afk — dedicated, in-cluster T3 Code instance: the EXECUTOR + COCKPIT for the
|
||||
# AFK implementation pipeline (slice #2 of claude-agent-service PRD #1).
|
||||
#
|
||||
# claude-agent-service (control plane) dispatches issues INTO this T3 instance
|
||||
# over its orchestration HTTP API; T3 runs the issue-implementer agent in a git
|
||||
# worktree and shows every worker in its cockpit. See:
|
||||
# claude-agent-service/docs/2026-06-14-afk-implementation-pipeline-design.md
|
||||
# claude-agent-service/docs/adr/0003-t3-thin-executor-and-cockpit.md
|
||||
#
|
||||
# PILOT SHORTCUT (chosen 2026-06-14): no custom-built image. We run stock
|
||||
# `node:24` (the full image ships git + python3/make/g++ for node-pty) and an
|
||||
# init container installs PINNED npm packages (t3@0.0.27 + the Claude CLI) onto
|
||||
# the SSD PVC, cached across restarts. Formalize a digest-pinned built image
|
||||
# post-GO. T3 is version-pinned (npm) and NOT Keel-enrolled.
|
||||
# =============================================================================
|
||||
|
||||
# No plan-time Vault reads — every secret flows through the ExternalSecret below
|
||||
# (CLAUDE_CODE_OAUTH_TOKEN / GITHUB_TOKEN / FORGEJO_TOKEN), injected as env at
|
||||
# runtime. Nothing here needs a secret value at plan time.
|
||||
|
||||
# Wildcard TLS secret name — value comes from config.tfvars; consumed by the
|
||||
# ingress factory (every stack that uses the factory declares this).
|
||||
variable "tls_secret_name" {}
|
||||
|
||||
locals {
|
||||
namespace = "t3-afk"
|
||||
# Stock node base — the FULL node:24 (not -slim) is buildpack-deps-based, so it
|
||||
# ships git + build-essential (python3/make/g++) that node-pty + the agent need.
|
||||
# Fully-qualified (docker.io/library/...) to satisfy the Kyverno
|
||||
# require-trusted-registries allowlist via `docker.io/*` — bare `node*` is NOT
|
||||
# on the bare-DockerHub-library list (alpine*/busybox*/python* are).
|
||||
image = "docker.io/library/node:24"
|
||||
# Pinned npm versions installed at startup (the reproducibility anchor for the
|
||||
# pilot until a digest-pinned image exists).
|
||||
t3_version = "0.0.27"
|
||||
claude_cli_version = "latest" # @anthropic-ai/claude-code (floating tag, not actually pinned; only t3 is version-pinned)
|
||||
labels = {
|
||||
app = "t3-afk"
|
||||
}
|
||||
}
|
||||
|
||||
# --- Namespace ---
|
||||
|
||||
resource "kubernetes_namespace" "t3_afk" {
|
||||
metadata {
|
||||
name = local.namespace
|
||||
labels = {
|
||||
tier = local.tiers.aux
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# --- Secrets ---
|
||||
# The Claude provider authenticates with CLAUDE_CODE_OAUTH_TOKEN (T3 passes the
|
||||
# environment straight through to the embedded claude-agent-sdk + claude CLI).
|
||||
# GITHUB_TOKEN / FORGEJO_TOKEN authenticate the agent's `git push` from worktrees
|
||||
# (wired into ~/.gitconfig insteadOf rewrites in the container command).
|
||||
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1beta1"
|
||||
kind = "ExternalSecret"
|
||||
metadata = {
|
||||
name = "t3-afk-secrets"
|
||||
namespace = local.namespace
|
||||
}
|
||||
spec = {
|
||||
refreshInterval = "15m"
|
||||
secretStoreRef = {
|
||||
name = "vault-kv"
|
||||
kind = "ClusterSecretStore"
|
||||
}
|
||||
target = { name = "t3-afk-secrets" }
|
||||
data = [
|
||||
{
|
||||
secretKey = "CLAUDE_CODE_OAUTH_TOKEN"
|
||||
remoteRef = { key = "claude-agent-service", property = "claude_oauth_token" }
|
||||
},
|
||||
{
|
||||
secretKey = "GITHUB_TOKEN"
|
||||
remoteRef = { key = "viktor", property = "github_pat" }
|
||||
},
|
||||
{
|
||||
# Shared viktor-scoped admin PAT (also used by Woodpecker + the
|
||||
# claude-agent pod). Lets the agent git push / open PRs on Forgejo.
|
||||
secretKey = "FORGEJO_TOKEN"
|
||||
remoteRef = { key = "ci/global", property = "forgejo_push_token" }
|
||||
},
|
||||
]
|
||||
}
|
||||
}
|
||||
depends_on = [kubernetes_namespace.t3_afk]
|
||||
}
|
||||
|
||||
# issue-implementer behaviour is intentionally NOT mounted as ~/.claude/CLAUDE.md:
|
||||
# T3's SDK invocation doesn't honor it, and mounting a subPath into ~/.claude
|
||||
# makes that dir root-owned and breaks the agent's Bash session-env. The control
|
||||
# plane injects the behaviour as a dispatch message preamble instead;
|
||||
# files/issue-implementer-CLAUDE.md is kept as the canonical source for that text.
|
||||
|
||||
# Auto-pair dispatcher script (run by the sidecar container below). Mirrors the
|
||||
# devvm t3-dispatch: on a cookieless, Authentik-gated page load it mints a
|
||||
# pairing credential and exchanges it for the t3_session cookie, so the user
|
||||
# never sees the manual /pair screen. Reverse-proxies everything else (incl.
|
||||
# WebSockets) to t3 serve.
|
||||
resource "kubernetes_config_map" "dispatcher" {
|
||||
metadata {
|
||||
name = "t3-afk-dispatcher"
|
||||
namespace = kubernetes_namespace.t3_afk.metadata[0].name
|
||||
}
|
||||
data = {
|
||||
"dispatcher.js" = file("${path.module}/files/dispatcher.js")
|
||||
}
|
||||
}
|
||||
|
||||
# --- Storage ---
|
||||
# SSD-NFS (small-file friendly) for the T3 base dir: state.sqlite + the
|
||||
# server-signing-key (losing it invalidates every issued bearer), per-thread git
|
||||
# worktrees, the npm global install, and caches. ADR 0004.
|
||||
module "data" {
|
||||
source = "../../modules/kubernetes/nfs_volume"
|
||||
name = "t3-afk-data"
|
||||
namespace = kubernetes_namespace.t3_afk.metadata[0].name
|
||||
nfs_server = "192.168.1.127"
|
||||
nfs_path = "/srv/nfs-ssd/t3-afk-data"
|
||||
storage = "30Gi"
|
||||
}
|
||||
|
||||
# --- Deployment ---
|
||||
|
||||
resource "kubernetes_deployment" "t3_afk" {
|
||||
# Slow first start (image pull + npm install init + ESO secret sync) can
|
||||
# exceed the default rollout-wait timeout; verify pod readiness out-of-band.
|
||||
wait_for_rollout = false
|
||||
|
||||
metadata {
|
||||
name = "t3-afk"
|
||||
namespace = kubernetes_namespace.t3_afk.metadata[0].name
|
||||
labels = local.labels
|
||||
# keel.sh/policy=never must be a DEPLOYMENT-level annotation — that's where
|
||||
# Keel reads it. (A pod-template label is ignored by Keel, which is why the
|
||||
# earlier attempt failed.) The cluster's Kyverno inject-keel-annotations
|
||||
# policy is opt-OUT: it stamps policy=patch on any workload that doesn't
|
||||
# carry its own keel.sh/policy — and Keel then "patch"-downgraded
|
||||
# node:24 -> node:24.0.2 (below t3@0.0.27's required node >=24.10), which
|
||||
# crash-looped `t3 serve`. ADR 0003 (Keel-excluded).
|
||||
annotations = {
|
||||
"keel.sh/policy" = "never"
|
||||
}
|
||||
}
|
||||
|
||||
spec {
|
||||
replicas = 1
|
||||
# Single-writer state.sqlite — never run two pods against the same base dir.
|
||||
strategy {
|
||||
type = "Recreate"
|
||||
}
|
||||
|
||||
selector {
|
||||
match_labels = local.labels
|
||||
}
|
||||
|
||||
template {
|
||||
metadata {
|
||||
labels = local.labels
|
||||
}
|
||||
|
||||
spec {
|
||||
security_context {
|
||||
run_as_user = 1000 # node
|
||||
run_as_group = 1000
|
||||
fs_group = 1000
|
||||
}
|
||||
|
||||
# NFS mounts land root-owned; make /data writable by uid 1000.
|
||||
init_container {
|
||||
name = "fix-perms"
|
||||
image = "busybox:1.37"
|
||||
command = ["sh", "-c", "mkdir -p /data && chown -R 1000:1000 /data && chmod 0775 /data"]
|
||||
security_context {
|
||||
run_as_user = 0
|
||||
}
|
||||
volume_mount {
|
||||
name = "data"
|
||||
mount_path = "/data"
|
||||
}
|
||||
resources {
|
||||
requests = { memory = "32Mi" }
|
||||
limits = { memory = "64Mi" }
|
||||
}
|
||||
}
|
||||
|
||||
# Install pinned t3 + Claude CLI onto the PVC (cached; skipped if already
|
||||
# present). Runs as uid 1000 so the install is owned by the runtime user.
|
||||
init_container {
|
||||
name = "install-t3"
|
||||
image = local.image
|
||||
command = ["bash", "-c", <<-EOF
|
||||
set -e
|
||||
export npm_config_cache=/data/npm-cache
|
||||
export npm_config_prefix=/data/npm-global
|
||||
mkdir -p /data/npm-global /data/npm-cache
|
||||
if [ ! -x /data/npm-global/bin/t3 ]; then
|
||||
echo "installing t3@${local.t3_version} + claude CLI ..."
|
||||
npm install -g "t3@${local.t3_version}" "@anthropic-ai/claude-code@${local.claude_cli_version}"
|
||||
else
|
||||
echo "t3 already installed: $(/data/npm-global/bin/t3 --version 2>/dev/null || echo unknown)"
|
||||
fi
|
||||
EOF
|
||||
]
|
||||
volume_mount {
|
||||
name = "data"
|
||||
mount_path = "/data"
|
||||
}
|
||||
resources {
|
||||
requests = { cpu = "200m", memory = "512Mi" }
|
||||
limits = { memory = "1Gi" }
|
||||
}
|
||||
}
|
||||
|
||||
container {
|
||||
name = "t3"
|
||||
image = local.image
|
||||
|
||||
# Configure git auth for the agent's pushes, then run T3 headless.
|
||||
# $$ escapes Terraform interpolation so the shell expands the env vars.
|
||||
command = ["bash", "-c", <<-EOF
|
||||
set -e
|
||||
export PATH=/data/npm-global/bin:$$PATH
|
||||
export npm_config_cache=/data/npm-cache
|
||||
|
||||
# git identity + token rewrites so the agent can push from worktrees.
|
||||
git config --global user.name "issue-implementer (AFK)"
|
||||
git config --global user.email "afk-agent@viktorbarzin.me"
|
||||
git config --global url."https://$${GITHUB_TOKEN}@github.com/".insteadOf "https://github.com/"
|
||||
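# --add is required here: a second plain "git config" on the same key would replace the https insteadOf value set above.
git config --global --add url."https://$${GITHUB_TOKEN}@github.com/".insteadOf "git@github.com:"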
|
||||
if [ -n "$${FORGEJO_TOKEN}" ]; then
|
||||
git config --global url."https://$${FORGEJO_TOKEN}@forgejo.viktorbarzin.me/".insteadOf "https://forgejo.viktorbarzin.me/"
|
||||
fi
|
||||
|
||||
exec t3 serve --mode web --host 0.0.0.0 --port 3773 --base-dir /data/t3
|
||||
EOF
|
||||
]
|
||||
|
||||
port {
|
||||
container_port = 3773
|
||||
}
|
||||
|
||||
env_from {
|
||||
secret_ref {
|
||||
name = "t3-afk-secrets"
|
||||
}
|
||||
}
|
||||
|
||||
env {
|
||||
name = "HOME"
|
||||
value = "/home/node"
|
||||
}
|
||||
env {
|
||||
name = "T3CODE_HOME"
|
||||
value = "/data/t3"
|
||||
}
|
||||
|
||||
# T3's API needs auth even for liveness; use a TCP probe on the port.
|
||||
liveness_probe {
|
||||
tcp_socket {
|
||||
port = 3773
|
||||
}
|
||||
initial_delay_seconds = 30
|
||||
period_seconds = 30
|
||||
}
|
||||
readiness_probe {
|
||||
tcp_socket {
|
||||
port = 3773
|
||||
}
|
||||
initial_delay_seconds = 15
|
||||
period_seconds = 10
|
||||
}
|
||||
|
||||
volume_mount {
|
||||
name = "data"
|
||||
mount_path = "/data"
|
||||
}
|
||||
# NOTE: do NOT mount anything into /home/node/.claude — a subPath
|
||||
# mount makes that dir root-owned, which blocks the agent (uid 1000)
|
||||
# from creating its Bash session-env there and breaks ALL Bash/git for
|
||||
# the agent (root cause of the 2026-06-15 "agent never commits"). T3's
|
||||
# SDK invocation doesn't honor ~/.claude/CLAUDE.md anyway, so the
|
||||
# issue-implementer behaviour is injected via the dispatch message
|
||||
# preamble by the control plane instead.
|
||||
|
||||
# Burstable (tier-aux). A live agent thread (node + claude) is memory
|
||||
# heavy; size for a small number of concurrent threads on this pilot
|
||||
# instance. No CPU limit per cluster policy.
|
||||
resources {
|
||||
requests = {
|
||||
cpu = "1"
|
||||
memory = "2Gi"
|
||||
}
|
||||
# Capped at the tier-aux LimitRange max (4Gi/container). If real
|
||||
# workloads OOM, opt the namespace out via the
|
||||
# resource-governance/custom-limitrange label (as claude-agent-service
|
||||
# does) and raise this.
|
||||
limits = {
|
||||
memory = "4Gi"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# Auto-pair dispatcher (sidecar). The Service points at this (:8080); it
|
||||
# reverse-proxies to t3 serve (:3773) and injects the session cookie so
|
||||
# the browser experience matches t3.viktorbarzin.me. Shares /data so it
|
||||
# can exec the t3 CLI to mint pairing credentials.
|
||||
container {
|
||||
name = "dispatcher"
|
||||
image = local.image
|
||||
command = ["node", "/scripts/dispatcher.js"]
|
||||
port {
|
||||
container_port = 8080
|
||||
}
|
||||
env {
|
||||
name = "HOME"
|
||||
value = "/home/node"
|
||||
}
|
||||
readiness_probe {
|
||||
http_get {
|
||||
path = "/healthz"
|
||||
port = 8080
|
||||
}
|
||||
initial_delay_seconds = 10
|
||||
period_seconds = 10
|
||||
}
|
||||
volume_mount {
|
||||
name = "data"
|
||||
mount_path = "/data"
|
||||
}
|
||||
volume_mount {
|
||||
name = "dispatcher"
|
||||
mount_path = "/scripts"
|
||||
}
|
||||
resources {
|
||||
requests = { cpu = "50m", memory = "64Mi" }
|
||||
limits = { memory = "256Mi" }
|
||||
}
|
||||
}
|
||||
|
||||
volume {
|
||||
name = "data"
|
||||
persistent_volume_claim {
|
||||
claim_name = module.data.claim_name
|
||||
}
|
||||
}
|
||||
|
||||
volume {
|
||||
name = "dispatcher"
|
||||
config_map {
|
||||
name = kubernetes_config_map.dispatcher.metadata[0].name
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
lifecycle {
|
||||
ignore_changes = [
|
||||
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
|
||||
# Kyverno's inject-keel-annotations stamps pollSchedule/trigger alongside
|
||||
# the policy; we own keel.sh/policy=never above, but ignore these two so
|
||||
# they don't perpetually drift the plan.
|
||||
metadata[0].annotations["keel.sh/pollSchedule"],
|
||||
metadata[0].annotations["keel.sh/trigger"],
|
||||
]
|
||||
}
|
||||
}
|
||||
|
||||
# --- Service ---
|
||||
|
||||
resource "kubernetes_service" "t3_afk" {
|
||||
metadata {
|
||||
name = "t3-afk"
|
||||
namespace = kubernetes_namespace.t3_afk.metadata[0].name
|
||||
labels = local.labels
|
||||
}
|
||||
spec {
|
||||
selector = local.labels
|
||||
# Route to the auto-pair dispatcher sidecar (:8080), which reverse-proxies
|
||||
# to t3 serve (:3773) after injecting the t3_session cookie.
|
||||
port {
|
||||
port = 3773
|
||||
target_port = 8080
|
||||
}
|
||||
type = "ClusterIP"
|
||||
}
|
||||
}
|
||||
|
||||
# --- Ingress ---
|
||||
# The cockpit has no built-in user auth, so Authentik forward-auth is the gate.
|
||||
module "ingress" {
|
||||
source = "../../modules/kubernetes/ingress_factory"
|
||||
auth = "required"
|
||||
dns_type = "proxied"
|
||||
namespace = kubernetes_namespace.t3_afk.metadata[0].name
|
||||
name = "t3-afk"
|
||||
service_name = kubernetes_service.t3_afk.metadata[0].name
|
||||
port = 3773
|
||||
tls_secret_name = var.tls_secret_name
|
||||
}
|
||||
18
stacks/t3-afk/terragrunt.hcl
Normal file
|
|
@ -0,0 +1,18 @@
|
|||
include "root" {
|
||||
path = find_in_parent_folders()
|
||||
}
|
||||
|
||||
dependency "platform" {
|
||||
config_path = "../platform"
|
||||
skip_outputs = true
|
||||
}
|
||||
|
||||
dependency "vault" {
|
||||
config_path = "../vault"
|
||||
skip_outputs = true
|
||||
}
|
||||
|
||||
dependency "external-secrets" {
|
||||
config_path = "../external-secrets"
|
||||
skip_outputs = true
|
||||
}
|
||||
|
|
@ -225,8 +225,11 @@ module "ingress_ro" {
|
|||
# https://forgejo.viktorbarzin.me/viktor/terminal-lobby
|
||||
#
|
||||
# That repo's ./scripts/deploy.sh ships everything to wizard@10.0.10.10
|
||||
# and restarts ttyd / ttyd-ro / tmux-api / clipboard-upload. This stack
|
||||
# only owns the Kubernetes side: Services, Endpoints pointing at
|
||||
# and restarts ttyd / ttyd-ro / tmux-api / clipboard-upload. Deploy is
|
||||
# MANUAL via that script — there is no CI pipeline (the lobby's
|
||||
# .woodpecker.yml was removed under ADR-0002, issue #31; it builds no
|
||||
# image, so it is not part of the GHA->ghcr fleet). This stack only owns
|
||||
# the Kubernetes side: Services, Endpoints pointing at
|
||||
# 10.0.10.10:{7681,7682,7683,7684}, the IngressRoutes, and the Traefik
|
||||
# middlewares that gate everything behind Authentik forward-auth.
|
||||
#
|
||||
|
|
|
|||
|
|
@ -344,6 +344,31 @@ resource "kubernetes_manifest" "middleware_tripit_rate_limit" {
|
|||
depends_on = [helm_release.traefik]
|
||||
}
|
||||
|
||||
# Health-specific rate limit. The redesigned, data-dense SPA loads the shell
|
||||
# (JS chunks + two self-hosted Geist woff2) plus a 5-8 call API burst per page,
|
||||
# and fast tab-to-tab navigation from one client IP blows past the default
|
||||
# 10/50 limiter — 429ing the tail so cards/pages render empty (fifth instance
|
||||
# of the burst pattern, after ha-sofia, ActualBudget, noVNC and tripit). Burst
|
||||
# absorbs a couple of full page loads back-to-back.
|
||||
resource "kubernetes_manifest" "middleware_health_rate_limit" {
|
||||
manifest = {
|
||||
apiVersion = "traefik.io/v1alpha1"
|
||||
kind = "Middleware"
|
||||
metadata = {
|
||||
name = "health-rate-limit"
|
||||
namespace = kubernetes_namespace.traefik.metadata[0].name
|
||||
}
|
||||
spec = {
|
||||
rateLimit = {
|
||||
average = 100
|
||||
burst = 1000
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
depends_on = [helm_release.traefik]
|
||||
}
|
||||
|
||||
# Compress responses to clients at the entrypoint level (outermost).
|
||||
# Applied at websecure entrypoint so all responses get compressed.
|
||||
# Uses includedContentTypes (whitelist) instead of excludedContentTypes:
|
||||
|
|
|
|||
|
|
@ -125,6 +125,11 @@ locals {
|
|||
# (older images crash-loop on the unknown enum) — landed after that
|
||||
# image rolled out, same hold-order as FARE/CALENDAR/RESEARCH above.
|
||||
CITY_IMAGE_PROVIDER = "wikipedia"
|
||||
# Re-applied 2026-06-14: a69847a0 (the commit that added this) was never
|
||||
# terraform-applied — its CI run skipped the tripit stack (changed-stack
|
||||
# diff race), so the env never landed in-cluster and the provider fell back
|
||||
# to the fake 1x1-PNG, leaving every trip/stay cover blank. This touch forces
|
||||
# the tripit stack to re-apply and reconcile the drift.
|
||||
# Tour-guide content pipeline (tripit#24/#25): these three default to `fake`
|
||||
# in tripit's config, which is what shipped dark on 2026-06-08 — prod only
|
||||
# ever showed the placeholder "Sight 1". Real providers: Wikipedia GeoSearch
|
||||
|
|
|
|||
|
|
@ -6,5 +6,5 @@ variable "tls_secret_name" {
|
|||
variable "image_tag" {
|
||||
type = string
|
||||
default = "latest"
|
||||
description = "tuya_bridge image tag pushed to forgejo.viktorbarzin.me/viktor/tuya_bridge. Each Woodpecker run does `kubectl set image` to the 8-char git SHA; this variable is only used on initial create / TF recreate (image is in lifecycle.ignore_changes)."
|
||||
description = "tuya_bridge image tag at ghcr.io/viktorbarzin/tuya_bridge (built by GHA, ADR-0002). The GHA deploy job drives a Woodpecker `kubectl set image` to the 8-char git SHA; this variable is only used on initial create / TF recreate (image is in lifecycle.ignore_changes)."
|
||||
}
|
||||
|
|
|
|||
29
stacks/uptime-kuma/CONTEXT.md
Normal file
|
|
@ -0,0 +1,29 @@
|
|||
# Uptime Kuma — Context
|
||||
|
||||
Glossary for the uptime-kuma monitoring context. Terms only — no implementation
|
||||
detail. Decisions live in `docs/adr/`.
|
||||
|
||||
## Glossary
|
||||
|
||||
**Active check (poll)** — Uptime Kuma actively probes a target on an interval
|
||||
(HTTP / TCP / ping / DB). This is *polling*, not "scraping." Prometheus *scrapes*
|
||||
exporters; Kuma *polls* targets. (Note: Prometheus does **not** scrape Kuma — a
|
||||
separate monitoring lane.)
|
||||
|
||||
**Monitor** — one configured target plus its check definition.
|
||||
|
||||
**Internal monitor** — probes a service on its in-cluster address
|
||||
(`*.svc.cluster.local`). Answers "is the service itself healthy?"
|
||||
|
||||
**`[External]` monitor** — probes a service via its full public path
|
||||
(DNS → Cloudflare → cloudflared tunnel → Traefik). Answers "is the service
|
||||
reachable the way users reach it?" Maintained one-per-externally-reachable-service
|
||||
by deliberate choice (see ADR-0001).
|
||||
|
||||
**Heartbeat** — one recorded check result (up/down + latency), persisted to the
|
||||
datastore.
|
||||
|
||||
**External-access divergence** — the condition where a service is healthy
|
||||
*internally* but its `[External]` path is down — i.e. the shared
|
||||
Cloudflare/tunnel/Traefik path is broken while the service itself is fine.
|
||||
Surfaced by the `ExternalAccessDivergence` alert.
|
||||
|
|
@ -0,0 +1,45 @@
|
|||
# ADR-0001: Uptime Kuma is intentionally lean — sizing & placement
|
||||
|
||||
## Status
|
||||
Accepted (2026-06-13)
|
||||
|
||||
## Context
|
||||
A review was prompted by a suspicion that Kuma was "scraping too much / causing
|
||||
unnecessary traffic," itself triggered by a socket.io login-timeout incident on
|
||||
the monitor-sync CronJobs. Measured state at review time:
|
||||
|
||||
- **227 active monitors**; 209 of them at 300s intervals; **~1 check/sec** aggregate.
|
||||
- Datastore: the **shared `mysql.dbaas`** (MariaDB), **~77 MB**, ~1 heartbeat
|
||||
write/sec, 30-day retention.
|
||||
- **122 `[External]` monitors** (full public path) + ~105 internal.
|
||||
|
||||
The data did **not** support a load problem — Kuma is already lean. The
|
||||
login-timeout incident was a Kuma 2.x socket.io quirk (kuma's single Node event
|
||||
loop briefly stalling), fixed separately by wrapping login in a retry — not a
|
||||
load issue.
|
||||
|
||||
## Decisions
|
||||
1. **Keep Kuma as-is; do not reflexively cut monitors or intervals.** Poll rate
|
||||
(~1/s) and DB footprint (77 MB) are modest.
|
||||
2. **`[External]` monitors stay per-service** (one per externally-reachable
|
||||
service), **not** a small canary set. Rejected cutting to ~6-10 canaries:
|
||||
although the Cloudflare → tunnel → Traefik path is shared infra that fails as a
|
||||
unit, per-service external probes also catch *single-service* external
|
||||
misconfig (one service's DNS / auth carve-out / route), which canaries miss.
|
||||
The ~35k Cloudflare requests/day this generates is accepted for that coverage.
|
||||
3. **Datastore stays on the shared `mysql.dbaas`.** Rejected moving to
|
||||
self-contained SQLite or a dedicated DB. The coupling — Kuma depends on the
|
||||
single-instance MySQL it also helps monitor, including during that MySQL's
|
||||
8.4.9 wipe-maintenance (bead code-963q) — is acknowledged but accepted as
|
||||
low-impact for now.
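
As a rough cross-check of the figures above, the sketch below recomputes the poll rate and the daily Cloudflare request count. It assumes every `[External]` monitor polls at the 300s interval (the ADR does not state this explicitly), and the helper names are illustrative, not part of any Kuma tooling.

```python
SECONDS_PER_DAY = 86_400


def checks_per_second(monitor_count: int, interval_s: int) -> float:
    """Average check rate produced by monitors sharing one interval."""
    return monitor_count / interval_s


def requests_per_day(monitor_count: int, interval_s: int) -> int:
    """Requests per day if each monitor fires once per interval."""
    return monitor_count * SECONDS_PER_DAY // interval_s


# Figures taken from the Context section above.
slow_monitors = 209        # monitors at 300s intervals
external_monitors = 122    # [External] monitors (assumed to poll every 300s)
interval_s = 300

print(f"300s monitors: {checks_per_second(slow_monitors, interval_s):.2f} checks/sec")
print(f"external probes: {requests_per_day(external_monitors, interval_s)} requests/day")
```

With these inputs the external probes come to 35,136 requests/day, consistent with the ~35k figure. The 209 monitors at 300s contribute about 0.7 checks/sec on their own, so the remaining 18 monitors would account for the rest of the ~1 check/sec aggregate.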
|
||||
|
||||
## Consequences
|
||||
- All three decisions are **cheap to reverse**; revisit if measured load on
|
||||
`mysql.dbaas` or Cloudflare ever becomes a real (not gut-feel) problem. This
|
||||
ADR exists mainly so that review isn't re-run from scratch.
|
||||
- **Known gap:** the *internal* monitor-sync creates/updates monitors but does
|
||||
**not** prune orphans (the external sync does). Internal monitors for deleted
|
||||
services linger and need periodic manual cleanup — e.g. the stale
|
||||
"Goldilocks (VPA)" monitor (target removed with VPA on 2026-06-12) was deleted
|
||||
by hand on 2026-06-13. A *scoped* internal-prune (only deleting monitors the
|
||||
sync owns, never hand-made ones) is a possible future improvement.
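
A scoped internal prune could look roughly like the sketch below. It is not the sync job's code: the `owner` field, the `SYNC_OWNER` value, and the plain-dict monitor shape are hypothetical stand-ins for however ownership would actually be recorded (a tag, a name prefix, a description marker).

```python
from typing import Iterable, List, Set

# Hypothetical ownership marker; the real sync would need its own convention.
SYNC_OWNER = "internal-sync"


def select_orphans(monitors: Iterable[dict], desired_names: Set[str]) -> List[dict]:
    """Return monitors the sync owns whose target is no longer desired.

    Hand-made monitors (any other owner, or none) are never selected,
    even when their name is missing from the desired set.
    """
    return [
        m
        for m in monitors
        if m.get("owner") == SYNC_OWNER and m["name"] not in desired_names
    ]


# Illustrative data only.
current = [
    {"name": "nextcloud", "owner": SYNC_OWNER},
    {"name": "Goldilocks (VPA)", "owner": SYNC_OWNER},
    {"name": "hand-made check", "owner": None},
]
desired = {"nextcloud"}

for orphan in select_orphans(current, desired):
    print(f"would delete: {orphan['name']}")
```

Keeping the selection as a pure function separates the "what to delete" decision from the API calls, so it can be dry-run and reviewed before anything is removed.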
|
||||
|
|
@ -503,8 +503,27 @@ except (urllib.error.URLError, OSError, KeyError, ValueError) as e:
|
|||
|
||||
print(f"Loaded {len(targets)} external monitor targets (source={source})")
|
||||
|
||||
api = UptimeKumaApi(UPTIME_KUMA_URL, timeout=120, wait_events=0.2)
|
||||
api.login("admin", UPTIME_KUMA_PASS)
|
||||
api = None
for _login_try in range(1, 6):
    try:
        api = UptimeKumaApi(UPTIME_KUMA_URL, timeout=120, wait_events=0.2)
        api.login("admin", UPTIME_KUMA_PASS)
        break
    except Exception as _login_err:
        # kuma 2.x's single Node event loop intermittently stalls under its
        # ~300 monitors, so the socket.io login handshake times out. Retry a
        # few times across a ~60s window to ride out the stall instead of
        # failing the whole sync job (which fired JobFailed -> Slack noise).
        print(f"WARN: Kuma login attempt {_login_try}/5 failed: {_login_err!r}")
        if api is not None:
            try:
                api.disconnect()
            except Exception:
                pass
        api = None
        if _login_try == 5:
            raise
        time.sleep(15)
|
||||
|
||||
monitors = api.get_monitors()
|
||||
existing_external = {}
|
||||
|
|
@ -818,8 +837,27 @@ UPTIME_KUMA_PASS = os.environ["UPTIME_KUMA_PASSWORD"]
|
|||
with open("/config/targets.json") as f:
|
||||
targets = json.load(f)
|
||||
|
||||
api = UptimeKumaApi(UPTIME_KUMA_URL, timeout=120, wait_events=0.2)
|
||||
api.login("admin", UPTIME_KUMA_PASS)
|
||||
api = None
for _login_try in range(1, 6):
    try:
        api = UptimeKumaApi(UPTIME_KUMA_URL, timeout=120, wait_events=0.2)
        api.login("admin", UPTIME_KUMA_PASS)
        break
    except Exception as _login_err:
        # kuma 2.x's single Node event loop intermittently stalls under its
        # ~300 monitors, so the socket.io login handshake times out. Retry a
        # few times across a ~60s window to ride out the stall instead of
        # failing the whole sync job (which fired JobFailed -> Slack noise).
        print(f"WARN: Kuma login attempt {_login_try}/5 failed: {_login_err!r}")
        if api is not None:
            try:
                api.disconnect()
            except Exception:
                pass
        api = None
        if _login_try == 5:
            raise
        time.sleep(15)
|
||||
|
||||
existing = {m["name"]: m for m in api.get_monitors()}
|
||||
|
||||
|
|
|
|||