Merge forgejo/master: reconcile diverged lineages [ci skip]

Local checkout carried the 2026-06-10 DNS/registry architecture series
(pfSense forward-zone, CoreDNS viktorbarzin.me:53 carve-out, nodes
stock) + vzdump/nfs-mirror/workstation-rebuild commits that never
reached the canonical remote, while forgejo master received the
emo-access series via isolated worktrees. Viktor asked to merge.

Conflict resolutions (newest iteration wins in each file):
- stacks/forgejo/cleanup.tf: LOCAL — dry_run=true (2026-06-10 revert
  after live retention orphaned OCI indexes; remote had 06-09 enable)
- .claude/CLAUDE.md, docs/architecture/backup-dr.md: LOCAL — final
  registry/DNS architecture + implemented vzdump alerts
- scripts/workstation/setup-devvm.sh: LOCAL — pinned-version,
  reproducible-rebuild refactor (kubelogin pin, restructured staging)
- scripts/workstation/managed-settings.json: FORGEJO — the
  allow-then-audit claudeMd (matches /etc deployment byte-for-byte)
- scripts/t3-provision-users.sh: FORGEJO comment; refresh_locked_clone
  intact

[ci skip]: all stack changes in the local lineage were applied live
this morning — CI would re-walk 100+ stacks via the modules/ fallback
for zero state change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-06-10 15:21:50 +00:00
commit 8cfd0e5e5c
10 changed files with 160 additions and 5 deletions

View file

@ -197,6 +197,34 @@ steps:
- Keeps Woodpecker global secrets in sync with Vault
- Runs in `woodpecker` namespace
## Infra repo CI (Woodpecker repo 82 — Forgejo forge)
The infra repo itself runs on Woodpecker via the **Forgejo** forge (repo id 82,
registered 2026-06-08; the GitHub-side repo id 1 also remains registered).
Pushes to `master` fire `.woodpecker/default.yml` (changed-stacks terragrunt
apply) plus the `notify-nonadmin-push` Slack audit step (allow-then-audit
contribution model — see `multi-tenancy.md`). Operational facts (2026-06-10):
- **Webhook URL is the IN-CLUSTER service**: `http://woodpecker-server.woodpecker.svc.cluster.local/api/hook?...`
(PATCHed via the Forgejo API). The Woodpecker-generated default
(`https://ci.viktorbarzin.me/...`) resolves to the non-proxied public A
record from pods → NAT hairpin → intermittent `context deadline exceeded`,
silently dropping push events (found when a push produced no pipeline).
If Woodpecker ever "repairs" the repo it will rewrite the hook back to
`ci.viktorbarzin.me` — re-apply the in-cluster URL (or pin `ci.viktorbarzin.me`
in the CoreDNS pod carve-out alongside forgejo).
- **Repo-scoped secrets must exist on BOTH repos**: pipelines reference
repo-level secrets (`registry_ssh_key`, `pve_ssh_key`, `CLOUDFLARE_TOKEN`,
…). Repo 82 was registered without them and every all-workflow compile
errored with `secret "registry_ssh_key" not found`. Fixed by cloning repo-1
rows to repo 82 in the Woodpecker DB (`insert into secrets … select … where
repo_id=1`). When registering a new forge repo for infra, clone the secret
set too.
- **Empty commits defeat path filters**: a commit with no changed files makes
Woodpecker include ALL workflow files (path conditions can't exclude), so
every repo secret must resolve. Normal commits with real files only compile
the matching workflows.
## Decisions & Rationale
### Why GitHub Actions + Woodpecker?

View file

@ -543,9 +543,17 @@ Separate from the in-cluster namespace-owner model above, the **devvm** (`10.0.1
**Config inheritance (live):** wizard authors the base (his chezmoi-versioned `~/.claude`). Two native layers carry it to every user — the enforced org `claudeMd` in `/etc/claude-code/managed-settings.json` (top precedence, all sessions) and per-user `~/.claude/{skills,rules,…}` **symlinks** to the base (seeded via `/etc/skel`; edits propagate live). Secrets stay per-user at mode 600, never symlinked.
**Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo at `~/code` — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. Changes are ungated (push ≠ apply); the real boundary is apply-time (`scripts/tg apply` needs an admin Vault token + cluster RBAC).
**Infra access:** non-admins get their own **writable, git-crypt-LOCKED** clone of the (public) infra repo at `~/code` — code/docs plaintext, secret files (`*.tfvars`, `secrets/**`) stay ciphertext. The provisioner clones anonymously from the public GitHub mirror; **contribute access is wired per-user on top** (see below). The apply boundary still holds (`scripts/tg apply` needs an admin Vault token + cluster RBAC), but **pushing `master` is NOT inert** — the Forgejo→Woodpecker webhook fires `.woodpecker/default.yml` (`event: push, branch: master`, `require_approval: forks` only), which terragrunt-applies changed stacks. `master` is **branch-protected on Forgejo** (force-push disabled for everyone — history is append-only; push + merge whitelists = `viktor` + explicitly granted users, deploy keys allowed). **Allow-then-audit (Viktor, 2026-06-10):** `ebarzin` (emo) is on the whitelist and pushes straight to `master` — no PR gate. The tracking burden moves to: (a) **commit messages that record what + why** (the agent instructions in AGENTS.md and the managed claudeMd require the body to paraphrase the user's request), (b) the **`notify-nonadmin-push` Slack audit step** in `.woodpecker/default.yml` — every master push by a non-admin author is posted to Slack (admin pushes are not), and (c) non-admins **never use `[ci skip]`** so every change fires the pipeline (and thus the audit feed). Users NOT on the whitelist fall back to `<user>/<topic>` branches + PRs. **Clones stay fresh automatically** (2026-06-10): the hourly `t3-provision-users` reconcile runs `refresh_locked_clone` (fetch all remotes + fast-forward `master`, ONLY when on master with a clean tree and an upstream — dirty trees and local commits are left alone with a WARN), and `start-claude.sh` does the same freshen at session launch (15s-capped so an offline remote never stalls the session).
**Status (2026-06-08):** built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, **per-user OIDC kubeconfig + the `oidc-power-user-readonly` ClusterRole + emo's `k8s_users` entry (applied + impersonation-verified), and the Authentik `T3 Users` edge gate (applied + verified)**. **Remaining (held / future):** the emo cutover to his own locked clone (Phase 5), the offboarding apply-side (Phase 7), per-user MCP/auth injection, and roster-reconciled `T3 Users` membership. See `../runbooks/offboard-user.md` for deprovisioning.
**Contribute access (per non-admin, manual — the anca/tripit PAT precedent):**
1. Add their Forgejo user as a **write** collaborator on `viktor/infra` (`PUT /api/v1/repos/viktor/infra/collaborators/<login>`).
2. Mint a PAT — the admin REST endpoint 404s here, use the in-pod CLI: `kubectl -n forgejo exec deploy/forgejo -- su -s /bin/sh git -c "forgejo admin user generate-access-token --username <login> --token-name devvm-infra-git --scopes 'write:repository'"`.
3. Install it in their `~/.git-credentials` (`https://<login>:<token>@forgejo.viktorbarzin.me`, mode 600) + `git config --global credential.helper store`, set `user.name`/`user.email`.
4. In their clone: `git remote add forgejo https://forgejo.viktorbarzin.me/viktor/infra.git` and `git branch --set-upstream-to=forgejo/master master` (origin stays the anonymous GitHub mirror).
5. (Optional — Viktor's call per user) Grant direct master push: add their login to the `master` branch-protection push + merge whitelists (`PATCH /api/v1/repos/viktor/infra/branch_protections/master`). Done for `ebarzin` 2026-06-10.
6. Verify: branch push succeeds; a `master` push succeeds for whitelisted users and is rejected with `Not allowed to push to protected branch` otherwise.
**Status (2026-06-10):** built + verified on the live host — capacity (8 GiB swap), config inheritance, roster-driven provisioner, per-user locked clone, per-user OIDC kubeconfig + the `oidc-power-user-readonly` ClusterRole + emo's `k8s_users` entry (applied + impersonation-verified), the Authentik `T3 Users` edge gate, **the emo Phase-5 cutover (own clone + launcher repoint + `code-shared` removal, completed 2026-06-10) and emo's contribute access (`ebarzin` write collaborator + PAT + protected `master`)**. Per the live `/etc/skel` design, non-admin `~/.claude/{rules,skills}` symlinks into the admin base are **kept** (they ARE the shared-base delivery mechanism — the plan's step to remove them is obsolete). **Remaining (held / future):** the offboarding apply-side (Phase 7), per-user MCP/auth injection, and roster-reconciled `T3 Users` membership. See `../runbooks/offboard-user.md` for deprovisioning.
## Related

View file

@ -166,6 +166,8 @@ Design principle: **every bit of devvm setup is an idempotent git script** — n
- **ADR-0002 — devvm Linux users, not K8s ephemeral pods.** Re-platforming is overkill at this scale; config-push is easier on one host.
- **ADR-0003 — Config inheritance via native machine-wide layers + per-user override.** Rejected: periodic sync, OverlayFS (no live lowerdir edits), Nix (rebuild not live).
- **ADR-0004 — Infra access via per-user writable git-crypt-locked clones (changes ungated).** Each non-admin gets their own writable, keyless (locked) clone — read + edit + push freely, no PR gate. Safe because infra apply is manual + admin-only (push ≠ apply, id=4355) and the clone can't decrypt secrets. Rejected: the shared read-only mirror (gated changes) and the shared unlocked tree (secret leak + commit entanglement). Trade: repo-local CLAUDE.md updates via pull, not live (global config inheritance stays live via §4).
- **AMENDED 2026-06-10 — the "push ≠ apply" premise was WRONG.** The Forgejo→Woodpecker webhook on `viktor/infra` fires `.woodpecker/default.yml` on `push` to `master` (`require_approval: forks` only), which terragrunt-applies changed stacks — so an ungated master push IS a deploy. Enforcement added instead of dropping the ADR: Forgejo **branch protection on `master`** (push + merge whitelists = `viktor`, deploy keys allowed). Non-admins keep free branch pushes + PRs; only admin merges land on master. "No PR gate" is thereby reversed for non-admins; the rest of the ADR (per-user locked clones) stands. As-built: `../architecture/multi-tenancy.md` → "Contribute access".
- **AMENDED AGAIN 2026-06-10 (later) — allow-then-audit.** Viktor granted emo (`ebarzin`) direct master push ("he's allowed to make any change; what matters is tracking what changed and why"). The PR gate is dropped FOR WHITELISTED USERS; tracking is enforced instead: agent-written commit messages must carry the user's plain-language intent (the WHY), a `notify-nonadmin-push` Slack step in `.woodpecker/default.yml` surfaces every non-admin master push, `[ci skip]` is forbidden for non-admins, and force-push stays disabled (append-only history). Accepted consequence: emo's pushes auto-apply changed stacks via CI. Branch protection + the PR fallback remain for non-whitelisted users.
- **ADR-0005 — Power-user = cluster-wide read-only (no Secrets), via a NEW dedicated ClusterRole.** Re-widens cross-tenant READ for the trusted power-user tier only — but via a NEW `oidc-power-user-readonly` ClusterRole (get/list/watch, NO `secrets`), NOT the existing `oidc-power-user` (which grants read+write+Secrets and is unbound). Bound to the user's OIDC identity (kubelogin) — the apiserver accepts Authentik OIDC for the `kubernetes` audience; the dashboard's SA-token pattern is for the dashboard UI only.
- **ADR-0006 — The roster is the single source of truth for the FULL lifecycle.** `roster.yaml` drives onboard *and* offboard; `/etc/ttyd-user-map`, `dispatch.json`, and Authentik `T3 Users` membership are *derived* from it, and tier is *validated* against `k8s_users` (fail-loud on mismatch). Rejected: hand-maintaining the four membership lists in parallel (guaranteed drift). Offboarding is first-class + staged (reversible cut → cluster revoke → gated `userdel`), not an afterthought.
- **ADR-0007 — Add swap + a capacity budget to the devvm before onboarding active users.** A shared 24 GB / **0-swap** host OOM-kills live sessions under multi-user load (wizard alone runs ~20). Swap + a max-concurrent ceiling are prerequisites, not follow-ups.

View file

@ -171,6 +171,8 @@ users:
### Task 5.1: Cut emo over to his own writable locked clone (opt-in, reversible)
> **DONE 2026-06-10** (staged across 06-08 → 06-10), with two deviations: (1) step 4(c) **skipped deliberately** — the live `/etc/skel` shared base delivers `~/.claude/{rules,skills}` AS symlinks into the admin base, so emo's existing symlinks match the as-built design and were kept; (2) push access was **added** (not in this plan): `ebarzin` = write collaborator on Forgejo `viktor/infra` + PAT in `~/.git-credentials` + `forgejo` remote, with `master` branch-protected (see ADR-0004 amendment — push to master auto-applies via Woodpecker, so it is whitelist-gated to `viktor`). Verified: branch push OK, master push rejected, `code-shared` removed, admin tree unreadable as emo.
**Files:** none (host state; an explicit one-time action — NOT the routine reconcile)
- [ ] **Step 1: Prereqs.** Confirm emo inherits config (Phase 1) + has his scoped kubeconfig (Phase 2). (Phase 3 deliberately SKIPPED emo — his clone is created *here*.)

View file

@ -29,7 +29,26 @@ gated `userdel_archive`, which is **never** auto-applied).
sudo systemctl disable --now t3-serve@<os_user>.service
sudo passwd -l <os_user>
```
4. **Verify:** they can no longer reach `t3.viktorbarzin.me` (302 → Authentik, then
4. **Revoke git + group access** *(manual)*:
```bash
# legacy secret-bearing group, if they were ever in it
sudo gpasswd -d <os_user> code-shared
# drop write access to the infra repo
curl -X DELETE -H "Authorization: token <admin_pat>" \
https://forgejo.viktorbarzin.me/api/v1/repos/viktor/infra/collaborators/<forgejo_login>
# if they were whitelisted for direct master push, remove them from the
# branch-protection whitelists (PATCH with the remaining usernames)
curl -X PATCH -H "Authorization: token <admin_pat>" -H 'Content-Type: application/json' \
https://forgejo.viktorbarzin.me/api/v1/repos/viktor/infra/branch_protections/master \
-d '{"push_whitelist_usernames":["viktor"],"merge_whitelist_usernames":["viktor"]}'
# revoke their devvm git PAT (token name: devvm-infra-git; admin PAT may
# manage other users' tokens — verified 2026-06-10; the CLI has no delete)
curl -X DELETE -H "Authorization: token <admin_pat>" \
https://forgejo.viktorbarzin.me/api/v1/users/<forgejo_login>/tokens/devvm-infra-git
```
Note: their already-running sessions keep dropped groups until cycled — restart
`t3-serve@<os_user>` to enforce immediately.
5. **Verify:** they can no longer reach `t3.viktorbarzin.me` (302 → Authentik, then
denied once removed from the `T3 Users` group — Part C) and cannot log in. Nothing
is deleted; re-adding the roster entry + reconcile fully restores them.