Commit graph

17 commits

Author SHA1 Message Date
Viktor Barzin
4af5eff043 docs(multi-tenancy): note the on-demand web restore button
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The tmux-persist paragraph only described the boot-time restore. Document the
new manual path — the web terminal's "Restore sessions" button (tmux-api
POST /restore -> tmux-restore-user wrapper -> `tmux-persist restore <user>`) —
and why it exists: an OOM that kills a user's tmux server WITHOUT a reboot
never triggers the boot-only restore service, which is the common case under
multi-user memory pressure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 21:22:41 +00:00
Viktor Barzin
2825cb1703 workstation: per-user code_layout — workspace puts project repos under ~/code (ancamilea + tripit)
Viktor asked to restructure Anca's setup: her ~/code WAS the infra clone
itself; he wants ~/code to be the directory where all her project repos
(tripit etc.) live side by side, with infra moved to a subdirectory.

- roster.yaml gains per-user 'code_layout: single|workspace' + 'repos',
  validated + derived by roster_engine.py (12 new tests, 40 total).
- t3-provision-users reconcile: auto-migrates a single-layout ~/code to
  ~/code/infra (running processes follow the moved inode), hoists nested
  project clones to the workspace root, clones roster repos from Forgejo
  AS the user (their PAT makes private repos work), and wires the
  documented forgejo remote + forgejo/master upstream into clones that
  predate that contract.
- Fixed a latent TSV bug: empty jq @tsv fields collapse under tab-IFS
  read, shifting later fields left (groups was only safe by being the
  last field) — emit '-' sentinels instead.
- start-claude.sh session freshen is layout-aware (freshens each repo
  under ~/code for workspace users).
- managed claudeMd + AGENTS.md non-admin recipe + multi-tenancy.md
  updated in the same change.

Applied live: ancamilea = workspace (infra at ~/code/infra, her existing
tripit clone hoisted to ~/code/tripit, master upstream switched to
forgejo/master); emo stays single layout, untouched. [ci skip]
2026-06-10 18:05:31 +00:00
Viktor Barzin
e7fbf986fb workstation: rename tmux persistence out of the t3 namespace [ci skip]
Viktor's correction: this feature is about the tmux web-terminal
sessions, not t3 — t3 auto-saves its own threads (~/.t3 state +
daily t3-backup-state). Renamed t3-tmux-sessions -> tmux-persist
(units tmux-persist-save.timer / tmux-persist-restore.service, state
/var/lib/tmux-persist), header rescoped to say exactly that. Same
mechanism, correct taxonomy. Old units removed, state migrated,
re-verified live (5 emo + 3 wizard sessions snapshotted).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:42:52 +00:00
Viktor Barzin
2e4f48f3fc workstation: tmux sessions survive devvm reboots (save timer + boot restore)
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Viktor: emo's open web-terminal sessions must persist across reboots.
Claude conversations were already durable on disk; the volatile part
was the tmux wiring (which named session runs which conversation).

t3-tmux-sessions save (5-min timer) snapshots every roster user's
sessions to /var/lib/t3-tmux-state/<user>.tsv — conversation uuid
taken from argv --resume (self-sustaining once restored) or the
newest transcript in the cwd-slug project dir created after process
start (fresh launcher sessions; claude does NOT hold its transcript
fd open, so fd-sniffing was a dead end). t3-tmux-sessions restore
(boot oneshot, also safe after partial loss) recreates missing
sessions with claude --resume <uuid>. Reconciler self-heals both
units' enablement.

Verified live: emo's 5 sessions snapshotted with correct uuids;
killed R730-cooling -> restore brought it back resuming the same
conversation (context meter identical); other sessions untouched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:39:32 +00:00
Viktor Barzin
35c89fa90c workstation: managed Claude config self-deploys from the repo [ci skip]
Viktor's claudeMd edits must keep reaching every user now that emo is
out of the shared tree. Two reconciler additions:
- sync_managed_config: installs scripts/workstation/managed-settings.json
  to /etc/claude-code whenever the repo copy changes — editing the
  org claudeMd is now edit + commit, no manual install step
- refresh_codex_mirror: regenerates each user's ~/.codex/AGENTS.md
  (static mirror of the claudeMd; header-guarded so user-customized
  files are never clobbered)

Verified live: corrupted emo's mirror -> reconcile restored it;
wizard's stale mirror refreshed; in-sync managed config no-ops.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 16:03:24 +00:00
Viktor Barzin
a49d1eadf6 workstation: emo direct master push — allow-then-audit [ci skip]
Viktor: emo may make any change; what matters is tracking what changed
and why. ebarzin added to master push+merge whitelists (force-push
stays disabled — append-only history). Tracking enforced three ways:
- agent instructions (managed claudeMd + AGENTS.md): commit body MUST
  carry the user's plain-language intent; commits land on master
  directly; [ci skip] forbidden for non-admins
- new notify-nonadmin-push step in .woodpecker/default.yml: Slack
  message for every non-admin master push (admin pushes silent)
- PR flow remains the fallback for non-whitelisted users

Accepted consequence (informed): emo's pushes auto-apply changed
stacks via CI. Offboard runbook gains whitelist-removal step.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:53:43 +00:00
Viktor Barzin
2e5af5dc0e workstation: keep non-admin infra clones fresh (hourly + at launch) [ci skip]
Non-admins (emo) need current master without manual pulls. Two layers:
- t3-provision-users reconcile gains refresh_locked_clone: fetch all
  remotes + ff-only master, guarded (on master, clean tree, upstream
  set); dirty/diverged clones are left alone with a WARN.
- start-claude.sh freshens ~/code at session launch, 15s-capped so an
  offline remote never delays the session.

Verified live on emo's clone: stale clone ff'd to tip by the
reconciler; launcher snippet ff's when clean and refuses while a
dirty file exists. Deployed to /usr/local/bin/t3-provision-users,
/etc/skel/start-claude.sh, and emo's launcher.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 09:41:38 +00:00
Viktor Barzin
5d9417fbaa workstation: emo contribute access + Phase-5 cutover done; gate master (push=apply) [ci skip]
ADR-0004's premise was wrong: pushing master fires the Woodpecker apply
pipeline (require_approval=forks only), so master pushes ARE deploys.
Added Forgejo branch protection on master (push/merge whitelist=viktor,
deploy keys allowed); non-admins contribute via branches + PRs.

emo (ebarzin): write collaborator on viktor/infra, PAT in
~/.git-credentials, forgejo remote + upstream in his locked clone.
Phase-5 finished: code-shared removed; ~/.claude symlinks kept (they
ARE the skel shared-base mechanism — plan step 4c obsolete).
Offboard runbook: revoke PAT + collaborator + group steps added.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 09:30:41 +00:00
Viktor Barzin
fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00
Viktor Barzin
6d224861c4 stem95su: scheduled Drive->site sync CronJob (every 10m)
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.

Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:42:26 +00:00
Viktor Barzin
37626cb89b workstation: docs — mark RBAC + Authentik gate applied [ci skip]
multi-tenancy.md + service-catalog.md status: per-user OIDC kubeconfig, oidc-power-user-readonly ClusterRole, emo k8s_users entry, and the Authentik T3 Users edge gate are now applied + verified. Remaining: emo cutover (Phase 5, held), offboarding apply-side (Phase 7), per-user MCP injection, roster-reconciled group membership.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 17:51:44 +00:00
Viktor Barzin
c611ecf84d workstation: docs — multi-tenancy Workstation section + offboard runbook + service-catalog fix [ci skip]
multi-tenancy.md: new DevVM Workstation section (roster SSoT, tiers, config inheritance, locked clone, built-vs-gated status). service-catalog.md t3code row: corrected the stale 'source of truth = /etc/ttyd-user-map' (now roster.yaml; the map/dispatch are GENERATED). offboard-user.md: written (was a referenced-but-missing dead link) — staged reversible-cut-then-gated-destructive for both cluster + workstation surfaces.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 14:27:17 +00:00
Viktor Barzin
8f13fdeaf7 docs: dashboard SA cluster-read tightened to namespace-list + nodes only [ci skip]
Reflect the dashboard-nav-readonly ClusterRole: namespace-owners can list
namespaces/nodes (for dashboard nav) but not read other tenants' resources.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:11 +00:00
Viktor Barzin
c4bd64f88a docs: dashboard now auto-injects per-user SA token (no token-paste)
Update authentication.md, multi-tenancy.md, service-catalog, add-user skill to
reflect the token-injector (X-authentik-username -> SA token -> Bearer). Note the
extra k8s-dashboard apply needed when onboarding a namespace-owner (injector map
regen). [ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:10 +00:00
Viktor Barzin
8e44ccaa65 docs: dashboard access is forward-auth + token-paste (OIDC SSO blocked)
Correct the docs I'd written for the (reverted) oauth2-proxy SSO. Reality:
apiserver OIDC rejects all Authentik tokens (design §12), so the dashboard
uses forward-auth (admits kubernetes-* groups) + per-namespace SA token-paste.
Updates authentication.md, multi-tenancy.md, service-catalog, authentik-state,
and add-user skill (onboarding now documents the dashboard token). oauth2-proxy
+ k8s-dashboard OIDC app noted as idle. [ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:10 +00:00
Viktor Barzin
ad3432d685 docs(k8s-dashboard): dashboard SSO as-built (Option B multi-issuer apiserver)
Update authentication.md (structured multi-issuer AuthenticationConfiguration
+ dashboard SSO flow), multi-tenancy.md (web dashboard access), authentik-state
(new k8s-dashboard app + gheorghe groups), service-catalog (dashboard auth),
and the k8s-version-upgrade runbook (kubeadm wipes --authentication-config →
re-apply rbac post-upgrade). Design/plan addenda record the issuer-constraint
pivot from the original dual-aud approach. [ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-05 09:19:09 +00:00
Viktor Barzin
5a42643176 add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00