Commit graph

4162 commits

Author SHA1 Message Date
Viktor Barzin
7330cb6a0b backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap
The hand-managed Linux VMs (not in Terraform) were never imaged: the
PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost
devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has
no remote).

vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of
VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the
monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent
enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd.
Pushgateway job vzdump-backup.

Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image
layer + protection matrix), infra CLAUDE.md, AGENTS.md.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:54 +00:00
Viktor Barzin
3e7093947d t3: bump pin 0.0.24 -> 0.0.26 (fable-5) [ci skip]
Completes the 0.0.26 adoption prepared in fcb84ce0 (version-agnostic
dispatch browser-session/bootstrap fallback + Gate-2 real pairing
health-check + per-user state.sqlite backup). 0.0.26 verified
end-to-end on the devvm: emo + ancamilea auto-pair via t3-dispatch
(302 + Set-Cookie t3_session) after migrating state.sqlite 30->32;
pre-cutover backups in /var/backups/t3-state. Brings claude-fable-5
into the t3 model picker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
dacd9d2d8a t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip]
Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a
future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):

- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
  /api/auth/bootstrap on 404 — one binary pairs across both versions and any
  rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
  before, green after). Built, deployed, verified live on 0.0.24 (all three
  users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie
  handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad
  build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
  state.sqlite (was the only copy, unbacked) -> the one-way forward schema
  migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
baac46415f t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip]
The t3-autoupdate timer (re-enabled by the provisioner's step 5b with
`--now`, which fires the missed daily job immediately on a Persistent
timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema
migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions
role->scopes, +proof_key_thumbprint) AND changed the bootstrap API,
breaking t3-mint/pairing for ALL devvm users (pair prompt, no session).

- t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a
  nightly tracker -- re-asserts the pin (a no-op when correct).
- t3-provision-users.sh step 5b: drop `--now` (it triggered the
  immediate missed-job run that pulled the bad build).
- setup-devvm.sh: install pinned t3@0.0.24 at machine setup.
- unit Descriptions + service-catalog reflect the pin.
- post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md.

Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled
the (now-pinned) enforcer, reset the 2 new users' disposable DBs,
surgically reverted wizard's auth tables to level-30 (96 threads + live
session preserved). All users verified 302 + t3_session.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
41c11216da t3-dispatch: re-pair on present-but-invalid t3_session cookie
The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09
auth-schema rollback wiped all server-side sessions, browsers kept dead
30-day t3_session cookies; the dispatcher proxied them straight through
and t3 rendered its pair page ("all users must pair again").

Now a present cookie on a top-level document navigation is validated via
the instance's /api/auth/session and re-paired on authenticated:false.
Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html)
so XHR/asset/WebSocket sub-requests are never answered with a 302; fails
open (proxy through) on any validation error. Unit + handler tests added.

[ci skip]

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
e0ab621cb2 workstation: fix new-user .env clobber — env_set preserves CLAUDE_CODE_OAUTH_TOKEN
The port-write used '>' (overwrite), wiping the token injected earlier in the same run for a NEW user (existing users like anca survived only because their .env already had the T3_PORT line). New env_set() does update-or-append per key, preserving others. Verified end-to-end: throwaway t3probe provisioned from scratch -> .env has both T3_PORT + CLAUDE_CODE_OAUTH_TOKEN -> claude -p AUTHOK. So all new non-admins now authenticate automatically. NOT pushed (shared-tree divergence hold).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
39e35ca8c9 workstation: share admin Claude subscription with non-admins (CLAUDE_CODE_OAUTH_TOKEN)
Non-admins without their own ~/.claude login get the shared long-lived sk-ant-oat01 token injected into their t3-serve env, so their agent authenticates against the admin's subscription. setup-devvm.sh stages it from Vault secret/workstation.claude_oauth_token (root-readable); the provisioner's install_user_claude_token injects per-user, if-absent (never clobbers emo's own login). Live-fixed anca (verified AUTHOK); this codifies it for reproducibility + future users. NOT pushed (shared-tree divergence hold).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
1edccedb1f workstation: v2 membership implementation plan [ci skip]
8 tasks: engine derive_os_user + roster_from_members (TDD); read-only Authentik token (TF); setup-devvm.sh stages it; provisioner sources T3 Users members from the Authentik API (replaces roster.yaml); Authentik-managed membership + legacy os_user attributes; retire roster.yaml; e2e add/remove smoke. Pairs with the 2026-06-09 design doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
87702bdce8 feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip]
New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible
/v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on
8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/.

Option A off-peak control (no VRAM isolation on the time-sliced T4 — see
post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London
CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00
ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from
gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields
the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at
06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a
skipped/aborted window backfills next time. SA+Role+RoleBinding grant the
CronJobs deployments/scale (nextcloud-watchdog pattern).

Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the
`tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox
keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure
— never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses
the base exclude list (tts untouched there).

tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to
local.app_env (no token — ClusterIP only). No tripit code change.

Image build is documented in stacks/tts/README.md (devnen cu128 target ->
forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline
(large CUDA image + needs the upstream repo). NOT APPLIED — review branch only.
Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the
measured chatterbox-multilingual T4 peak during the first bake.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
edaee13be3 docs(ci-cd): tripit auto-deploy (GHA->Woodpecker 167) + svu semver in GHA [ci skip] 2026-06-09 21:41:53 +00:00
Viktor Barzin
4b44db36da workstation: skel start-claude.sh inherits managed default model (drop hardcoded --model)
The per-user launcher hardcoded --model claude-opus-4-8; an explicit --model flag overrides the managed default in /etc/claude-code/managed-settings.json (claude-fable-5). Dropping it lets emo and all new accounts inherit the org default (per-session /model still works). Deployed to /etc/skel and emo live copy in the same change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
64413c76ce workstation: default Claude model = claude-fable-5 for all devvm users
Adds a model key (claude-fable-5) to the machine-wide managed-settings.json (installed to /etc/claude-code/ by setup-devvm.sh). Sets the default model for every Claude Code session on the devvm (CLI + t3 web) at top settings precedence; per-session /model and explicit --model flags still override. The org claudeMd block is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
93ec0c66fd docs(ci-cd): add off-infra GHA->GHCR build pattern for private Forgejo repos (tripit pilot) [ci skip] 2026-06-09 21:41:53 +00:00
90b8312a29 tripit: build off-infra via GHA -> GHCR (private), pull via scoped ghcr-credentials
Switches tripit image from forgejo.viktorbarzin.me (in-cluster buildkit, sdc load) to ghcr.io/viktorbarzin/tripit built by GitHub Actions (Forgejo push-mirror -> private GitHub -> GHA -> GHCR). Adds a tripit-ns-scoped GHCR pull secret (github_pat, interim). Verified: deploy on :c8dfb5cb ready, ingest-plans CronJob pulled :latest + Succeeded. [ci skip]
2026-06-09 21:41:53 +00:00
Viktor Barzin
e0452611b5 forgejo: survive CI-build registry-push storms (mem 3Gi + working retention)
Heavy in-cluster builds (e.g. tripit buildkit) were taking Forgejo down via
two vectors. Fixes both, without moving Forgejo off the sdc HDD (code-oflt
deferred):

- Memory 1Gi -> 3Gi (requests=limits). Forgejo was OOMKilled (exit 137) under
  registry-push load; VPA upperBound ~1.5Gi was suppressed by the 1Gi cap it
  kept OOMing against. Size for the push spike.

- Activate registry retention (DRY_RUN false). Verified the delete list
  against all running viktor/* images first: 0 running images affected.
  Pruned 478 -> 161 package versions; PVC was at its 50Gi autoresize ceiling.

- FIX broken retention auth: the cleanup PAT was ci-pusher's, but Forgejo
  scopes container packages per-user, so DELETE on viktor/* returned 403 (the
  dry-run only did GETs, hiding it). Repointed forgejo_cleanup_token to
  viktor's write:package PAT. Retention had never actually worked.

- Protect buildkit *cache* tags from retention (cleanup.sh keep-set) so the
  gentler-builds layer cache survives daily pruning.

[ci skip] — already applied via scripts/tg.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
bc37b16815 backup: fix vzdump-vms exit code — EXIT-trap && short-circuit falsely failed OK runs
First live run produced a valid 40G dump and logged status=0, but the service
exited 1/FAILURE: cleanup() used `[ -n "$KILLED" ] && push_metrics 2 0`, and a
bash EXIT trap whose LAST command returns non-zero overrides the script's
`exit 0`. With KILLED empty the && short-circuits -> returns 1 -> a successful
backup is marked failed (would trip a vzdump staleness/failure alert). Switch to
daily-backup's `if…fi` idiom (returns 0 when not killed). Bug reproduced + fix
verified locally; redeployed to PVE + reset-failed.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:30:19 +00:00
Viktor Barzin
83f418159a backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap
The hand-managed Linux VMs (not in Terraform) were never imaged: the
PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost
devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has
no remote).

vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of
VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the
monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent
enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd.
Pushgateway job vzdump-backup.

Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image
layer + protection matrix), infra CLAUDE.md, AGENTS.md.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:30:19 +00:00
Viktor Barzin
7fc4caefe3 t3: bump pin 0.0.24 -> 0.0.26 (fable-5) [ci skip]
Completes the 0.0.26 adoption prepared in fcb84ce0 (version-agnostic
dispatch browser-session/bootstrap fallback + Gate-2 real pairing
health-check + per-user state.sqlite backup). 0.0.26 verified
end-to-end on the devvm: emo + ancamilea auto-pair via t3-dispatch
(302 + Set-Cookie t3_session) after migrating state.sqlite 30->32;
pre-cutover backups in /var/backups/t3-state. Brings claude-fable-5
into the t3 model picker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
bccaa08d8e t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip]
Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a
future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):

- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
  /api/auth/bootstrap on 404 — one binary pairs across both versions and any
  rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
  before, green after). Built, deployed, verified live on 0.0.24 (all three
  users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie
  handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad
  build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
  state.sqlite (was the only copy, unbacked) -> the one-way forward schema
  migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
5ea238c707 t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip]
The t3-autoupdate timer (re-enabled by the provisioner's step 5b with
`--now`, which fires the missed daily job immediately on a Persistent
timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema
migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions
role->scopes, +proof_key_thumbprint) AND changed the bootstrap API,
breaking t3-mint/pairing for ALL devvm users (pair prompt, no session).

- t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a
  nightly tracker -- re-asserts the pin (a no-op when correct).
- t3-provision-users.sh step 5b: drop `--now` (it triggered the
  immediate missed-job run that pulled the bad build).
- setup-devvm.sh: install pinned t3@0.0.24 at machine setup.
- unit Descriptions + service-catalog reflect the pin.
- post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md.

Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled
the (now-pinned) enforcer, reset the 2 new users' disposable DBs,
surgically reverted wizard's auth tables to level-30 (96 threads + live
session preserved). All users verified 302 + t3_session.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
2125651aaa t3-dispatch: re-pair on present-but-invalid t3_session cookie
The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09
auth-schema rollback wiped all server-side sessions, browsers kept dead
30-day t3_session cookies; the dispatcher proxied them straight through
and t3 rendered its pair page ("all users must pair again").

Now a present cookie on a top-level document navigation is validated via
the instance's /api/auth/session and re-paired on authenticated:false.
Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html)
so XHR/asset/WebSocket sub-requests are never answered with a 302; fails
open (proxy through) on any validation error. Unit + handler tests added.

[ci skip]

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
fad10a8707 workstation: fix new-user .env clobber — env_set preserves CLAUDE_CODE_OAUTH_TOKEN
The port-write used '>' (overwrite), wiping the token injected earlier in the same run for a NEW user (existing users like anca survived only because their .env already had the T3_PORT line). New env_set() does update-or-append per key, preserving others. Verified end-to-end: throwaway t3probe provisioned from scratch -> .env has both T3_PORT + CLAUDE_CODE_OAUTH_TOKEN -> claude -p AUTHOK. So all new non-admins now authenticate automatically. NOT pushed (shared-tree divergence hold).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
eeadf0f85d workstation: share admin Claude subscription with non-admins (CLAUDE_CODE_OAUTH_TOKEN)
Non-admins without their own ~/.claude login get the shared long-lived sk-ant-oat01 token injected into their t3-serve env, so their agent authenticates against the admin's subscription. setup-devvm.sh stages it from Vault secret/workstation.claude_oauth_token (root-readable); the provisioner's install_user_claude_token injects per-user, if-absent (never clobbers emo's own login). Live-fixed anca (verified AUTHOK); this codifies it for reproducibility + future users. NOT pushed (shared-tree divergence hold).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
fbcc330214 workstation: v2 membership implementation plan [ci skip]
8 tasks: engine derive_os_user + roster_from_members (TDD); read-only Authentik token (TF); setup-devvm.sh stages it; provisioner sources T3 Users members from the Authentik API (replaces roster.yaml); Authentik-managed membership + legacy os_user attributes; retire roster.yaml; e2e add/remove smoke. Pairs with the 2026-06-09 design doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
48013a4a92 feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip]
New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible
/v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on
8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/.

Option A off-peak control (no VRAM isolation on the time-sliced T4 — see
post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London
CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00
ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from
gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields
the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at
06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a
skipped/aborted window backfills next time. SA+Role+RoleBinding grant the
CronJobs deployments/scale (nextcloud-watchdog pattern).

Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the
`tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox
keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure
— never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses
the base exclude list (tts untouched there).

tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to
local.app_env (no token — ClusterIP only). No tripit code change.

Image build is documented in stacks/tts/README.md (devnen cu128 target ->
forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline
(large CUDA image + needs the upstream repo). NOT APPLIED — review branch only.
Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the
measured chatterbox-multilingual T4 peak during the first bake.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
b1a6391a4d docs(ci-cd): tripit auto-deploy (GHA->Woodpecker 167) + svu semver in GHA [ci skip] 2026-06-09 19:41:08 +00:00
Viktor Barzin
68a237faf7 workstation: skel start-claude.sh inherits managed default model (drop hardcoded --model)
The per-user launcher hardcoded --model claude-opus-4-8; an explicit --model flag overrides the managed default in /etc/claude-code/managed-settings.json (claude-fable-5). Dropping it lets emo and all new accounts inherit the org default (per-session /model still works). Deployed to /etc/skel and emo live copy in the same change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 19:35:29 +00:00
Viktor Barzin
64f405db36 workstation: default Claude model = claude-fable-5 for all devvm users
Adds a model key (claude-fable-5) to the machine-wide managed-settings.json (installed to /etc/claude-code/ by setup-devvm.sh). Sets the default model for every Claude Code session on the devvm (CLI + t3 web) at top settings precedence; per-session /model and explicit --model flags still override. The org claudeMd block is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 18:31:27 +00:00
8eb0bb244f docs(ci-cd): add off-infra GHA->GHCR build pattern for private Forgejo repos (tripit pilot) [ci skip] 2026-06-09 18:20:54 +00:00
1f23ba6929 tripit: build off-infra via GHA -> GHCR (private), pull via scoped ghcr-credentials
Switches tripit image from forgejo.viktorbarzin.me (in-cluster buildkit, sdc load) to ghcr.io/viktorbarzin/tripit built by GitHub Actions (Forgejo push-mirror -> private GitHub -> GHA -> GHCR). Adds a tripit-ns-scoped GHCR pull secret (github_pat, interim). Verified: deploy on :c8dfb5cb ready, ingest-plans CronJob pulled :latest + Succeeded. [ci skip]
2026-06-09 18:18:13 +00:00
Viktor Barzin
c5bda77731 forgejo: survive CI-build registry-push storms (mem 3Gi + working retention)
Heavy in-cluster builds (e.g. tripit buildkit) were taking Forgejo down via
two vectors. Fixes both, without moving Forgejo off the sdc HDD (code-oflt
deferred):

- Memory 1Gi -> 3Gi (requests=limits). Forgejo was OOMKilled (exit 137) under
  registry-push load; VPA upperBound ~1.5Gi was suppressed by the 1Gi cap it
  kept OOMing against. Size for the push spike.

- Activate registry retention (DRY_RUN false). Verified the delete list
  against all running viktor/* images first: 0 running images affected.
  Pruned 478 -> 161 package versions; PVC was at its 50Gi autoresize ceiling.

- FIX broken retention auth: the cleanup PAT was ci-pusher's, but Forgejo
  scopes container packages per-user, so DELETE on viktor/* returned 403 (the
  dry-run only did GETs, hiding it). Repointed forgejo_cleanup_token to
  viktor's write:package PAT. Retention had never actually worked.

- Protect buildkit *cache* tags from retention (cleanup.sh keep-set) so the
  gentler-builds layer cache survives daily pruning.

[ci skip] — already applied via scripts/tg.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 14:36:17 +00:00
Viktor Barzin
1e6e5c4ee9 t3code: enable t3-autoupdate.timer from the hourly provisioner
The unit files (t3-autoupdate.{timer,service,sh}) were committed but nothing
ever enabled the timer, so it sat `disabled` and every t3-serve@ instance
silently froze on an old t3 build (all users were on v0.0.24 while nightly was
0.0.25-nightly.20260608). Enable it from the hourly reconciler (not the
once-at-provision setup-devvm.sh) so it self-heals if ever disabled again.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 14:09:55 +00:00
Viktor Barzin
fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00
Viktor Barzin
6d224861c4 stem95su: scheduled Drive->site sync CronJob (every 10m)
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.

Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:42:26 +00:00
Viktor Barzin
05b50d2b96 workstation: v2 membership design — Authentik-group-driven, email-identified [ci skip]
Supersedes the MEMBERSHIP model of the 2026-06-07 design (roster.yaml SSoT). Key principle: workstation access (T3 Users group membership) is decoupled from cluster authorization (k8s_users + kubernetes-* groups, untouched). A user is defined once in Authentik: email + T3 Users membership + optional os_user attribute. Provisioner reconciles accounts from the Authentik API; roster.yaml retires. v1 foundation (config inheritance, locked clone, kubeconfig, swap, hardening, emo cutover) unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 07:19:25 +00:00
Viktor Barzin
98fe65e345 storage: migrate priority-pass uploads off proxmox-lvm-encrypted to NFS (Phase 1)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Boarding-pass images, no embedded DB. Drops LUKS-at-rest (low-sensitivity, accepted).
21.8M copied + verified on NFS; pod 2/2 on NFS; frees one proxmox-csi slot.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 18:47:07 +00:00
Viktor Barzin
06f5c12476 workstation: setup-devvm.sh hardens the admin's unlocked tree (o-rx, not world-readable)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Codifies the leak fix found during the emo cutover: /home/wizard/code is git-crypt-DECRYPTED in the admin's working tree, but was mode 0775 (o+rx) — so any devvm user (even outside code-shared) could read decrypted secrets by path (verified: emo read certificate.pfx as plaintext DER). setup-devvm.sh now chmod o-rx the admin tree so a rebuild keeps it. Live fix already applied (now drwxrws---).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 18:08:52 +00:00
Viktor Barzin
37626cb89b workstation: docs — mark RBAC + Authentik gate applied [ci skip]
multi-tenancy.md + service-catalog.md status: per-user OIDC kubeconfig, oidc-power-user-readonly ClusterRole, emo k8s_users entry, and the Authentik T3 Users edge gate are now applied + verified. Remaining: emo cutover (Phase 5, held), offboarding apply-side (Phase 7), per-user MCP injection, roster-reconciled group membership.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 17:51:44 +00:00
Viktor Barzin
5c378dd5e3 workstation: gate t3.viktorbarzin.me to the T3 Users group (Phase 4)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
New authentik_group 'T3 Users' (members wizard/emo/ancamilea via data lookups — usernames ARE their emails in this Authentik instance) + a branch in the admin-services-restriction expression policy gating t3.viktorbarzin.me to that group, placed BEFORE the ADMIN_ONLY_HOSTS early-return. Surgical two-step targeted apply (group-with-members first, then the gate) → zero lock-out window. Verified: group has all 3 members, the live policy contains the t3 branch, t3 still 302s to Authentik. Membership is HCL for now (FUTURE: roster-reconciled via the Authentik API).

Note: the authentik stack had 3 unrelated pending drift changes (pgbouncer deployment + 2 tls_secrets) — deliberately NOT applied (targeted apply isolated this change; left for the stack owner).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 17:50:40 +00:00
Viktor Barzin
173b1fc116 workstation: per-user OIDC kubectl — power-user-readonly RBAC + kubeconfig (Phase 2.2)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
New oidc-power-user-readonly ClusterRole (cluster-wide get/list/watch, NO secrets/exec/write); the power-user binding re-pointed to it (the existing read+write+secrets oidc-power-user role is retained but UNBOUND per ADR-0005). Applied to the rbac stack (2 add, 1 change, 0 destroy). emo added to Vault k8s_users (secret/platform) as power-user, email emil.barzin@gmail.com — the OIDC email IS the Authentik username (verified live). Verified via impersonation: emo gets cluster-wide read, NO secrets/write/exec/delete; anca unchanged.

Provisioner: install_user_kubeconfig writes a per-user OIDC kubeconfig (kubelogin/PKCE — the kubernetes Authentik client is public, no secret; server+CA copied from the admin kubeconfig) if-absent. Written for emo + ancamilea (0600). End-to-end login is interactive (browser OIDC); verified config validity + RBAC, not the live browser flow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 17:47:00 +00:00
Viktor Barzin
c611ecf84d workstation: docs — multi-tenancy Workstation section + offboard runbook + service-catalog fix [ci skip]
multi-tenancy.md: new DevVM Workstation section (roster SSoT, tiers, config inheritance, locked clone, built-vs-gated status). service-catalog.md t3code row: corrected the stale 'source of truth = /etc/ttyd-user-map' (now roster.yaml; the map/dispatch are GENERATED). offboard-user.md: written (was a referenced-but-missing dead link) — staged reversible-cut-then-gated-destructive for both cluster + workstation surfaces.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 14:27:17 +00:00
Viktor Barzin
08bf1e0a3a workstation: per-user writable git-crypt-locked infra clone (Phase 3.1)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
install_locked_clone: non-admins get their OWN ~/code = a keyless clone of the public infra repo (the monorepo has no remote, so the locked clone is of infra). filter.git-crypt=cat + --no-checkout ⇒ code/docs plaintext, secret files (*.tfvars/*.tfstate/secrets/**) stay \0GITCRYPT\0 ciphertext. Writable + ungated (push != apply). Skip-if-exists ⇒ never touches emo's existing ~/code symlink (gated cutover handles that). Verified live on ancamilea: secrets ciphertext, code plaintext, commit works, emo untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 14:23:57 +00:00
Viktor Barzin
2c1865eabb workstation: roster-driven provisioner (SSoT reconcile, additive-only)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
t3-provision-users.sh now consumes roster_engine.py: derives accounts + per-tier groups + sticky ports + /etc/ttyd-user-map + dispatch.json from roster.yaml and applies them. ADDITIVE-ONLY for existing users (never strips a group, replaces a home, or re-locks an account) so the hourly timer is always safe. Best-effort tier validation vs live k8s_users: warns on a net-new absent user (emo), aborts only on a real tier conflict, skips when root has no Vault token. DRY_RUN mode for safe testing. Verified on the live host: reproduces dispatch.json content exactly, emo/anca groups + all t3-serve instances unchanged, idempotent, shellcheck-clean; deployed to /usr/local/bin (hourly timer target).

Engine: validate_tiers now returns ValidationIssue(severity) — error=conflict (abort) vs warn=absent (grant pending) — + has_blocking_errors(); 28 pytest cases. setup-devvm.sh redeploys the provisioner for reproducibility.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 14:18:12 +00:00
Viktor Barzin
3feb69e379 workstation: pin verified config-inheritance mechanism in design §4 [ci skip]
Spike GO (claude 2.1.168): managed claudeMd reaches a session; no managed-skills key exists so skills/rules inherit via per-user ~/.claude symlinks to the base (seeded in /etc/skel). Records the settings.json 0664->0600 leak fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 14:09:13 +00:00
Viktor Barzin
1757cb59e7 workstation: machine-wide config inheritance (managed claudeMd + setup-devvm.sh + skel)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Spike confirmed (claude 2.1.168): /etc/claude-code/managed-settings.json claudeMd reaches a session (sentinel echoed). Hybrid inheritance = enforced org claudeMd machine-wide (top precedence, non-overridable) + per-user ~/.claude/{skills,rules,...} symlinks to the config base (live, the proven emo pattern) seeded via /etc/skel. setup-devvm.sh is idempotent: apt toolset, node>=18 + claude-code, system-wide kubelogin (NOT the Azure apt pkg), the managed config, and /etc/skel (launcher that cd's $HOME/code, tmux UX, inheritance symlinks). Verified: emo unchanged (groups/symlinks/live sessions intact), emo can read the managed config, idempotent re-run clean.

Security fix (host state): /home/wizard/.claude/settings.json was 0664, exposing MEMORY_API_KEY to all devvm users -> chmod 0600. chezmoi source needs a private_ prefix + the key templated out to persist this (dotfiles-repo follow-up).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 14:07:04 +00:00
Viktor Barzin
55d4b4cf2d workstation: correct devvm RAM (8->24GB) + record 8G swap & capacity budget [ci skip]
devvm is the t3code Workstation host. Added an 8 GiB swapfile (swappiness=10, fstab-persisted) to turn multi-user OOM-kills into graceful paging (was 0 swap, ~1.2 GiB free of 23). Capacity budget: ~4-5G RAM per active user, max ~3-4 concurrent active sessions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 13:48:52 +00:00
Viktor Barzin
3033e2c355 workstation: roster source-of-truth + host package manifest [ci skip]
roster.yaml is the single source of truth for the devvm Workstation lifecycle (os_user -> authentik_user/k8s_user/tier/namespaces); wizard listed as admin so the regenerated ttyd-map/dispatch never drops his instance. packages.txt is the declarative apt toolset (non-apt tools — node/claude-code/kubectl/vault/kubelogin — noted with their real install paths; the apt pkg named 'kubelogin' is the wrong Azure tool).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 13:38:20 +00:00
Viktor Barzin
7ab4c1e1e2 workstation: tested roster derivation + offboarding-diff engine [ci skip]
Pure functional core (PRD ViktorBarzin/infra#9 modules #1 roster engine + #5 offboarding diff) that the bash provisioner will consume as JSON: roster parse/validate, fail-loud tier-vs-k8s_users check, sticky-port + ttyd-map + dispatch derivation, additive-only group reconcile, and the staged offboarding diff (reversible cut vs gated userdel, never auto). 27 pytest cases, ruff-clean; no host I/O in the tested path. Verified to reproduce the live dispatch.json byte-for-byte from the real roster.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 13:38:06 +00:00
Viktor Barzin
6504911a77 matrix: open (tokenless) registration + bot mitigations + #security alert
User-chosen fully-open registration on tuwunel (no CAPTCHA support; browser
challenges break native clients). Bot defense is layered instead:
- Traefik rate-limit Middleware on a path-scoped /register ingress carve-out,
  keyed on request Host (GLOBAL /register cap) not source IP — the host is
  reachable via both Cloudflare-IPv4 (CF-Connecting-IP) and IPv6-direct (HE
  tunnel, no CF header), so a per-source key let IPv6 bots bypass. 10/min,
  burst 20, per replica; CrowdSec is the hard backstop on both paths.
- Loki ruler rule MatrixNewUserRegistered -> lane=security -> existing
  #security Slack receiver (matches "registered on this server", never the
  rejection line). tuwunel's admin bot also posts signups to the admin room.

Dropped the REGISTRATION_TOKEN env (secret/matrix + ESO kept for revert).
Applied via scripts/tg (matrix tier-1 + targeted monitoring configmap), so
[ci skip] to avoid CI full-applying monitoring (unrelated grafana-acl drift).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 13:27:02 +00:00
Viktor Barzin
bb7bcf803b multi-user-workstation: design + phased implementation plan
devvm multi-user Claude Code workstation: role-driven profiles (admin/power-user/namespace-owner) off one git roster (single source of truth, full onboard->offboard lifecycle); config inheritance via Claude's native machine-wide managed layer; per-user writable git-crypt-locked infra clone (ungated, apply-time is the boundary); per-tier OIDC kubectl; per-user secrets/auth (memory isolation deferred); incremental, emo-safe migration; capacity prereqs. Folds in gap-analysis findings verified live 2026-06-08. Designed, not yet implemented.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 12:58:29 +00:00