Commit graph

4168 commits

Author SHA1 Message Date
Viktor Barzin
d9ea7812f5 nfs-mirror: exclude /vzdump/ — it was reaping the new VM-image backups nightly
nfs-mirror does `rsync -rlt --delete /srv/nfs/ -> /mnt/backup/`; any /mnt/backup
dir with no /srv/nfs counterpart is an orphan and gets --delete'd. vzdump-vms
(added yesterday) writes /mnt/backup/vzdump/, which wasn't excluded — so the
02:00 nfs-mirror run silently deleted both successful 40G devvm images
(verified: dir gone, 40G freed, despite status=0 success logs). Add
--exclude='/vzdump/' alongside the existing pvc-data/pfsense/pve-config/
sqlite-backup excludes that exist for exactly this reason. TDD-proven with an
isolated rsync --delete -n -v. backup-dr.md notes the dependency.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 09:04:57 +00:00
Viktor Barzin
2b8c0def30 dns: pfSense forward-zone for viktorbarzin.me, nodes fully stock [ci skip]
Round 3 of the forgejo-pull hairpin fix (per Viktor: no per-node
customization — split-brain lives in the DNS infra):

- pfSense Unbound domain override viktorbarzin.me -> Technitium
  10.0.20.201 (applied via php write_config, backup on-box). Every
  Unbound client on every VLAN now gets the internal split-horizon
  answers (live Traefik IP via apex CNAME) with zero per-host config.
- CoreDNS carve-out (TF, applied): dedicated viktorbarzin.me:53 block —
  forgejo pinned to Traefik ClusterIP via data source (pods cannot reach
  the ETP=Local LB IP pfSense now returns), all other .me names kept on
  public resolvers (pods' pre-existing behavior). Replaces the .:53
  forgejo rewrite.
- Removed the same-day resolved routing-domain drop-ins from all 7 nodes;
  node5/6 link DNS repointed Technitium -> pfSense (netplan + qm 205/206)
  for fleet parity; cloud-init no longer writes any DNS drop-ins.
- Docs: dns.md, pfsense-unbound runbook (override + rollback), registry
  bullet, post-mortem final-architecture addendum.

Verified: nodes resolve forgejo -> .203 via pfSense, crictl pull OK,
pods resolve forgejo -> ClusterIP / others -> public, mail record works,
.lan zone unaffected.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 08:32:34 +00:00
Viktor Barzin
1ee1bf0817 forgejo pulls: route *.viktorbarzin.me to Technitium, drop /etc/hosts pins [ci skip]
Supersedes this morning's per-node /etc/hosts pin (no hardcoded service
IPs on nodes, per Viktor). Technitium's split-horizon zone already
resolves forgejo.viktorbarzin.me -> CNAME apex -> live Traefik LB IP
(ingress-dns-sync auto-CNAMEs every ingress host; apex drift probe
alerts) -- the nodes just never queried it. Rolled the devvm's
systemd-resolved routing-domain pattern (~viktorbarzin.me ->
10.0.20.201) to all 7 nodes, removed the pins, verified getent +
crictl pull via pure DNS.

Also demoted node5/6's cloud-init global-dns.conf (DNS=8.8.8.8 1.1.1.1)
to FallbackDNS-only: public servers in the global set race the routing
domain. Its justification ("Technitium NXDOMAINs forgejo") was obsolete
-- exactly the stale comment that pointed new nodes at the hairpin.

hosts.toml mirror kept but documented as vestigial (Traefik 404s
bare-IP requests; registry auth realm is an absolute URL).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 07:56:31 +00:00
Viktor Barzin
b6976ce014 forgejo pulls: pin registry name to internal Traefik in node /etc/hosts [ci skip]
tuya-bridge was down 7.5h (ImagePullBackOff on k8s-node3): fresh kubelet
pulls of forgejo.viktorbarzin.me images depended on the intermittently
broken public-IP hairpin. The containerd hosts.toml mirror cannot keep
pulls internal on its own — Traefik 404s its bare-IP requests (no
Host/SNI match) and the registry Bearer realm is an absolute public URL
fetched outside the mirror. Third incident of this class (buildkit
06-04, tripit/devvm 06-09).

Fix: /etc/hosts pin 10.0.20.203 forgejo.viktorbarzin.me on every node —
covers resolve + token + blob legs with correct SNI and valid cert.
Applied live to all 7 nodes; persisted in the cloud-init bootstrap and
the existing-node rollout script. Docs updated (registry bullet, dns.md
hairpin scope + stale .200 literals, runbook) + post-mortem.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 07:15:24 +00:00
Viktor Barzin
eb8695743b workstation: fix setup-devvm.sh provisioner correctness (claude detect, kubelogin pin, codex auth, t3-serve dir)
- claude-code: detect via `npm ls -g` not `command -v claude` — the admin's
  personal ~/.local/bin/claude shadowed the PATH check, so the system-wide
  install never ran (/usr/lib/node_modules/@anthropic-ai empty, no /usr/bin/claude;
  fresh non-admins had no claude). Found during the devvm reproducibility audit.
- kubelogin: pin v1.36.2 instead of releases/latest/download, so two fresh boxes
  built weeks apart are byte-identical.
- /etc/t3-serve: mkdir before the token writes (install -m doesn't create the
  parent — section 8 would fail on a fresh box).
- codex shared auth: stage /opt/codex-shared/auth.json from Vault
  secret/workstation.codex_shared_auth_json (key already existed but nothing
  consumed it — was a manual step lost on rebuild), mirroring the Claude token.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:54 +00:00
Viktor Barzin
8886ac7763 backup: fix vzdump-vms exit code — EXIT-trap && short-circuit falsely failed OK runs
First live run produced a valid 40G dump and logged status=0, but the service
exited 1/FAILURE: cleanup() used `[ -n "$KILLED" ] && push_metrics 2 0`, and a
bash EXIT trap whose LAST command returns non-zero overrides the script's
`exit 0`. With KILLED empty the && short-circuits -> returns 1 -> a successful
backup is marked failed (would trip a vzdump staleness/failure alert). Switch to
daily-backup's `if…fi` idiom (returns 0 when not killed). Bug reproduced + fix
verified locally; redeployed to PVE + reset-failed.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:54 +00:00
Viktor Barzin
7330cb6a0b backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap
The hand-managed Linux VMs (not in Terraform) were never imaged: the
PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost
devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has
no remote).

vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of
VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the
monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent
enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd.
Pushgateway job vzdump-backup.

Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image
layer + protection matrix), infra CLAUDE.md, AGENTS.md.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:54 +00:00
Viktor Barzin
3e7093947d t3: bump pin 0.0.24 -> 0.0.26 (fable-5) [ci skip]
Completes the 0.0.26 adoption prepared in fcb84ce0 (version-agnostic
dispatch browser-session/bootstrap fallback + Gate-2 real pairing
health-check + per-user state.sqlite backup). 0.0.26 verified
end-to-end on the devvm: emo + ancamilea auto-pair via t3-dispatch
(302 + Set-Cookie t3_session) after migrating state.sqlite 30->32;
pre-cutover backups in /var/backups/t3-state. Brings claude-fable-5
into the t3 model picker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
dacd9d2d8a t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip]
Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a
future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):

- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
  /api/auth/bootstrap on 404 — one binary pairs across both versions and any
  rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
  before, green after). Built, deployed, verified live on 0.0.24 (all three
  users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie
  handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad
  build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
  state.sqlite (was the only copy, unbacked) -> the one-way forward schema
  migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
baac46415f t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip]
The t3-autoupdate timer (re-enabled by the provisioner's step 5b with
`--now`, which fires the missed daily job immediately on a Persistent
timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema
migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions
role->scopes, +proof_key_thumbprint) AND changed the bootstrap API,
breaking t3-mint/pairing for ALL devvm users (pair prompt, no session).

- t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a
  nightly tracker -- re-asserts the pin (a no-op when correct).
- t3-provision-users.sh step 5b: drop `--now` (it triggered the
  immediate missed-job run that pulled the bad build).
- setup-devvm.sh: install pinned t3@0.0.24 at machine setup.
- unit Descriptions + service-catalog reflect the pin.
- post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md.

Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled
the (now-pinned) enforcer, reset the 2 new users' disposable DBs,
surgically reverted wizard's auth tables to level-30 (96 threads + live
session preserved). All users verified 302 + t3_session.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
41c11216da t3-dispatch: re-pair on present-but-invalid t3_session cookie
The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09
auth-schema rollback wiped all server-side sessions, browsers kept dead
30-day t3_session cookies; the dispatcher proxied them straight through
and t3 rendered its pair page ("all users must pair again").

Now a present cookie on a top-level document navigation is validated via
the instance's /api/auth/session and re-paired on authenticated:false.
Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html)
so XHR/asset/WebSocket sub-requests are never answered with a 302; fails
open (proxy through) on any validation error. Unit + handler tests added.

[ci skip]

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
e0ab621cb2 workstation: fix new-user .env clobber — env_set preserves CLAUDE_CODE_OAUTH_TOKEN
The port-write used '>' (overwrite), wiping the token injected earlier in the same run for a NEW user (existing users like anca survived only because their .env already had the T3_PORT line). New env_set() does update-or-append per key, preserving others. Verified end-to-end: throwaway t3probe provisioned from scratch -> .env has both T3_PORT + CLAUDE_CODE_OAUTH_TOKEN -> claude -p AUTHOK. So all new non-admins now authenticate automatically. NOT pushed (shared-tree divergence hold).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
39e35ca8c9 workstation: share admin Claude subscription with non-admins (CLAUDE_CODE_OAUTH_TOKEN)
Non-admins without their own ~/.claude login get the shared long-lived sk-ant-oat01 token injected into their t3-serve env, so their agent authenticates against the admin's subscription. setup-devvm.sh stages it from Vault secret/workstation.claude_oauth_token (root-readable); the provisioner's install_user_claude_token injects per-user, if-absent (never clobbers emo's own login). Live-fixed anca (verified AUTHOK); this codifies it for reproducibility + future users. NOT pushed (shared-tree divergence hold).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
1edccedb1f workstation: v2 membership implementation plan [ci skip]
8 tasks: engine derive_os_user + roster_from_members (TDD); read-only Authentik token (TF); setup-devvm.sh stages it; provisioner sources T3 Users members from the Authentik API (replaces roster.yaml); Authentik-managed membership + legacy os_user attributes; retire roster.yaml; e2e add/remove smoke. Pairs with the 2026-06-09 design doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
87702bdce8 feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip]
New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible
/v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on
8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/.

Option A off-peak control (no VRAM isolation on the time-sliced T4 — see
post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London
CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00
ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from
gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields
the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at
06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a
skipped/aborted window backfills next time. SA+Role+RoleBinding grant the
CronJobs deployments/scale (nextcloud-watchdog pattern).

Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the
`tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox
keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure
— never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses
the base exclude list (tts untouched there).

tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to
local.app_env (no token — ClusterIP only). No tripit code change.

Image build is documented in stacks/tts/README.md (devnen cu128 target ->
forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline
(large CUDA image + needs the upstream repo). NOT APPLIED — review branch only.
Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the
measured chatterbox-multilingual T4 peak during the first bake.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
edaee13be3 docs(ci-cd): tripit auto-deploy (GHA->Woodpecker 167) + svu semver in GHA [ci skip] 2026-06-09 21:41:53 +00:00
Viktor Barzin
4b44db36da workstation: skel start-claude.sh inherits managed default model (drop hardcoded --model)
The per-user launcher hardcoded --model claude-opus-4-8; an explicit --model flag overrides the managed default in /etc/claude-code/managed-settings.json (claude-fable-5). Dropping it lets emo and all new accounts inherit the org default (per-session /model still works). Deployed to /etc/skel and emo live copy in the same change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
64413c76ce workstation: default Claude model = claude-fable-5 for all devvm users
Adds a model key (claude-fable-5) to the machine-wide managed-settings.json (installed to /etc/claude-code/ by setup-devvm.sh). Sets the default model for every Claude Code session on the devvm (CLI + t3 web) at top settings precedence; per-session /model and explicit --model flags still override. The org claudeMd block is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
93ec0c66fd docs(ci-cd): add off-infra GHA->GHCR build pattern for private Forgejo repos (tripit pilot) [ci skip] 2026-06-09 21:41:53 +00:00
90b8312a29 tripit: build off-infra via GHA -> GHCR (private), pull via scoped ghcr-credentials
Switches tripit image from forgejo.viktorbarzin.me (in-cluster buildkit, sdc load) to ghcr.io/viktorbarzin/tripit built by GitHub Actions (Forgejo push-mirror -> private GitHub -> GHA -> GHCR). Adds a tripit-ns-scoped GHCR pull secret (github_pat, interim). Verified: deploy on :c8dfb5cb ready, ingest-plans CronJob pulled :latest + Succeeded. [ci skip]
2026-06-09 21:41:53 +00:00
Viktor Barzin
e0452611b5 forgejo: survive CI-build registry-push storms (mem 3Gi + working retention)
Heavy in-cluster builds (e.g. tripit buildkit) were taking Forgejo down via
two vectors. Fixes both, without moving Forgejo off the sdc HDD (code-oflt
deferred):

- Memory 1Gi -> 3Gi (requests=limits). Forgejo was OOMKilled (exit 137) under
  registry-push load; VPA upperBound ~1.5Gi was suppressed by the 1Gi cap it
  kept OOMing against. Size for the push spike.

- Activate registry retention (DRY_RUN false). Verified the delete list
  against all running viktor/* images first: 0 running images affected.
  Pruned 478 -> 161 package versions; PVC was at its 50Gi autoresize ceiling.

- FIX broken retention auth: the cleanup PAT was ci-pusher's, but Forgejo
  scopes container packages per-user, so DELETE on viktor/* returned 403 (the
  dry-run only did GETs, hiding it). Repointed forgejo_cleanup_token to
  viktor's write:package PAT. Retention had never actually worked.

- Protect buildkit *cache* tags from retention (cleanup.sh keep-set) so the
  gentler-builds layer cache survives daily pruning.

[ci skip] — already applied via scripts/tg.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
bc37b16815 backup: fix vzdump-vms exit code — EXIT-trap && short-circuit falsely failed OK runs
First live run produced a valid 40G dump and logged status=0, but the service
exited 1/FAILURE: cleanup() used `[ -n "$KILLED" ] && push_metrics 2 0`, and a
bash EXIT trap whose LAST command returns non-zero overrides the script's
`exit 0`. With KILLED empty the && short-circuits -> returns 1 -> a successful
backup is marked failed (would trip a vzdump staleness/failure alert). Switch to
daily-backup's `if…fi` idiom (returns 0 when not killed). Bug reproduced + fix
verified locally; redeployed to PVE + reset-failed.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:30:19 +00:00
Viktor Barzin
83f418159a backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap
The hand-managed Linux VMs (not in Terraform) were never imaged: the
PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost
devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has
no remote).

vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of
VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the
monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent
enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd.
Pushgateway job vzdump-backup.

Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image
layer + protection matrix), infra CLAUDE.md, AGENTS.md.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:30:19 +00:00
Viktor Barzin
7fc4caefe3 t3: bump pin 0.0.24 -> 0.0.26 (fable-5) [ci skip]
Completes the 0.0.26 adoption prepared in fcb84ce0 (version-agnostic
dispatch browser-session/bootstrap fallback + Gate-2 real pairing
health-check + per-user state.sqlite backup). 0.0.26 verified
end-to-end on the devvm: emo + ancamilea auto-pair via t3-dispatch
(302 + Set-Cookie t3_session) after migrating state.sqlite 30->32;
pre-cutover backups in /var/backups/t3-state. Brings claude-fable-5
into the t3 model picker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
bccaa08d8e t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip]
Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a
future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):

- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
  /api/auth/bootstrap on 404 — one binary pairs across both versions and any
  rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
  before, green after). Built, deployed, verified live on 0.0.24 (all three
  users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie
  handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad
  build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
  state.sqlite (was the only copy, unbacked) -> the one-way forward schema
  migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
5ea238c707 t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip]
The t3-autoupdate timer (re-enabled by the provisioner's step 5b with
`--now`, which fires the missed daily job immediately on a Persistent
timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema
migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions
role->scopes, +proof_key_thumbprint) AND changed the bootstrap API,
breaking t3-mint/pairing for ALL devvm users (pair prompt, no session).

- t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a
  nightly tracker -- re-asserts the pin (a no-op when correct).
- t3-provision-users.sh step 5b: drop `--now` (it triggered the
  immediate missed-job run that pulled the bad build).
- setup-devvm.sh: install pinned t3@0.0.24 at machine setup.
- unit Descriptions + service-catalog reflect the pin.
- post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md.

Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled
the (now-pinned) enforcer, reset the 2 new users' disposable DBs,
surgically reverted wizard's auth tables to level-30 (96 threads + live
session preserved). All users verified 302 + t3_session.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
2125651aaa t3-dispatch: re-pair on present-but-invalid t3_session cookie
The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09
auth-schema rollback wiped all server-side sessions, browsers kept dead
30-day t3_session cookies; the dispatcher proxied them straight through
and t3 rendered its pair page ("all users must pair again").

Now a present cookie on a top-level document navigation is validated via
the instance's /api/auth/session and re-paired on authenticated:false.
Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html)
so XHR/asset/WebSocket sub-requests are never answered with a 302; fails
open (proxy through) on any validation error. Unit + handler tests added.

[ci skip]

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
fad10a8707 workstation: fix new-user .env clobber — env_set preserves CLAUDE_CODE_OAUTH_TOKEN
The port-write used '>' (overwrite), wiping the token injected earlier in the same run for a NEW user (existing users like anca survived only because their .env already had the T3_PORT line). New env_set() does update-or-append per key, preserving others. Verified end-to-end: throwaway t3probe provisioned from scratch -> .env has both T3_PORT + CLAUDE_CODE_OAUTH_TOKEN -> claude -p AUTHOK. So all new non-admins now authenticate automatically. NOT pushed (shared-tree divergence hold).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
eeadf0f85d workstation: share admin Claude subscription with non-admins (CLAUDE_CODE_OAUTH_TOKEN)
Non-admins without their own ~/.claude login get the shared long-lived sk-ant-oat01 token injected into their t3-serve env, so their agent authenticates against the admin's subscription. setup-devvm.sh stages it from Vault secret/workstation.claude_oauth_token (root-readable); the provisioner's install_user_claude_token injects per-user, if-absent (never clobbers emo's own login). Live-fixed anca (verified AUTHOK); this codifies it for reproducibility + future users. NOT pushed (shared-tree divergence hold).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
fbcc330214 workstation: v2 membership implementation plan [ci skip]
8 tasks: engine derive_os_user + roster_from_members (TDD); read-only Authentik token (TF); setup-devvm.sh stages it; provisioner sources T3 Users members from the Authentik API (replaces roster.yaml); Authentik-managed membership + legacy os_user attributes; retire roster.yaml; e2e add/remove smoke. Pairs with the 2026-06-09 design doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
Viktor Barzin
48013a4a92 feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip]
New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible
/v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on
8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/.

Option A off-peak control (no VRAM isolation on the time-sliced T4 — see
post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London
CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00
ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from
gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields
the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at
06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a
skipped/aborted window backfills next time. SA+Role+RoleBinding grant the
CronJobs deployments/scale (nextcloud-watchdog pattern).

Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the
`tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox
keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure
— never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses
the base exclude list (tts untouched there).

tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to
local.app_env (no token — ClusterIP only). No tripit code change.

Image build is documented in stacks/tts/README.md (devnen cu128 target ->
forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline
(large CUDA image + needs the upstream repo). NOT APPLIED — review branch only.
Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the
measured chatterbox-multilingual T4 peak during the first bake.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:21:39 +00:00
b1a6391a4d docs(ci-cd): tripit auto-deploy (GHA->Woodpecker 167) + svu semver in GHA [ci skip] 2026-06-09 19:41:08 +00:00
Viktor Barzin
68a237faf7 workstation: skel start-claude.sh inherits managed default model (drop hardcoded --model)
The per-user launcher hardcoded --model claude-opus-4-8; an explicit --model flag overrides the managed default in /etc/claude-code/managed-settings.json (claude-fable-5). Dropping it lets emo and all new accounts inherit the org default (per-session /model still works). Deployed to /etc/skel and emo live copy in the same change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 19:35:29 +00:00
Viktor Barzin
64f405db36 workstation: default Claude model = claude-fable-5 for all devvm users
Adds a model key (claude-fable-5) to the machine-wide managed-settings.json (installed to /etc/claude-code/ by setup-devvm.sh). Sets the default model for every Claude Code session on the devvm (CLI + t3 web) at top settings precedence; per-session /model and explicit --model flags still override. The org claudeMd block is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 18:31:27 +00:00
8eb0bb244f docs(ci-cd): add off-infra GHA->GHCR build pattern for private Forgejo repos (tripit pilot) [ci skip] 2026-06-09 18:20:54 +00:00
1f23ba6929 tripit: build off-infra via GHA -> GHCR (private), pull via scoped ghcr-credentials
Switches tripit image from forgejo.viktorbarzin.me (in-cluster buildkit, sdc load) to ghcr.io/viktorbarzin/tripit built by GitHub Actions (Forgejo push-mirror -> private GitHub -> GHA -> GHCR). Adds a tripit-ns-scoped GHCR pull secret (github_pat, interim). Verified: deploy on :c8dfb5cb ready, ingest-plans CronJob pulled :latest + Succeeded. [ci skip]
2026-06-09 18:18:13 +00:00
Viktor Barzin
c5bda77731 forgejo: survive CI-build registry-push storms (mem 3Gi + working retention)
Heavy in-cluster builds (e.g. tripit buildkit) were taking Forgejo down via
two vectors. Fixes both, without moving Forgejo off the sdc HDD (code-oflt
deferred):

- Memory 1Gi -> 3Gi (requests=limits). Forgejo was OOMKilled (exit 137) under
  registry-push load; VPA upperBound ~1.5Gi was suppressed by the 1Gi cap it
  kept OOMing against. Size for the push spike.

- Activate registry retention (DRY_RUN false). Verified the delete list
  against all running viktor/* images first: 0 running images affected.
  Pruned 478 -> 161 package versions; PVC was at its 50Gi autoresize ceiling.

- FIX broken retention auth: the cleanup PAT was ci-pusher's, but Forgejo
  scopes container packages per-user, so DELETE on viktor/* returned 403 (the
  dry-run only did GETs, hiding it). Repointed forgejo_cleanup_token to
  viktor's write:package PAT. Retention had never actually worked.

- Protect buildkit *cache* tags from retention (cleanup.sh keep-set) so the
  gentler-builds layer cache survives daily pruning.

[ci skip] — already applied via scripts/tg.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 14:36:17 +00:00
Viktor Barzin
1e6e5c4ee9 t3code: enable t3-autoupdate.timer from the hourly provisioner
The unit files (t3-autoupdate.{timer,service,sh}) were committed but nothing
ever enabled the timer, so it sat `disabled` and every t3-serve@ instance
silently froze on an old t3 build (all users were on v0.0.24 while nightly was
0.0.25-nightly.20260608). Enable it from the hourly reconciler (not the
once-at-provision setup-devvm.sh) so it self-heals if ever disabled again.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 14:09:55 +00:00
Viktor Barzin
fd0f4a0365 fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip]
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:45:33 +00:00
Viktor Barzin
6d224861c4 stem95su: scheduled Drive->site sync CronJob (every 10m)
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.

Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 08:42:26 +00:00
Viktor Barzin
05b50d2b96 workstation: v2 membership design — Authentik-group-driven, email-identified [ci skip]
Supersedes the MEMBERSHIP model of the 2026-06-07 design (roster.yaml SSoT). Key principle: workstation access (T3 Users group membership) is decoupled from cluster authorization (k8s_users + kubernetes-* groups, untouched). A user is defined once in Authentik: email + T3 Users membership + optional os_user attribute. Provisioner reconciles accounts from the Authentik API; roster.yaml retires. v1 foundation (config inheritance, locked clone, kubeconfig, swap, hardening, emo cutover) unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 07:19:25 +00:00
Viktor Barzin
98fe65e345 storage: migrate priority-pass uploads off proxmox-lvm-encrypted to NFS (Phase 1)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Boarding-pass images, no embedded DB. Drops LUKS-at-rest (low-sensitivity, accepted).
21.8M copied + verified on NFS; pod 2/2 on NFS; frees one proxmox-csi slot.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 18:47:07 +00:00
Viktor Barzin
06f5c12476 workstation: setup-devvm.sh hardens the admin's unlocked tree (o-rx, not world-readable)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Codifies the leak fix found during the emo cutover: /home/wizard/code is git-crypt-DECRYPTED in the admin's working tree, but was mode 0775 (o+rx) — so any devvm user (even outside code-shared) could read decrypted secrets by path (verified: emo read certificate.pfx as plaintext DER). setup-devvm.sh now chmod o-rx the admin tree so a rebuild keeps it. Live fix already applied (now drwxrws---).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 18:08:52 +00:00
Viktor Barzin
37626cb89b workstation: docs — mark RBAC + Authentik gate applied [ci skip]
multi-tenancy.md + service-catalog.md status: per-user OIDC kubeconfig, oidc-power-user-readonly ClusterRole, emo k8s_users entry, and the Authentik T3 Users edge gate are now applied + verified. Remaining: emo cutover (Phase 5, held), offboarding apply-side (Phase 7), per-user MCP injection, roster-reconciled group membership.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 17:51:44 +00:00
Viktor Barzin
5c378dd5e3 workstation: gate t3.viktorbarzin.me to the T3 Users group (Phase 4)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
New authentik_group 'T3 Users' (members wizard/emo/ancamilea via data lookups — usernames ARE their emails in this Authentik instance) + a branch in the admin-services-restriction expression policy gating t3.viktorbarzin.me to that group, placed BEFORE the ADMIN_ONLY_HOSTS early-return. Surgical two-step targeted apply (group-with-members first, then the gate) → zero lock-out window. Verified: group has all 3 members, the live policy contains the t3 branch, t3 still 302s to Authentik. Membership is HCL for now (FUTURE: roster-reconciled via the Authentik API).

Note: the authentik stack had 3 unrelated pending drift changes (pgbouncer deployment + 2 tls_secrets) — deliberately NOT applied (targeted apply isolated this change; left for the stack owner).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 17:50:40 +00:00
Viktor Barzin
173b1fc116 workstation: per-user OIDC kubectl — power-user-readonly RBAC + kubeconfig (Phase 2.2)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
New oidc-power-user-readonly ClusterRole (cluster-wide get/list/watch, NO secrets/exec/write); the power-user binding re-pointed to it (the existing read+write+secrets oidc-power-user role is retained but UNBOUND per ADR-0005). Applied to the rbac stack (2 add, 1 change, 0 destroy). emo added to Vault k8s_users (secret/platform) as power-user, email emil.barzin@gmail.com — the OIDC email IS the Authentik username (verified live). Verified via impersonation: emo gets cluster-wide read, NO secrets/write/exec/delete; anca unchanged.

Provisioner: install_user_kubeconfig writes a per-user OIDC kubeconfig (kubelogin/PKCE — the kubernetes Authentik client is public, no secret; server+CA copied from the admin kubeconfig) if-absent. Written for emo + ancamilea (0600). End-to-end login is interactive (browser OIDC); verified config validity + RBAC, not the live browser flow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 17:47:00 +00:00
Viktor Barzin
c611ecf84d workstation: docs — multi-tenancy Workstation section + offboard runbook + service-catalog fix [ci skip]
multi-tenancy.md: new DevVM Workstation section (roster SSoT, tiers, config inheritance, locked clone, built-vs-gated status). service-catalog.md t3code row: corrected the stale 'source of truth = /etc/ttyd-user-map' (now roster.yaml; the map/dispatch are GENERATED). offboard-user.md: written (was a referenced-but-missing dead link) — staged reversible-cut-then-gated-destructive for both cluster + workstation surfaces.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 14:27:17 +00:00
Viktor Barzin
08bf1e0a3a workstation: per-user writable git-crypt-locked infra clone (Phase 3.1)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
install_locked_clone: non-admins get their OWN ~/code = a keyless clone of the public infra repo (the monorepo has no remote, so the locked clone is of infra). filter.git-crypt=cat + --no-checkout ⇒ code/docs plaintext, secret files (*.tfvars/*.tfstate/secrets/**) stay \0GITCRYPT\0 ciphertext. Writable + ungated (push != apply). Skip-if-exists ⇒ never touches emo's existing ~/code symlink (gated cutover handles that). Verified live on ancamilea: secrets ciphertext, code plaintext, commit works, emo untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 14:23:57 +00:00
Viktor Barzin
2c1865eabb workstation: roster-driven provisioner (SSoT reconcile, additive-only)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
t3-provision-users.sh now consumes roster_engine.py: derives accounts + per-tier groups + sticky ports + /etc/ttyd-user-map + dispatch.json from roster.yaml and applies them. ADDITIVE-ONLY for existing users (never strips a group, replaces a home, or re-locks an account) so the hourly timer is always safe. Best-effort tier validation vs live k8s_users: warns on a net-new absent user (emo), aborts only on a real tier conflict, skips when root has no Vault token. DRY_RUN mode for safe testing. Verified on the live host: reproduces dispatch.json content exactly, emo/anca groups + all t3-serve instances unchanged, idempotent, shellcheck-clean; deployed to /usr/local/bin (hourly timer target).

Engine: validate_tiers now returns ValidationIssue(severity) — error=conflict (abort) vs warn=absent (grant pending) — + has_blocking_errors(); 28 pytest cases. setup-devvm.sh redeploys the provisioner for reproducibility.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 14:18:12 +00:00
Viktor Barzin
3feb69e379 workstation: pin verified config-inheritance mechanism in design §4 [ci skip]
Spike GO (claude 2.1.168): managed claudeMd reaches a session; no managed-skills key exists so skills/rules inherit via per-user ~/.claude symlinks to the base (seeded in /etc/skel). Records the settings.json 0664->0600 leak fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 14:09:13 +00:00