infra

Author	SHA1	Message	Date
Viktor Barzin	d9ea7812f5	nfs-mirror: exclude /vzdump/ — it was reaping the new VM-image backups nightly nfs-mirror does `rsync -rlt --delete /srv/nfs/ -> /mnt/backup/`; any /mnt/backup dir with no /srv/nfs counterpart is an orphan and gets --delete'd. vzdump-vms (added yesterday) writes /mnt/backup/vzdump/, which wasn't excluded — so the 02:00 nfs-mirror run silently deleted both successful 40G devvm images (verified: dir gone, 40G freed, despite status=0 success logs). Add --exclude='/vzdump/' alongside the existing pvc-data/pfsense/pve-config/ sqlite-backup excludes that exist for exactly this reason. TDD-proven with an isolated rsync --delete -n -v. backup-dr.md notes the dependency. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:04:57 +00:00
Viktor Barzin	2b8c0def30	dns: pfSense forward-zone for viktorbarzin.me, nodes fully stock [ci skip] Round 3 of the forgejo-pull hairpin fix (per Viktor: no per-node customization — split-brain lives in the DNS infra): - pfSense Unbound domain override viktorbarzin.me -> Technitium 10.0.20.201 (applied via php write_config, backup on-box). Every Unbound client on every VLAN now gets the internal split-horizon answers (live Traefik IP via apex CNAME) with zero per-host config. - CoreDNS carve-out (TF, applied): dedicated viktorbarzin.me:53 block — forgejo pinned to Traefik ClusterIP via data source (pods cannot reach the ETP=Local LB IP pfSense now returns), all other .me names kept on public resolvers (pods' pre-existing behavior). Replaces the .:53 forgejo rewrite. - Removed the same-day resolved routing-domain drop-ins from all 7 nodes; node5/6 link DNS repointed Technitium -> pfSense (netplan + qm 205/206) for fleet parity; cloud-init no longer writes any DNS drop-ins. - Docs: dns.md, pfsense-unbound runbook (override + rollback), registry bullet, post-mortem final-architecture addendum. Verified: nodes resolve forgejo -> .203 via pfSense, crictl pull OK, pods resolve forgejo -> ClusterIP / others -> public, mail record works, .lan zone unaffected. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 08:32:34 +00:00
Viktor Barzin	1ee1bf0817	forgejo pulls: route *.viktorbarzin.me to Technitium, drop /etc/hosts pins [ci skip] Supersedes this morning's per-node /etc/hosts pin (no hardcoded service IPs on nodes, per Viktor). Technitium's split-horizon zone already resolves forgejo.viktorbarzin.me -> CNAME apex -> live Traefik LB IP (ingress-dns-sync auto-CNAMEs every ingress host; apex drift probe alerts) -- the nodes just never queried it. Rolled the devvm's systemd-resolved routing-domain pattern (~viktorbarzin.me -> 10.0.20.201) to all 7 nodes, removed the pins, verified getent + crictl pull via pure DNS. Also demoted node5/6's cloud-init global-dns.conf (DNS=8.8.8.8 1.1.1.1) to FallbackDNS-only: public servers in the global set race the routing domain. Its justification ("Technitium NXDOMAINs forgejo") was obsolete -- exactly the stale comment that pointed new nodes at the hairpin. hosts.toml mirror kept but documented as vestigial (Traefik 404s bare-IP requests; registry auth realm is an absolute URL). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 07:56:31 +00:00
Viktor Barzin	b6976ce014	forgejo pulls: pin registry name to internal Traefik in node /etc/hosts [ci skip] tuya-bridge was down 7.5h (ImagePullBackOff on k8s-node3): fresh kubelet pulls of forgejo.viktorbarzin.me images depended on the intermittently broken public-IP hairpin. The containerd hosts.toml mirror cannot keep pulls internal on its own — Traefik 404s its bare-IP requests (no Host/SNI match) and the registry Bearer realm is an absolute public URL fetched outside the mirror. Third incident of this class (buildkit 06-04, tripit/devvm 06-09). Fix: /etc/hosts pin 10.0.20.203 forgejo.viktorbarzin.me on every node — covers resolve + token + blob legs with correct SNI and valid cert. Applied live to all 7 nodes; persisted in the cloud-init bootstrap and the existing-node rollout script. Docs updated (registry bullet, dns.md hairpin scope + stale .200 literals, runbook) + post-mortem. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 07:15:24 +00:00
Viktor Barzin	eb8695743b	workstation: fix setup-devvm.sh provisioner correctness (claude detect, kubelogin pin, codex auth, t3-serve dir) - claude-code: detect via `npm ls -g` not `command -v claude` — the admin's personal ~/.local/bin/claude shadowed the PATH check, so the system-wide install never ran (/usr/lib/node_modules/@anthropic-ai empty, no /usr/bin/claude; fresh non-admins had no claude). Found during the devvm reproducibility audit. - kubelogin: pin v1.36.2 instead of releases/latest/download, so two fresh boxes built weeks apart are byte-identical. - /etc/t3-serve: mkdir before the token writes (install -m doesn't create the parent — section 8 would fail on a fresh box). - codex shared auth: stage /opt/codex-shared/auth.json from Vault secret/workstation.codex_shared_auth_json (key already existed but nothing consumed it — was a manual step lost on rebuild), mirroring the Claude token. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:54 +00:00
Viktor Barzin	8886ac7763	backup: fix vzdump-vms exit code — EXIT-trap && short-circuit falsely failed OK runs First live run produced a valid 40G dump and logged status=0, but the service exited 1/FAILURE: cleanup() used `[ -n "$KILLED" ] && push_metrics 2 0`, and a bash EXIT trap whose LAST command returns non-zero overrides the script's `exit 0`. With KILLED empty the && short-circuits -> returns 1 -> a successful backup is marked failed (would trip a vzdump staleness/failure alert). Switch to daily-backup's `if…fi` idiom (returns 0 when not killed). Bug reproduced + fix verified locally; redeployed to PVE + reset-failed. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:54 +00:00
Viktor Barzin	7330cb6a0b	backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap The hand-managed Linux VMs (not in Terraform) were never imaged: the PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has no remote). vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd. Pushgateway job vzdump-backup. Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image layer + protection matrix), infra CLAUDE.md, AGENTS.md. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:54 +00:00
Viktor Barzin	3e7093947d	t3: bump pin 0.0.24 -> 0.0.26 (fable-5) [ci skip] Completes the 0.0.26 adoption prepared in fcb84ce0 (version-agnostic dispatch browser-session/bootstrap fallback + Gate-2 real pairing health-check + per-user state.sqlite backup). 0.0.26 verified end-to-end on the devvm: emo + ancamilea auto-pair via t3-dispatch (302 + Set-Cookie t3_session) after migrating state.sqlite 30->32; pre-cutover backups in /var/backups/t3-state. Brings claude-fable-5 into the t3 model picker. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	dacd9d2d8a	t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip] Investigated the 0.0.25 break: it is ONLY an endpoint rename (/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing contract (credential payload, t3_session cookie, /api/auth/session) is byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep): - t3-dispatch: autoPair tries /api/auth/browser-session, falls back to /api/auth/bootstrap on 404 — one binary pairs across both versions and any rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25 before, green after). Built, deployed, verified live on 0.0.24 (all three users still 302 + t3_session via the fallback). - t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad build now auto-rolls-back. Validated against both versions. - t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3 state.sqlite (was the only copy, unbacked) -> the one-way forward schema migration becomes a restore, not sqlite surgery. timeout-guarded. - runbooks/t3-version-bump.md: the reversible cutover checklist. - post-mortem #5 (health-check) DONE + #6 added; service-catalog updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	baac46415f	t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip] The t3-autoupdate timer (re-enabled by the provisioner's step 5b with `--now`, which fires the missed daily job immediately on a Persistent timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions role->scopes, +proof_key_thumbprint) AND changed the bootstrap API, breaking t3-mint/pairing for ALL devvm users (pair prompt, no session). - t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a nightly tracker -- re-asserts the pin (a no-op when correct). - t3-provision-users.sh step 5b: drop `--now` (it triggered the immediate missed-job run that pulled the bad build). - setup-devvm.sh: install pinned t3@0.0.24 at machine setup. - unit Descriptions + service-catalog reflect the pin. - post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md. Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled the (now-pinned) enforcer, reset the 2 new users' disposable DBs, surgically reverted wizard's auth tables to level-30 (96 threads + live session preserved). All users verified 302 + t3_session. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	41c11216da	t3-dispatch: re-pair on present-but-invalid t3_session cookie The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09 auth-schema rollback wiped all server-side sessions, browsers kept dead 30-day t3_session cookies; the dispatcher proxied them straight through and t3 rendered its pair page ("all users must pair again"). Now a present cookie on a top-level document navigation is validated via the instance's /api/auth/session and re-paired on authenticated:false. Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html) so XHR/asset/WebSocket sub-requests are never answered with a 302; fails open (proxy through) on any validation error. Unit + handler tests added. [ci skip] Co-Authored-By: Claude <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	e0ab621cb2	workstation: fix new-user .env clobber — env_set preserves CLAUDE_CODE_OAUTH_TOKEN The port-write used '>' (overwrite), wiping the token injected earlier in the same run for a NEW user (existing users like anca survived only because their .env already had the T3_PORT line). New env_set() does update-or-append per key, preserving others. Verified end-to-end: throwaway t3probe provisioned from scratch -> .env has both T3_PORT + CLAUDE_CODE_OAUTH_TOKEN -> claude -p AUTHOK. So all new non-admins now authenticate automatically. NOT pushed (shared-tree divergence hold). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	39e35ca8c9	workstation: share admin Claude subscription with non-admins (CLAUDE_CODE_OAUTH_TOKEN) Non-admins without their own ~/.claude login get the shared long-lived sk-ant-oat01 token injected into their t3-serve env, so their agent authenticates against the admin's subscription. setup-devvm.sh stages it from Vault secret/workstation.claude_oauth_token (root-readable); the provisioner's install_user_claude_token injects per-user, if-absent (never clobbers emo's own login). Live-fixed anca (verified AUTHOK); this codifies it for reproducibility + future users. NOT pushed (shared-tree divergence hold). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	1edccedb1f	workstation: v2 membership implementation plan [ci skip] 8 tasks: engine derive_os_user + roster_from_members (TDD); read-only Authentik token (TF); setup-devvm.sh stages it; provisioner sources T3 Users members from the Authentik API (replaces roster.yaml); Authentik-managed membership + legacy os_user attributes; retire roster.yaml; e2e add/remove smoke. Pairs with the 2026-06-09 design doc. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	87702bdce8	feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip] New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible /v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on 8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/. Option A off-peak control (no VRAM isolation on the time-sliced T4 — see post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00 ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at 06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a skipped/aborted window backfills next time. SA+Role+RoleBinding grant the CronJobs deployments/scale (nextcloud-watchdog pattern). Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the `tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure — never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses the base exclude list (tts untouched there). tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to local.app_env (no token — ClusterIP only). No tripit code change. Image build is documented in stacks/tts/README.md (devnen cu128 target -> forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline (large CUDA image + needs the upstream repo). NOT APPLIED — review branch only. Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the measured chatterbox-multilingual T4 peak during the first bake. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
viktor	edaee13be3	docs(ci-cd): tripit auto-deploy (GHA->Woodpecker 167) + svu semver in GHA [ci skip]	2026-06-09 21:41:53 +00:00
Viktor Barzin	4b44db36da	workstation: skel start-claude.sh inherits managed default model (drop hardcoded --model) The per-user launcher hardcoded --model claude-opus-4-8; an explicit --model flag overrides the managed default in /etc/claude-code/managed-settings.json (claude-fable-5). Dropping it lets emo and all new accounts inherit the org default (per-session /model still works). Deployed to /etc/skel and emo live copy in the same change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	64413c76ce	workstation: default Claude model = claude-fable-5 for all devvm users Adds a model key (claude-fable-5) to the machine-wide managed-settings.json (installed to /etc/claude-code/ by setup-devvm.sh). Sets the default model for every Claude Code session on the devvm (CLI + t3 web) at top settings precedence; per-session /model and explicit --model flags still override. The org claudeMd block is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
viktor	93ec0c66fd	docs(ci-cd): add off-infra GHA->GHCR build pattern for private Forgejo repos (tripit pilot) [ci skip]	2026-06-09 21:41:53 +00:00
viktor	90b8312a29	tripit: build off-infra via GHA -> GHCR (private), pull via scoped ghcr-credentials Switches tripit image from forgejo.viktorbarzin.me (in-cluster buildkit, sdc load) to ghcr.io/viktorbarzin/tripit built by GitHub Actions (Forgejo push-mirror -> private GitHub -> GHA -> GHCR). Adds a tripit-ns-scoped GHCR pull secret (github_pat, interim). Verified: deploy on :c8dfb5cb ready, ingest-plans CronJob pulled :latest + Succeeded. [ci skip]	2026-06-09 21:41:53 +00:00
Viktor Barzin	e0452611b5	forgejo: survive CI-build registry-push storms (mem 3Gi + working retention) Heavy in-cluster builds (e.g. tripit buildkit) were taking Forgejo down via two vectors. Fixes both, without moving Forgejo off the sdc HDD (code-oflt deferred): - Memory 1Gi -> 3Gi (requests=limits). Forgejo was OOMKilled (exit 137) under registry-push load; VPA upperBound ~1.5Gi was suppressed by the 1Gi cap it kept OOMing against. Size for the push spike. - Activate registry retention (DRY_RUN false). Verified the delete list against all running viktor/* images first: 0 running images affected. Pruned 478 -> 161 package versions; PVC was at its 50Gi autoresize ceiling. - FIX broken retention auth: the cleanup PAT was ci-pusher's, but Forgejo scopes container packages per-user, so DELETE on viktor/* returned 403 (the dry-run only did GETs, hiding it). Repointed forgejo_cleanup_token to viktor's write:package PAT. Retention had never actually worked. - Protect buildkit cache tags from retention (cleanup.sh keep-set) so the gentler-builds layer cache survives daily pruning. [ci skip] — already applied via scripts/tg. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	bc37b16815	backup: fix vzdump-vms exit code — EXIT-trap && short-circuit falsely failed OK runs First live run produced a valid 40G dump and logged status=0, but the service exited 1/FAILURE: cleanup() used `[ -n "$KILLED" ] && push_metrics 2 0`, and a bash EXIT trap whose LAST command returns non-zero overrides the script's `exit 0`. With KILLED empty the && short-circuits -> returns 1 -> a successful backup is marked failed (would trip a vzdump staleness/failure alert). Switch to daily-backup's `if…fi` idiom (returns 0 when not killed). Bug reproduced + fix verified locally; redeployed to PVE + reset-failed. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:30:19 +00:00
Viktor Barzin	83f418159a	backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap The hand-managed Linux VMs (not in Terraform) were never imaged: the PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has no remote). vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd. Pushgateway job vzdump-backup. Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image layer + protection matrix), infra CLAUDE.md, AGENTS.md. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:30:19 +00:00
Viktor Barzin	7fc4caefe3	t3: bump pin 0.0.24 -> 0.0.26 (fable-5) [ci skip] Completes the 0.0.26 adoption prepared in fcb84ce0 (version-agnostic dispatch browser-session/bootstrap fallback + Gate-2 real pairing health-check + per-user state.sqlite backup). 0.0.26 verified end-to-end on the devvm: emo + ancamilea auto-pair via t3-dispatch (302 + Set-Cookie t3_session) after migrating state.sqlite 30->32; pre-cutover backups in /var/backups/t3-state. Brings claude-fable-5 into the t3 model picker. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	bccaa08d8e	t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip] Investigated the 0.0.25 break: it is ONLY an endpoint rename (/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing contract (credential payload, t3_session cookie, /api/auth/session) is byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep): - t3-dispatch: autoPair tries /api/auth/browser-session, falls back to /api/auth/bootstrap on 404 — one binary pairs across both versions and any rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25 before, green after). Built, deployed, verified live on 0.0.24 (all three users still 302 + t3_session via the fallback). - t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad build now auto-rolls-back. Validated against both versions. - t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3 state.sqlite (was the only copy, unbacked) -> the one-way forward schema migration becomes a restore, not sqlite surgery. timeout-guarded. - runbooks/t3-version-bump.md: the reversible cutover checklist. - post-mortem #5 (health-check) DONE + #6 added; service-catalog updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	5ea238c707	t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip] The t3-autoupdate timer (re-enabled by the provisioner's step 5b with `--now`, which fires the missed daily job immediately on a Persistent timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions role->scopes, +proof_key_thumbprint) AND changed the bootstrap API, breaking t3-mint/pairing for ALL devvm users (pair prompt, no session). - t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a nightly tracker -- re-asserts the pin (a no-op when correct). - t3-provision-users.sh step 5b: drop `--now` (it triggered the immediate missed-job run that pulled the bad build). - setup-devvm.sh: install pinned t3@0.0.24 at machine setup. - unit Descriptions + service-catalog reflect the pin. - post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md. Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled the (now-pinned) enforcer, reset the 2 new users' disposable DBs, surgically reverted wizard's auth tables to level-30 (96 threads + live session preserved). All users verified 302 + t3_session. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	2125651aaa	t3-dispatch: re-pair on present-but-invalid t3_session cookie The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09 auth-schema rollback wiped all server-side sessions, browsers kept dead 30-day t3_session cookies; the dispatcher proxied them straight through and t3 rendered its pair page ("all users must pair again"). Now a present cookie on a top-level document navigation is validated via the instance's /api/auth/session and re-paired on authenticated:false. Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html) so XHR/asset/WebSocket sub-requests are never answered with a 302; fails open (proxy through) on any validation error. Unit + handler tests added. [ci skip] Co-Authored-By: Claude <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	fad10a8707	workstation: fix new-user .env clobber — env_set preserves CLAUDE_CODE_OAUTH_TOKEN The port-write used '>' (overwrite), wiping the token injected earlier in the same run for a NEW user (existing users like anca survived only because their .env already had the T3_PORT line). New env_set() does update-or-append per key, preserving others. Verified end-to-end: throwaway t3probe provisioned from scratch -> .env has both T3_PORT + CLAUDE_CODE_OAUTH_TOKEN -> claude -p AUTHOK. So all new non-admins now authenticate automatically. NOT pushed (shared-tree divergence hold). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	eeadf0f85d	workstation: share admin Claude subscription with non-admins (CLAUDE_CODE_OAUTH_TOKEN) Non-admins without their own ~/.claude login get the shared long-lived sk-ant-oat01 token injected into their t3-serve env, so their agent authenticates against the admin's subscription. setup-devvm.sh stages it from Vault secret/workstation.claude_oauth_token (root-readable); the provisioner's install_user_claude_token injects per-user, if-absent (never clobbers emo's own login). Live-fixed anca (verified AUTHOK); this codifies it for reproducibility + future users. NOT pushed (shared-tree divergence hold). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	fbcc330214	workstation: v2 membership implementation plan [ci skip] 8 tasks: engine derive_os_user + roster_from_members (TDD); read-only Authentik token (TF); setup-devvm.sh stages it; provisioner sources T3 Users members from the Authentik API (replaces roster.yaml); Authentik-managed membership + legacy os_user attributes; retire roster.yaml; e2e add/remove smoke. Pairs with the 2026-06-09 design doc. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	48013a4a92	feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip] New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible /v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on 8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/. Option A off-peak control (no VRAM isolation on the time-sliced T4 — see post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00 ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at 06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a skipped/aborted window backfills next time. SA+Role+RoleBinding grant the CronJobs deployments/scale (nextcloud-watchdog pattern). Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the `tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure — never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses the base exclude list (tts untouched there). tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to local.app_env (no token — ClusterIP only). No tripit code change. Image build is documented in stacks/tts/README.md (devnen cu128 target -> forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline (large CUDA image + needs the upstream repo). NOT APPLIED — review branch only. Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the measured chatterbox-multilingual T4 peak during the first bake. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
viktor	b1a6391a4d	docs(ci-cd): tripit auto-deploy (GHA->Woodpecker 167) + svu semver in GHA [ci skip]	2026-06-09 19:41:08 +00:00
Viktor Barzin	68a237faf7	workstation: skel start-claude.sh inherits managed default model (drop hardcoded --model) The per-user launcher hardcoded --model claude-opus-4-8; an explicit --model flag overrides the managed default in /etc/claude-code/managed-settings.json (claude-fable-5). Dropping it lets emo and all new accounts inherit the org default (per-session /model still works). Deployed to /etc/skel and emo live copy in the same change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 19:35:29 +00:00
Viktor Barzin	64f405db36	workstation: default Claude model = claude-fable-5 for all devvm users Adds a model key (claude-fable-5) to the machine-wide managed-settings.json (installed to /etc/claude-code/ by setup-devvm.sh). Sets the default model for every Claude Code session on the devvm (CLI + t3 web) at top settings precedence; per-session /model and explicit --model flags still override. The org claudeMd block is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 18:31:27 +00:00
viktor	8eb0bb244f	docs(ci-cd): add off-infra GHA->GHCR build pattern for private Forgejo repos (tripit pilot) [ci skip]	2026-06-09 18:20:54 +00:00
viktor	1f23ba6929	tripit: build off-infra via GHA -> GHCR (private), pull via scoped ghcr-credentials Switches tripit image from forgejo.viktorbarzin.me (in-cluster buildkit, sdc load) to ghcr.io/viktorbarzin/tripit built by GitHub Actions (Forgejo push-mirror -> private GitHub -> GHA -> GHCR). Adds a tripit-ns-scoped GHCR pull secret (github_pat, interim). Verified: deploy on :c8dfb5cb ready, ingest-plans CronJob pulled :latest + Succeeded. [ci skip]	2026-06-09 18:18:13 +00:00
Viktor Barzin	c5bda77731	forgejo: survive CI-build registry-push storms (mem 3Gi + working retention) Heavy in-cluster builds (e.g. tripit buildkit) were taking Forgejo down via two vectors. Fixes both, without moving Forgejo off the sdc HDD (code-oflt deferred): - Memory 1Gi -> 3Gi (requests=limits). Forgejo was OOMKilled (exit 137) under registry-push load; VPA upperBound ~1.5Gi was suppressed by the 1Gi cap it kept OOMing against. Size for the push spike. - Activate registry retention (DRY_RUN false). Verified the delete list against all running viktor/* images first: 0 running images affected. Pruned 478 -> 161 package versions; PVC was at its 50Gi autoresize ceiling. - FIX broken retention auth: the cleanup PAT was ci-pusher's, but Forgejo scopes container packages per-user, so DELETE on viktor/* returned 403 (the dry-run only did GETs, hiding it). Repointed forgejo_cleanup_token to viktor's write:package PAT. Retention had never actually worked. - Protect buildkit cache tags from retention (cleanup.sh keep-set) so the gentler-builds layer cache survives daily pruning. [ci skip] — already applied via scripts/tg. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 14:36:17 +00:00
Viktor Barzin	1e6e5c4ee9	t3code: enable t3-autoupdate.timer from the hourly provisioner The unit files (t3-autoupdate.{timer,service,sh}) were committed but nothing ever enabled the timer, so it sat `disabled` and every t3-serve@ instance silently froze on an old t3 build (all users were on v0.0.24 while nightly was 0.0.25-nightly.20260608). Enable it from the hourly reconciler (not the once-at-provision setup-devvm.sh) so it self-heals if ever disabled again. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 14:09:55 +00:00
Viktor Barzin	fd0f4a0365	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip] `6d224861` came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:45:33 +00:00
Viktor Barzin	6d224861c4	stem95su: scheduled Drive->site sync CronJob (every 10m) CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard + --max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault secret/stem95su. Requires the GCP OAuth app published to Production or the refresh token expires ~weekly. Lands the gdrive-sync stack on master (it had landed on a feature branch by accident on the shared devvm checkout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:42:26 +00:00
Viktor Barzin	05b50d2b96	workstation: v2 membership design — Authentik-group-driven, email-identified [ci skip] Supersedes the MEMBERSHIP model of the 2026-06-07 design (roster.yaml SSoT). Key principle: workstation access (T3 Users group membership) is decoupled from cluster authorization (k8s_users + kubernetes-* groups, untouched). A user is defined once in Authentik: email + T3 Users membership + optional os_user attribute. Provisioner reconciles accounts from the Authentik API; roster.yaml retires. v1 foundation (config inheritance, locked clone, kubeconfig, swap, hardening, emo cutover) unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 07:19:25 +00:00
Viktor Barzin	98fe65e345	storage: migrate priority-pass uploads off proxmox-lvm-encrypted to NFS (Phase 1) Boarding-pass images, no embedded DB. Drops LUKS-at-rest (low-sensitivity, accepted). 21.8M copied + verified on NFS; pod 2/2 on NFS; frees one proxmox-csi slot. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 18:47:07 +00:00
Viktor Barzin	06f5c12476	workstation: setup-devvm.sh hardens the admin's unlocked tree (o-rx, not world-readable) Codifies the leak fix found during the emo cutover: /home/wizard/code is git-crypt-DECRYPTED in the admin's working tree, but was mode 0775 (o+rx) — so any devvm user (even outside code-shared) could read decrypted secrets by path (verified: emo read certificate.pfx as plaintext DER). setup-devvm.sh now chmod o-rx the admin tree so a rebuild keeps it. Live fix already applied (now drwxrws---). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 18:08:52 +00:00
Viktor Barzin	37626cb89b	workstation: docs — mark RBAC + Authentik gate applied [ci skip] multi-tenancy.md + service-catalog.md status: per-user OIDC kubeconfig, oidc-power-user-readonly ClusterRole, emo k8s_users entry, and the Authentik T3 Users edge gate are now applied + verified. Remaining: emo cutover (Phase 5, held), offboarding apply-side (Phase 7), per-user MCP injection, roster-reconciled group membership. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 17:51:44 +00:00
Viktor Barzin	5c378dd5e3	workstation: gate t3.viktorbarzin.me to the T3 Users group (Phase 4) New authentik_group 'T3 Users' (members wizard/emo/ancamilea via data lookups — usernames ARE their emails in this Authentik instance) + a branch in the admin-services-restriction expression policy gating t3.viktorbarzin.me to that group, placed BEFORE the ADMIN_ONLY_HOSTS early-return. Surgical two-step targeted apply (group-with-members first, then the gate) → zero lock-out window. Verified: group has all 3 members, the live policy contains the t3 branch, t3 still 302s to Authentik. Membership is HCL for now (FUTURE: roster-reconciled via the Authentik API). Note: the authentik stack had 3 unrelated pending drift changes (pgbouncer deployment + 2 tls_secrets) — deliberately NOT applied (targeted apply isolated this change; left for the stack owner). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 17:50:40 +00:00
Viktor Barzin	173b1fc116	workstation: per-user OIDC kubectl — power-user-readonly RBAC + kubeconfig (Phase 2.2) New oidc-power-user-readonly ClusterRole (cluster-wide get/list/watch, NO secrets/exec/write); the power-user binding re-pointed to it (the existing read+write+secrets oidc-power-user role is retained but UNBOUND per ADR-0005). Applied to the rbac stack (2 add, 1 change, 0 destroy). emo added to Vault k8s_users (secret/platform) as power-user, email emil.barzin@gmail.com — the OIDC email IS the Authentik username (verified live). Verified via impersonation: emo gets cluster-wide read, NO secrets/write/exec/delete; anca unchanged. Provisioner: install_user_kubeconfig writes a per-user OIDC kubeconfig (kubelogin/PKCE — the kubernetes Authentik client is public, no secret; server+CA copied from the admin kubeconfig) if-absent. Written for emo + ancamilea (0600). End-to-end login is interactive (browser OIDC); verified config validity + RBAC, not the live browser flow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 17:47:00 +00:00
Viktor Barzin	c611ecf84d	workstation: docs — multi-tenancy Workstation section + offboard runbook + service-catalog fix [ci skip] multi-tenancy.md: new DevVM Workstation section (roster SSoT, tiers, config inheritance, locked clone, built-vs-gated status). service-catalog.md t3code row: corrected the stale 'source of truth = /etc/ttyd-user-map' (now roster.yaml; the map/dispatch are GENERATED). offboard-user.md: written (was a referenced-but-missing dead link) — staged reversible-cut-then-gated-destructive for both cluster + workstation surfaces. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 14:27:17 +00:00
Viktor Barzin	08bf1e0a3a	workstation: per-user writable git-crypt-locked infra clone (Phase 3.1) install_locked_clone: non-admins get their OWN ~/code = a keyless clone of the public infra repo (the monorepo has no remote, so the locked clone is of infra). filter.git-crypt=cat + --no-checkout ⇒ code/docs plaintext, secret files (.tfvars/.tfstate/secrets/**) stay \0GITCRYPT\0 ciphertext. Writable + ungated (push != apply). Skip-if-exists ⇒ never touches emo's existing ~/code symlink (gated cutover handles that). Verified live on ancamilea: secrets ciphertext, code plaintext, commit works, emo untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 14:23:57 +00:00
Viktor Barzin	2c1865eabb	workstation: roster-driven provisioner (SSoT reconcile, additive-only) t3-provision-users.sh now consumes roster_engine.py: derives accounts + per-tier groups + sticky ports + /etc/ttyd-user-map + dispatch.json from roster.yaml and applies them. ADDITIVE-ONLY for existing users (never strips a group, replaces a home, or re-locks an account) so the hourly timer is always safe. Best-effort tier validation vs live k8s_users: warns on a net-new absent user (emo), aborts only on a real tier conflict, skips when root has no Vault token. DRY_RUN mode for safe testing. Verified on the live host: reproduces dispatch.json content exactly, emo/anca groups + all t3-serve instances unchanged, idempotent, shellcheck-clean; deployed to /usr/local/bin (hourly timer target). Engine: validate_tiers now returns ValidationIssue(severity) — error=conflict (abort) vs warn=absent (grant pending) — + has_blocking_errors(); 28 pytest cases. setup-devvm.sh redeploys the provisioner for reproducibility. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 14:18:12 +00:00
Viktor Barzin	3feb69e379	workstation: pin verified config-inheritance mechanism in design §4 [ci skip] Spike GO (claude 2.1.168): managed claudeMd reaches a session; no managed-skills key exists so skills/rules inherit via per-user ~/.claude symlinks to the base (seeded in /etc/skel). Records the settings.json 0664->0600 leak fix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 14:09:13 +00:00

4168 commits