infra

Author	SHA1	Message	Date
Viktor Barzin	e7fbf986fb	workstation: rename tmux persistence out of the t3 namespace [ci skip] Viktor's correction: this feature is about the tmux web-terminal sessions, not t3 — t3 auto-saves its own threads (~/.t3 state + daily t3-backup-state). Renamed t3-tmux-sessions -> tmux-persist (units tmux-persist-save.timer / tmux-persist-restore.service, state /var/lib/tmux-persist), header rescoped to say exactly that. Same mechanism, correct taxonomy. Old units removed, state migrated, re-verified live (5 emo + 3 wizard sessions snapshotted). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:42:52 +00:00
Viktor Barzin	2e4f48f3fc	workstation: tmux sessions survive devvm reboots (save timer + boot restore) Viktor: emo's open web-terminal sessions must persist across reboots. Claude conversations were already durable on disk; the volatile part was the tmux wiring (which named session runs which conversation). t3-tmux-sessions save (5-min timer) snapshots every roster user's sessions to /var/lib/t3-tmux-state/<user>.tsv — conversation uuid taken from argv --resume (self-sustaining once restored) or the newest transcript in the cwd-slug project dir created after process start (fresh launcher sessions; claude does NOT hold its transcript fd open, so fd-sniffing was a dead end). t3-tmux-sessions restore (boot oneshot, also safe after partial loss) recreates missing sessions with claude --resume <uuid>. Reconciler self-heals both units' enablement. Verified live: emo's 5 sessions snapshotted with correct uuids; killed R730-cooling -> restore brought it back resuming the same conversation (context meter identical); other sessions untouched. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:39:32 +00:00
Viktor Barzin	59a531b8e0	coredns: pods get internal split-horizon answers for viktorbarzin.me [ci skip] Forward the viktorbarzin.me:53 pod block to the Technitium ClusterIP (10.96.0.53, same as the .lan block) instead of 8.8.8.8/1.1.1.1. Pods become ordinary internal clients (CNAME -> apex -> live Traefik LB; mail -> 10.0.20.1), fixing the 27 non-proxied [External] uptime-kuma monitors that rode the TP-Link NAT loopback (hard-down since 06-09; loopback refuses flows whose source equals the reflection target, which all pfSense-SNAT'd cluster traffic does). Enabled by re-testing a stale premise: on k8s 1.34 pods DO reach the ETP=Local Traefik LB IP (kube-proxy short-circuits in-cluster traffic to LB IPs; verified from pods on three non-Traefik nodes) — re-verify after major k8s upgrades; canary = [External] fleet going red. The NAT-layer alternatives (pfSense rdr, SNAT-drop) were rejected: both fight return-path asymmetry and deepen TP-Link dependency. Verified in-pod: immich -> .203 + HTTPS 200, mail -> 10.0.20.1, forgejo -> Traefik ClusterIP (pin kept for Technitium-outage resilience). Proxied [External] monitors now test the internal path — true edge fidelity moves to the external vantage (ha-london, next fix). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 16:21:34 +00:00
Viktor Barzin	35c89fa90c	workstation: managed Claude config self-deploys from the repo [ci skip] Viktor's claudeMd edits must keep reaching every user now that emo is out of the shared tree. Two reconciler additions: - sync_managed_config: installs scripts/workstation/managed-settings.json to /etc/claude-code whenever the repo copy changes — editing the org claudeMd is now edit + commit, no manual install step - refresh_codex_mirror: regenerates each user's ~/.codex/AGENTS.md (static mirror of the claudeMd; header-guarded so user-customized files are never clobbered) Verified live: corrupted emo's mirror -> reconcile restored it; wizard's stale mirror refreshed; in-sync managed config no-ops. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 16:03:24 +00:00
Viktor Barzin	8cfd0e5e5c	Merge forgejo/master: reconcile diverged lineages [ci skip] Local checkout carried the 2026-06-10 DNS/registry architecture series (pfSense forward-zone, CoreDNS viktorbarzin.me:53 carve-out, nodes stock) + vzdump/nfs-mirror/workstation-rebuild commits that never reached the canonical remote, while forgejo master received the emo-access series via isolated worktrees. Viktor asked to merge. Conflict resolutions (newest iteration wins in each file): - stacks/forgejo/cleanup.tf: LOCAL — dry_run=true (2026-06-10 revert after live retention orphaned OCI indexes; remote had 06-09 enable) - .claude/CLAUDE.md, docs/architecture/backup-dr.md: LOCAL — final registry/DNS architecture + implemented vzdump alerts - scripts/workstation/setup-devvm.sh: LOCAL — pinned-version, reproducible-rebuild refactor (kubelogin pin, restructured staging) - scripts/workstation/managed-settings.json: FORGEJO — the allow-then-audit claudeMd (matches /etc deployment byte-for-byte) - scripts/t3-provision-users.sh: FORGEJO comment; refresh_locked_clone intact [ci skip]: all stack changes in the local lineage were applied live this morning — CI would re-walk 100+ stacks via the modules/ fallback for zero state change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:21:50 +00:00
Viktor Barzin	a34f9ff3b8	docs: infra Woodpecker repo-82 ops — in-cluster webhook, secret parity, empty-commit gotcha [ci skip] Emo's first direct pushes surfaced three latent CI issues, all fixed out-of-band today and recorded here: webhook deliveries to ci.viktorbarzin.me timing out on the public-IP hairpin (hook now targets the in-cluster woodpecker-server service), repo 82 registered without the repo-scoped secret set (cloned from repo 1 in the DB), and empty commits compiling every workflow so missing secrets hard-error. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:09:17 +00:00
Emil Barzin	63161ef3a5	test: final audit-pipeline verification Repo-82 Woodpecker secrets were missing (repo-1 set cloned over) and the webhook now targets the in-cluster service. This push should run the full pipeline: Slack audit ping + no-op apply.	2026-06-10 15:07:15 +00:00
Emil Barzin	619b7608fa	test: verify audit pipeline fires on emo push Second verification: the Forgejo->Woodpecker webhook was timing out on the public-IP hairpin (first test push fired no pipeline), so it now targets the in-cluster Woodpecker service. This push should produce a pipeline with the notify-nonadmin-push Slack step.	2026-06-10 15:03:48 +00:00
Emil Barzin	0f45585b53	test: verify emo direct master push (allow-then-audit) Viktor granted emo direct push to master on 2026-06-10 — any change allowed, tracked via commit messages + the Slack audit feed. This empty commit verifies the whitelist and exercises the new notify-nonadmin-push CI step end-to-end.	2026-06-10 14:54:04 +00:00
Viktor Barzin	a49d1eadf6	workstation: emo direct master push — allow-then-audit [ci skip] Viktor: emo may make any change; what matters is tracking what changed and why. ebarzin added to master push+merge whitelists (force-push stays disabled — append-only history). Tracking enforced three ways: - agent instructions (managed claudeMd + AGENTS.md): commit body MUST carry the user's plain-language intent; commits land on master directly; [ci skip] forbidden for non-admins - new notify-nonadmin-push step in .woodpecker/default.yml: Slack message for every non-admin master push (admin pushes silent) - PR flow remains the fallback for non-whitelisted users Accepted consequence (informed): emo's pushes auto-apply changed stacks via CI. Offboard runbook gains whitelist-removal step. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 14:53:43 +00:00
Viktor Barzin	6d8773573c	workstation: agent-driven contribute flow for non-technical users [ci skip] emo can't use git — his agent must do all VCS mechanics invisibly. Managed claudeMd (every session, top precedence) now instructs agents: commit -> push <os-user>/<topic> branch -> open PR via Forgejo API (user's PAT from ~/.git-credentials) -> back to clean master -> tell the user in plain words it's submitted for review. AGENTS.md carries the full recipe with the curl call. Verified live as emo: PR #1 opened (HTTP 201, write:repository scope suffices) and closed via his PAT. Deployed to /etc/claude-code/managed-settings.json; codex AGENTS.md mirrors for emo + ancamilea regenerated from the new claudeMd. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 10:12:26 +00:00
Viktor Barzin	2e5af5dc0e	workstation: keep non-admin infra clones fresh (hourly + at launch) [ci skip] Non-admins (emo) need current master without manual pulls. Two layers: - t3-provision-users reconcile gains refresh_locked_clone: fetch all remotes + ff-only master, guarded (on master, clean tree, upstream set); dirty/diverged clones are left alone with a WARN. - start-claude.sh freshens ~/code at session launch, 15s-capped so an offline remote never delays the session. Verified live on emo's clone: stale clone ff'd to tip by the reconciler; launcher snippet ff's when clean and refuses while a dirty file exists. Deployed to /usr/local/bin/t3-provision-users, /etc/skel/start-claude.sh, and emo's launcher. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 09:41:38 +00:00
Viktor Barzin	5d9417fbaa	workstation: emo contribute access + Phase-5 cutover done; gate master (push=apply) [ci skip] ADR-0004's premise was wrong: pushing master fires the Woodpecker apply pipeline (require_approval=forks only), so master pushes ARE deploys. Added Forgejo branch protection on master (push/merge whitelist=viktor, deploy keys allowed); non-admins contribute via branches + PRs. emo (ebarzin): write collaborator on viktor/infra, PAT in ~/.git-credentials, forgejo remote + upstream in his locked clone. Phase-5 finished: code-shared removed; ~/.claude symlinks kept (they ARE the skel shared-base mechanism — plan step 4c obsolete). Offboard runbook: revoke PAT + collaborator + group steps added. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 09:30:41 +00:00
Viktor Barzin	a1b7b0ca53	forgejo retention: revert to DRY_RUN — first live run orphaned OCI indexes [ci skip] The keep-set (newest 10 versions + latest + cache tags) treats multi-arch/attestation index CHILDREN — separate untagged sha256 versions — as deletable: for images not rebuilt recently they sort outside the newest-10 window and were pruned while their kept parent index survived. kms-website :latest and :dfc83fb children 404'd (RegistryManifestIntegrityFailure, caught by forgejo-integrity-probe within hours; deployed tag a794d1a unaffected). Healed: :latest re-pointed at the intact a794d1a index (also the newest commit), corrupt :dfc83fb version deleted, probe re-run clean (0 failures / 22 repos / 63 tags / 59 indexes). DRY_RUN=true applied live. Re-enable only with a container-aware keep-set — options in the post-mortem. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 09:22:47 +00:00
Viktor Barzin	e49c91e60c	monitoring: VzdumpBackup{Stale,NeverRun,Failing} alerts for the new VM-image backup vzdump-vms pushes vzdump_last_{run,success}_timestamp + vzdump_last_status to Pushgateway job vzdump-backup, but nothing alerted on them — a stopped/failing VM backup would be silent (exactly how the nfs-mirror reaping went unnoticed until I re-verified). Add the trio to the 3-2-1 group in prometheus_chart_values.tpl, mirroring the LVM/pfSense/nfs-mirror alerts. Stale = >~50h since last success. NOT [ci]-applied: this is a Terraform stack change — arms on the next `scripts/tg apply` of the monitoring stack (metrics already flow, so it arms immediately once applied). Admin-gated apply per org policy. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:10:46 +00:00
Viktor Barzin	05f928931f	workstation: packages.txt — add provisioner build deps + uncaptured core tools setup-devvm.sh now needs golang-go (builds t3-dispatch in section 9) and uses unzip (kubelogin extraction); neither was in the manifest, so a fresh box would skip the t3-dispatch build. Also add build-essential (cgo / npm native modules) + core tools that were manually-installed but uncaptured (rsync, wget, tree, shellcheck). Noted gh as non-apt (GitHub's own repo). All verified to resolve in apt. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:08:53 +00:00
Viktor Barzin	312c418a9a	workstation: setup-devvm.sh installs the systemd service layer (reproducible rebuild) The t3 system units (t3-serve@, t3-autoupdate, t3-backup-state, t3-provision-users, t3-dispatch) + the t3-dispatch Go binary + t3-mint + the sudoers grant were all hand-scp'd and would NOT survive a fresh devvm. setup-devvm.sh now installs + enables them: build-if-absent for the Go binary, visudo-validated sudoers (a malformed /etc/sudoers.d file breaks all sudo), timers self-heal, t3-dispatch system account created if absent. t3-serve@ stays a per-user template enabled by the provisioner; the ttyd terminal-lobby chain ships from its own repo (viktor/terminal-lobby). Verified: shellcheck clean, go build compiles, visudo parses the sudoers, units parse. NOT run live (would re-assert apt/npm on the shared host) — exercised on next rebuild. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:07:20 +00:00
Viktor Barzin	d9ea7812f5	nfs-mirror: exclude /vzdump/ — it was reaping the new VM-image backups nightly nfs-mirror does `rsync -rlt --delete /srv/nfs/ -> /mnt/backup/`; any /mnt/backup dir with no /srv/nfs counterpart is an orphan and gets --delete'd. vzdump-vms (added yesterday) writes /mnt/backup/vzdump/, which wasn't excluded — so the 02:00 nfs-mirror run silently deleted both successful 40G devvm images (verified: dir gone, 40G freed, despite status=0 success logs). Add --exclude='/vzdump/' alongside the existing pvc-data/pfsense/pve-config/sqlite-backup excludes that exist for exactly this reason. TDD-proven with an isolated rsync --delete -n -v. backup-dr.md notes the dependency. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:04:57 +00:00
Viktor Barzin	2b8c0def30	dns: pfSense forward-zone for viktorbarzin.me, nodes fully stock [ci skip] Round 3 of the forgejo-pull hairpin fix (per Viktor: no per-node customization — split-brain lives in the DNS infra): - pfSense Unbound domain override viktorbarzin.me -> Technitium 10.0.20.201 (applied via php write_config, backup on-box). Every Unbound client on every VLAN now gets the internal split-horizon answers (live Traefik IP via apex CNAME) with zero per-host config. - CoreDNS carve-out (TF, applied): dedicated viktorbarzin.me:53 block — forgejo pinned to Traefik ClusterIP via data source (pods cannot reach the ETP=Local LB IP pfSense now returns), all other .me names kept on public resolvers (pods' pre-existing behavior). Replaces the .:53 forgejo rewrite. - Removed the same-day resolved routing-domain drop-ins from all 7 nodes; node5/6 link DNS repointed Technitium -> pfSense (netplan + qm 205/206) for fleet parity; cloud-init no longer writes any DNS drop-ins. - Docs: dns.md, pfsense-unbound runbook (override + rollback), registry bullet, post-mortem final-architecture addendum. Verified: nodes resolve forgejo -> .203 via pfSense, crictl pull OK, pods resolve forgejo -> ClusterIP / others -> public, mail record works, .lan zone unaffected. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 08:32:34 +00:00
Viktor Barzin	1ee1bf0817	forgejo pulls: route *.viktorbarzin.me to Technitium, drop /etc/hosts pins [ci skip] Supersedes this morning's per-node /etc/hosts pin (no hardcoded service IPs on nodes, per Viktor). Technitium's split-horizon zone already resolves forgejo.viktorbarzin.me -> CNAME apex -> live Traefik LB IP (ingress-dns-sync auto-CNAMEs every ingress host; apex drift probe alerts) -- the nodes just never queried it. Rolled the devvm's systemd-resolved routing-domain pattern (~viktorbarzin.me -> 10.0.20.201) to all 7 nodes, removed the pins, verified getent + crictl pull via pure DNS. Also demoted node5/6's cloud-init global-dns.conf (DNS=8.8.8.8 1.1.1.1) to FallbackDNS-only: public servers in the global set race the routing domain. Its justification ("Technitium NXDOMAINs forgejo") was obsolete -- exactly the stale comment that pointed new nodes at the hairpin. hosts.toml mirror kept but documented as vestigial (Traefik 404s bare-IP requests; registry auth realm is an absolute URL). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 07:56:31 +00:00
Viktor Barzin	b6976ce014	forgejo pulls: pin registry name to internal Traefik in node /etc/hosts [ci skip] tuya-bridge was down 7.5h (ImagePullBackOff on k8s-node3): fresh kubelet pulls of forgejo.viktorbarzin.me images depended on the intermittently broken public-IP hairpin. The containerd hosts.toml mirror cannot keep pulls internal on its own — Traefik 404s its bare-IP requests (no Host/SNI match) and the registry Bearer realm is an absolute public URL fetched outside the mirror. Third incident of this class (buildkit 06-04, tripit/devvm 06-09). Fix: /etc/hosts pin 10.0.20.203 forgejo.viktorbarzin.me on every node — covers resolve + token + blob legs with correct SNI and valid cert. Applied live to all 7 nodes; persisted in the cloud-init bootstrap and the existing-node rollout script. Docs updated (registry bullet, dns.md hairpin scope + stale .200 literals, runbook) + post-mortem. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 07:15:24 +00:00
Viktor Barzin	eb8695743b	workstation: fix setup-devvm.sh provisioner correctness (claude detect, kubelogin pin, codex auth, t3-serve dir) - claude-code: detect via `npm ls -g` not `command -v claude` — the admin's personal ~/.local/bin/claude shadowed the PATH check, so the system-wide install never ran (/usr/lib/node_modules/@anthropic-ai empty, no /usr/bin/claude; fresh non-admins had no claude). Found during the devvm reproducibility audit. - kubelogin: pin v1.36.2 instead of releases/latest/download, so two fresh boxes built weeks apart are byte-identical. - /etc/t3-serve: mkdir before the token writes (install -m doesn't create the parent — section 8 would fail on a fresh box). - codex shared auth: stage /opt/codex-shared/auth.json from Vault secret/workstation.codex_shared_auth_json (key already existed but nothing consumed it — was a manual step lost on rebuild), mirroring the Claude token. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:54 +00:00
Viktor Barzin	8886ac7763	backup: fix vzdump-vms exit code — EXIT-trap && short-circuit falsely failed OK runs First live run produced a valid 40G dump and logged status=0, but the service exited 1/FAILURE: cleanup() used `[ -n "$KILLED" ] && push_metrics 2 0`, and a bash EXIT trap whose LAST command returns non-zero overrides the script's `exit 0`. With KILLED empty the && short-circuits -> returns 1 -> a successful backup is marked failed (would trip a vzdump staleness/failure alert). Switch to daily-backup's `if…fi` idiom (returns 0 when not killed). Bug reproduced + fix verified locally; redeployed to PVE + reset-failed. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:54 +00:00
Viktor Barzin	7330cb6a0b	backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap The hand-managed Linux VMs (not in Terraform) were never imaged: the PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has no remote). vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd. Pushgateway job vzdump-backup. Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image layer + protection matrix), infra CLAUDE.md, AGENTS.md. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:54 +00:00
Viktor Barzin	3e7093947d	t3: bump pin 0.0.24 -> 0.0.26 (fable-5) [ci skip] Completes the 0.0.26 adoption prepared in fcb84ce0 (version-agnostic dispatch browser-session/bootstrap fallback + Gate-2 real pairing health-check + per-user state.sqlite backup). 0.0.26 verified end-to-end on the devvm: emo + ancamilea auto-pair via t3-dispatch (302 + Set-Cookie t3_session) after migrating state.sqlite 30->32; pre-cutover backups in /var/backups/t3-state. Brings claude-fable-5 into the t3 model picker. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	dacd9d2d8a	t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip] Investigated the 0.0.25 break: it is ONLY an endpoint rename (/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing contract (credential payload, t3_session cookie, /api/auth/session) is byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep): - t3-dispatch: autoPair tries /api/auth/browser-session, falls back to /api/auth/bootstrap on 404 — one binary pairs across both versions and any rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25 before, green after). Built, deployed, verified live on 0.0.24 (all three users still 302 + t3_session via the fallback). - t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad build now auto-rolls-back. Validated against both versions. - t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3 state.sqlite (was the only copy, unbacked) -> the one-way forward schema migration becomes a restore, not sqlite surgery. timeout-guarded. - runbooks/t3-version-bump.md: the reversible cutover checklist. - post-mortem #5 (health-check) DONE + #6 added; service-catalog updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	baac46415f	t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip] The t3-autoupdate timer (re-enabled by the provisioner's step 5b with `--now`, which fires the missed daily job immediately on a Persistent timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions role->scopes, +proof_key_thumbprint) AND changed the bootstrap API, breaking t3-mint/pairing for ALL devvm users (pair prompt, no session). - t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a nightly tracker -- re-asserts the pin (a no-op when correct). - t3-provision-users.sh step 5b: drop `--now` (it triggered the immediate missed-job run that pulled the bad build). - setup-devvm.sh: install pinned t3@0.0.24 at machine setup. - unit Descriptions + service-catalog reflect the pin. - post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md. Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled the (now-pinned) enforcer, reset the 2 new users' disposable DBs, surgically reverted wizard's auth tables to level-30 (96 threads + live session preserved). All users verified 302 + t3_session. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	41c11216da	t3-dispatch: re-pair on present-but-invalid t3_session cookie The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09 auth-schema rollback wiped all server-side sessions, browsers kept dead 30-day t3_session cookies; the dispatcher proxied them straight through and t3 rendered its pair page ("all users must pair again"). Now a present cookie on a top-level document navigation is validated via the instance's /api/auth/session and re-paired on authenticated:false. Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html) so XHR/asset/WebSocket sub-requests are never answered with a 302; fails open (proxy through) on any validation error. Unit + handler tests added. [ci skip] Co-Authored-By: Claude <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	e0ab621cb2	workstation: fix new-user .env clobber — env_set preserves CLAUDE_CODE_OAUTH_TOKEN The port-write used '>' (overwrite), wiping the token injected earlier in the same run for a NEW user (existing users like anca survived only because their .env already had the T3_PORT line). New env_set() does update-or-append per key, preserving others. Verified end-to-end: throwaway t3probe provisioned from scratch -> .env has both T3_PORT + CLAUDE_CODE_OAUTH_TOKEN -> claude -p AUTHOK. So all new non-admins now authenticate automatically. NOT pushed (shared-tree divergence hold). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	39e35ca8c9	workstation: share admin Claude subscription with non-admins (CLAUDE_CODE_OAUTH_TOKEN) Non-admins without their own ~/.claude login get the shared long-lived sk-ant-oat01 token injected into their t3-serve env, so their agent authenticates against the admin's subscription. setup-devvm.sh stages it from Vault secret/workstation.claude_oauth_token (root-readable); the provisioner's install_user_claude_token injects per-user, if-absent (never clobbers emo's own login). Live-fixed anca (verified AUTHOK); this codifies it for reproducibility + future users. NOT pushed (shared-tree divergence hold). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	1edccedb1f	workstation: v2 membership implementation plan [ci skip] 8 tasks: engine derive_os_user + roster_from_members (TDD); read-only Authentik token (TF); setup-devvm.sh stages it; provisioner sources T3 Users members from the Authentik API (replaces roster.yaml); Authentik-managed membership + legacy os_user attributes; retire roster.yaml; e2e add/remove smoke. Pairs with the 2026-06-09 design doc. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	87702bdce8	feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip] New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible /v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on 8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/. Option A off-peak control (no VRAM isolation on the time-sliced T4 — see post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00 ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at 06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a skipped/aborted window backfills next time. SA+Role+RoleBinding grant the CronJobs deployments/scale (nextcloud-watchdog pattern). Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the `tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure — never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses the base exclude list (tts untouched there). tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to local.app_env (no token — ClusterIP only). No tripit code change. Image build is documented in stacks/tts/README.md (devnen cu128 target -> forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline (large CUDA image + needs the upstream repo). NOT APPLIED — review branch only. Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the measured chatterbox-multilingual T4 peak during the first bake. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
viktor	edaee13be3	docs(ci-cd): tripit auto-deploy (GHA->Woodpecker 167) + svu semver in GHA [ci skip]	2026-06-09 21:41:53 +00:00
Viktor Barzin	4b44db36da	workstation: skel start-claude.sh inherits managed default model (drop hardcoded --model) The per-user launcher hardcoded --model claude-opus-4-8; an explicit --model flag overrides the managed default in /etc/claude-code/managed-settings.json (claude-fable-5). Dropping it lets emo and all new accounts inherit the org default (per-session /model still works). Deployed to /etc/skel and emo live copy in the same change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	64413c76ce	workstation: default Claude model = claude-fable-5 for all devvm users Adds a model key (claude-fable-5) to the machine-wide managed-settings.json (installed to /etc/claude-code/ by setup-devvm.sh). Sets the default model for every Claude Code session on the devvm (CLI + t3 web) at top settings precedence; per-session /model and explicit --model flags still override. The org claudeMd block is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
viktor	93ec0c66fd	docs(ci-cd): add off-infra GHA->GHCR build pattern for private Forgejo repos (tripit pilot) [ci skip]	2026-06-09 21:41:53 +00:00
viktor	90b8312a29	tripit: build off-infra via GHA -> GHCR (private), pull via scoped ghcr-credentials Switches tripit image from forgejo.viktorbarzin.me (in-cluster buildkit, sdc load) to ghcr.io/viktorbarzin/tripit built by GitHub Actions (Forgejo push-mirror -> private GitHub -> GHA -> GHCR). Adds a tripit-ns-scoped GHCR pull secret (github_pat, interim). Verified: deploy on :c8dfb5cb ready, ingest-plans CronJob pulled :latest + Succeeded. [ci skip]	2026-06-09 21:41:53 +00:00
Viktor Barzin	e0452611b5	forgejo: survive CI-build registry-push storms (mem 3Gi + working retention) Heavy in-cluster builds (e.g. tripit buildkit) were taking Forgejo down via two vectors. Fixes both, without moving Forgejo off the sdc HDD (code-oflt deferred): - Memory 1Gi -> 3Gi (requests=limits). Forgejo was OOMKilled (exit 137) under registry-push load; VPA upperBound ~1.5Gi was suppressed by the 1Gi cap it kept OOMing against. Size for the push spike. - Activate registry retention (DRY_RUN false). Verified the delete list against all running viktor/* images first: 0 running images affected. Pruned 478 -> 161 package versions; PVC was at its 50Gi autoresize ceiling. - FIX broken retention auth: the cleanup PAT was ci-pusher's, but Forgejo scopes container packages per-user, so DELETE on viktor/* returned 403 (the dry-run only did GETs, hiding it). Repointed forgejo_cleanup_token to viktor's write:package PAT. Retention had never actually worked. - Protect buildkit cache tags from retention (cleanup.sh keep-set) so the gentler-builds layer cache survives daily pruning. [ci skip] — already applied via scripts/tg. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	bc37b16815	backup: fix vzdump-vms exit code — EXIT-trap && short-circuit falsely failed OK runs First live run produced a valid 40G dump and logged status=0, but the service exited 1/FAILURE: cleanup() used `[ -n "$KILLED" ] && push_metrics 2 0`, and a bash EXIT trap whose LAST command returns non-zero overrides the script's `exit 0`. With KILLED empty the && short-circuits -> returns 1 -> a successful backup is marked failed (would trip a vzdump staleness/failure alert). Switch to daily-backup's `if…fi` idiom (returns 0 when not killed). Bug reproduced + fix verified locally; redeployed to PVE + reset-failed. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:30:19 +00:00
Viktor Barzin	83f418159a	backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap The hand-managed Linux VMs (not in Terraform) were never imaged: the PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has no remote). vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd. Pushgateway job vzdump-backup. Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image layer + protection matrix), infra CLAUDE.md, AGENTS.md. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:30:19 +00:00
Viktor Barzin	7fc4caefe3	t3: bump pin 0.0.24 -> 0.0.26 (fable-5) [ci skip] Completes the 0.0.26 adoption prepared in fcb84ce0 (version-agnostic dispatch browser-session/bootstrap fallback + Gate-2 real pairing health-check + per-user state.sqlite backup). 0.0.26 verified end-to-end on the devvm: emo + ancamilea auto-pair via t3-dispatch (302 + Set-Cookie t3_session) after migrating state.sqlite 30->32; pre-cutover backups in /var/backups/t3-state. Brings claude-fable-5 into the t3 model picker. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	bccaa08d8e	t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip] Investigated the 0.0.25 break: it is ONLY an endpoint rename (/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing contract (credential payload, t3_session cookie, /api/auth/session) is byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep): - t3-dispatch: autoPair tries /api/auth/browser-session, falls back to /api/auth/bootstrap on 404 — one binary pairs across both versions and any rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25 before, green after). Built, deployed, verified live on 0.0.24 (all three users still 302 + t3_session via the fallback). - t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad build now auto-rolls-back. Validated against both versions. - t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3 state.sqlite (was the only copy, unbacked) -> the one-way forward schema migration becomes a restore, not sqlite surgery. timeout-guarded. - runbooks/t3-version-bump.md: the reversible cutover checklist. - post-mortem #5 (health-check) DONE + #6 added; service-catalog updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
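The endpoint fallback described for `autoPair` in bccaa08d8e can be sketched as follows. This is a shell illustration of the logic only, not the t3-dispatch code: `auto_pair`, `serve_old`, and `serve_new` are hypothetical names, and the two stub "servers" simply print an HTTP status for a path (302 for the endpoint they know, 404 otherwise) instead of making real requests.

```shell
#!/usr/bin/env bash
# Sketch: try the renamed endpoint first, fall back to the old one on 404.

auto_pair() {
  # $1 = name of a probe function that prints an HTTP status for a path
  local probe=$1 status
  status=$("$probe" /api/auth/browser-session)
  if [ "$status" = "404" ]; then
    # Older server: the renamed endpoint does not exist yet.
    status=$("$probe" /api/auth/bootstrap)
    echo "/api/auth/bootstrap $status"
  else
    echo "/api/auth/browser-session $status"
  fi
}

# Stub servers standing in for the two versions (assumed behaviour).
serve_old() { case $1 in /api/auth/bootstrap) echo 302 ;; *) echo 404 ;; esac; }
serve_new() { case $1 in /api/auth/browser-session) echo 302 ;; *) echo 404 ;; esac; }

old_result=$(auto_pair serve_old)
new_result=$(auto_pair serve_new)
echo "old server: $old_result"
echo "new server: $new_result"
```

Probing the new endpoint first means the same client works on either side of a version bump and during a rolling restart, at the cost of one extra request against older servers.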
Viktor Barzin	5ea238c707	t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip] The t3-autoupdate timer (re-enabled by the provisioner's step 5b with `--now`, which fires the missed daily job immediately on a Persistent timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions role->scopes, +proof_key_thumbprint) AND changed the bootstrap API, breaking t3-mint/pairing for ALL devvm users (pair prompt, no session). - t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a nightly tracker -- re-asserts the pin (a no-op when correct). - t3-provision-users.sh step 5b: drop `--now` (it triggered the immediate missed-job run that pulled the bad build). - setup-devvm.sh: install pinned t3@0.0.24 at machine setup. - unit Descriptions + service-catalog reflect the pin. - post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md. Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled the (now-pinned) enforcer, reset the 2 new users' disposable DBs, surgically reverted wizard's auth tables to level-30 (96 threads + live session preserved). All users verified 302 + t3_session. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	2125651aaa	t3-dispatch: re-pair on present-but-invalid t3_session cookie The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09 auth-schema rollback wiped all server-side sessions, browsers kept dead 30-day t3_session cookies; the dispatcher proxied them straight through and t3 rendered its pair page ("all users must pair again"). Now a present cookie on a top-level document navigation is validated via the instance's /api/auth/session and re-paired on authenticated:false. Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html) so XHR/asset/WebSocket sub-requests are never answered with a 302; fails open (proxy through) on any validation error. Unit + handler tests added. [ci skip] Co-Authored-By: Claude <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
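The decision rule in 2125651aaa for a request that already carries a `t3_session` cookie can be sketched as below. This is a shell illustration, not the dispatcher's implementation: the function names are hypothetical, and treating `Accept: text/html` as a fallback used only when `Sec-Fetch-Dest` is absent is one reading of the commit message.

```shell
#!/usr/bin/env bash
# Sketch: re-pair only on top-level document navigations whose session
# check reports authenticated:false; fail open on anything else.

is_document_nav() {
  # $1 = Sec-Fetch-Dest header value (may be empty), $2 = Accept header
  local dest=$1 accept=$2
  if [ -n "$dest" ]; then
    [ "$dest" = "document" ]
    return
  fi
  case $accept in
    *text/html*) return 0 ;;
    *) return 1 ;;
  esac
}

present_cookie_action() {
  # $1 = Sec-Fetch-Dest, $2 = Accept,
  # $3 = result of the session check: "true", "false", or "error"
  if ! is_document_nav "$1" "$2"; then
    echo proxy            # XHR, asset, WebSocket: never answer with a 302
  elif [ "$3" = "false" ]; then
    echo repair           # dead cookie on a page load: pair again
  else
    echo proxy            # valid session, or validation error (fail open)
  fi
}

a1=$(present_cookie_action document "text/html" false)
a2=$(present_cookie_action empty "*/*" false)
a3=$(present_cookie_action "" "text/html,application/xhtml+xml" error)
a4=$(present_cookie_action "" "text/html" false)
echo "$a1 $a2 $a3 $a4"
```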
Viktor Barzin	fad10a8707	workstation: fix new-user .env clobber — env_set preserves CLAUDE_CODE_OAUTH_TOKEN The port-write used '>' (overwrite), wiping the token injected earlier in the same run for a NEW user (existing users like anca survived only because their .env already had the T3_PORT line). New env_set() does update-or-append per key, preserving others. Verified end-to-end: throwaway t3probe provisioned from scratch -> .env has both T3_PORT + CLAUDE_CODE_OAUTH_TOKEN -> claude -p AUTHOK. So all new non-admins now authenticate automatically. NOT pushed (shared-tree divergence hold). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
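The update-or-append behaviour that fad10a8707 attributes to `env_set` can be sketched like this. It is an illustration under assumptions, not the repository's function: the argument order, the temporary-file approach, and the example values (`example-token`, ports 3001 and 3002) are all made up here, and the key is assumed to contain no regex metacharacters.

```shell
#!/usr/bin/env bash
# Sketch: set KEY=VALUE in an env file without touching other keys.

env_set() {
  # Usage: env_set FILE KEY VALUE
  local file=$1 key=$2 value=$3 tmp
  touch "$file"
  tmp=$(mktemp)
  # Keep every line that is not an assignment to KEY.
  # grep exits 1 when nothing is kept (e.g. empty file), hence `|| true`.
  grep -v "^${key}=" "$file" > "$tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Write back in place so the file keeps its owner and mode.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

env_file=$(mktemp)
env_set "$env_file" CLAUDE_CODE_OAUTH_TOKEN "example-token"
env_set "$env_file" T3_PORT 3001   # append: the token line survives
env_set "$env_file" T3_PORT 3002   # update: still exactly one T3_PORT line
cat "$env_file"
```

A plain `echo "T3_PORT=..." > file` would have replaced the whole file, which is the clobber the commit describes for new users.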
Viktor Barzin	eeadf0f85d	workstation: share admin Claude subscription with non-admins (CLAUDE_CODE_OAUTH_TOKEN) Non-admins without their own ~/.claude login get the shared long-lived sk-ant-oat01 token injected into their t3-serve env, so their agent authenticates against the admin's subscription. setup-devvm.sh stages it from Vault secret/workstation.claude_oauth_token (root-readable); the provisioner's install_user_claude_token injects per-user, if-absent (never clobbers emo's own login). Live-fixed anca (verified AUTHOK); this codifies it for reproducibility + future users. NOT pushed (shared-tree divergence hold). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	fbcc330214	workstation: v2 membership implementation plan [ci skip] 8 tasks: engine derive_os_user + roster_from_members (TDD); read-only Authentik token (TF); setup-devvm.sh stages it; provisioner sources T3 Users members from the Authentik API (replaces roster.yaml); Authentik-managed membership + legacy os_user attributes; retire roster.yaml; e2e add/remove smoke. Pairs with the 2026-06-09 design doc. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	48013a4a92	feat(tts): Chatterbox TTS stack + off-peak T4 gate, wire tripit narration [ci skip] New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible /v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on 8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/. Option A off-peak control (no VRAM isolation on the time-sliced T4 — see post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00 ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at 06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a skipped/aborted window backfills next time. SA+Role+RoleBinding grant the CronJobs deployments/scale (nextcloud-watchdog pattern). Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the `tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure — never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses the base exclude list (tts untouched there). tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to local.app_env (no token — ClusterIP only). No tripit code change. Image build is documented in stacks/tts/README.md (devnen cu128 target -> forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline (large CUDA image + needs the upstream repo). NOT APPLIED — review branch only. Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the measured chatterbox-multilingual T4 peak during the first bake. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
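The free-VRAM preflight stated in 48013a4a92 (free = 16GiB - used, pass when free >= floor) reduces to a small integer check. The sketch below is not the CronJob's script: `vram_preflight` is a hypothetical name, and the used-bytes figures are invented inputs standing in for the summed `gpu_pod_memory_used_bytes` value.

```shell
#!/usr/bin/env bash
# Sketch: decide whether the off-peak window may scale the Deployment up.

GiB=$((1024 * 1024 * 1024))

vram_preflight() {
  # $1 = bytes currently used on the card, $2 = required free floor in bytes
  local total=$((16 * GiB))
  local free=$(( total - $1 ))
  [ "$free" -ge "$2" ]
}

floor=$((6 * GiB))   # the commit's default for var.vram_free_floor_bytes

if vram_preflight $((9 * GiB)) "$floor"; then
  echo "9GiB used: 7GiB free, window may open"
fi
if ! vram_preflight $((11 * GiB)) "$floor"; then
  echo "11GiB used: 5GiB free, window is skipped"
fi
```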
viktor	b1a6391a4d	docs(ci-cd): tripit auto-deploy (GHA->Woodpecker 167) + svu semver in GHA [ci skip]	2026-06-09 19:41:08 +00:00
Viktor Barzin	68a237faf7	workstation: skel start-claude.sh inherits managed default model (drop hardcoded --model) The per-user launcher hardcoded --model claude-opus-4-8; an explicit --model flag overrides the managed default in /etc/claude-code/managed-settings.json (claude-fable-5). Dropping it lets emo and all new accounts inherit the org default (per-session /model still works). Deployed to /etc/skel and emo live copy in the same change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 19:35:29 +00:00


4135 commits