Forward the viktorbarzin.me:53 pod block to the Technitium ClusterIP
(10.96.0.53, same as the .lan block) instead of 8.8.8.8/1.1.1.1. Pods
become ordinary internal clients (CNAME -> apex -> live Traefik LB;
mail -> 10.0.20.1), fixing the 27 non-proxied [External] uptime-kuma
monitors that rode the TP-Link NAT loopback (hard-down since 06-09;
loopback refuses flows whose source equals the reflection target, which
all pfSense-SNAT'd cluster traffic does).
Enabled by re-testing a stale premise: on k8s 1.34 pods DO reach the
ETP=Local Traefik LB IP (kube-proxy short-circuits in-cluster traffic
to LB IPs; verified from pods on three non-Traefik nodes) — re-verify
after major k8s upgrades; canary = [External] fleet going red. The
NAT-layer alternatives (pfSense rdr, SNAT-drop) were rejected: both
fight return-path asymmetry and deepen TP-Link dependency.
Verified in-pod: immich -> .203 + HTTPS 200, mail -> 10.0.20.1,
forgejo -> Traefik ClusterIP (pin kept for Technitium-outage
resilience). Proxied [External] monitors now test the internal path —
true edge fidelity moves to the external vantage (ha-london, next fix).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor's claudeMd edits must keep reaching every user now that emo is
out of the shared tree. Two reconciler additions:
- sync_managed_config: installs scripts/workstation/managed-settings.json
to /etc/claude-code whenever the repo copy changes — editing the
org claudeMd is now edit + commit, no manual install step
- refresh_codex_mirror: regenerates each user's ~/.codex/AGENTS.md
(static mirror of the claudeMd; header-guarded so user-customized
files are never clobbered)
Verified live: corrupted emo's mirror -> reconcile restored it;
wizard's stale mirror refreshed; in-sync managed config no-ops.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Local checkout carried the 2026-06-10 DNS/registry architecture series
(pfSense forward-zone, CoreDNS viktorbarzin.me:53 carve-out, nodes
stock) + vzdump/nfs-mirror/workstation-rebuild commits that never
reached the canonical remote, while forgejo master received the
emo-access series via isolated worktrees. Viktor asked to merge.
Conflict resolutions (newest iteration wins in each file):
- stacks/forgejo/cleanup.tf: LOCAL — dry_run=true (2026-06-10 revert
after live retention orphaned OCI indexes; remote had 06-09 enable)
- .claude/CLAUDE.md, docs/architecture/backup-dr.md: LOCAL — final
registry/DNS architecture + implemented vzdump alerts
- scripts/workstation/setup-devvm.sh: LOCAL — pinned-version,
reproducible-rebuild refactor (kubelogin pin, restructured staging)
- scripts/workstation/managed-settings.json: FORGEJO — the
allow-then-audit claudeMd (matches /etc deployment byte-for-byte)
- scripts/t3-provision-users.sh: FORGEJO comment; refresh_locked_clone
intact
[ci skip]: all stack changes in the local lineage were applied live
this morning — CI would re-walk 100+ stacks via the modules/ fallback
for zero state change.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Emo's first direct pushes surfaced three latent CI issues, all fixed
out-of-band today and recorded here: webhook deliveries to
ci.viktorbarzin.me timing out on the public-IP hairpin (hook now
targets the in-cluster woodpecker-server service), repo 82 registered
without the repo-scoped secret set (cloned from repo 1 in the DB), and
empty commits compiling every workflow so missing secrets hard-error.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Repo-82 Woodpecker secrets were missing (repo-1 set cloned over) and
the webhook now targets the in-cluster service. This push should run
the full pipeline: Slack audit ping + no-op apply.
Second verification: the Forgejo->Woodpecker webhook was timing out on
the public-IP hairpin (first test push fired no pipeline), so it now
targets the in-cluster Woodpecker service. This push should produce a
pipeline with the notify-nonadmin-push Slack step.
Viktor granted emo direct push to master on 2026-06-10 — any change
allowed, tracked via commit messages + the Slack audit feed. This
empty commit verifies the whitelist and exercises the new
notify-nonadmin-push CI step end-to-end.
Viktor: emo may make any change; what matters is tracking what changed
and why. ebarzin added to master push+merge whitelists (force-push
stays disabled — append-only history). Tracking enforced three ways:
- agent instructions (managed claudeMd + AGENTS.md): commit body MUST
carry the user's plain-language intent; commits land on master
directly; [ci skip] forbidden for non-admins
- new notify-nonadmin-push step in .woodpecker/default.yml: Slack
message for every non-admin master push (admin pushes silent)
- PR flow remains the fallback for non-whitelisted users
Accepted consequence (informed): emo's pushes auto-apply changed
stacks via CI. Offboard runbook gains whitelist-removal step.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
emo can't use git — his agent must do all VCS mechanics invisibly.
Managed claudeMd (every session, top precedence) now instructs agents:
commit -> push <os-user>/<topic> branch -> open PR via Forgejo API
(user's PAT from ~/.git-credentials) -> back to clean master -> tell
the user in plain words it's submitted for review. AGENTS.md carries
the full recipe with the curl call.
Verified live as emo: PR #1 opened (HTTP 201, write:repository scope
suffices) and closed via his PAT. Deployed to
/etc/claude-code/managed-settings.json; codex AGENTS.md mirrors for
emo + ancamilea regenerated from the new claudeMd.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Non-admins (emo) need current master without manual pulls. Two layers:
- t3-provision-users reconcile gains refresh_locked_clone: fetch all
remotes + ff-only master, guarded (on master, clean tree, upstream
set); dirty/diverged clones are left alone with a WARN.
- start-claude.sh freshens ~/code at session launch, 15s-capped so an
offline remote never delays the session.
Verified live on emo's clone: stale clone ff'd to tip by the
reconciler; launcher snippet ff's when clean and refuses while a
dirty file exists. Deployed to /usr/local/bin/t3-provision-users,
/etc/skel/start-claude.sh, and emo's launcher.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ADR-0004's premise was wrong: pushing master fires the Woodpecker apply
pipeline (require_approval=forks only), so master pushes ARE deploys.
Added Forgejo branch protection on master (push/merge whitelist=viktor,
deploy keys allowed); non-admins contribute via branches + PRs.
emo (ebarzin): write collaborator on viktor/infra, PAT in
~/.git-credentials, forgejo remote + upstream in his locked clone.
Phase-5 finished: code-shared removed; ~/.claude symlinks kept (they
ARE the skel shared-base mechanism — plan step 4c obsolete).
Offboard runbook: revoke PAT + collaborator + group steps added.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The keep-set (newest 10 versions + latest + *cache* tags) treats
multi-arch/attestation index CHILDREN — separate untagged sha256
versions — as deletable: for images not rebuilt recently they sort
outside the newest-10 window and were pruned while their kept parent
index survived. kms-website :latest and :dfc83fb children 404'd
(RegistryManifestIntegrityFailure, caught by forgejo-integrity-probe
within hours; deployed tag a794d1a unaffected).
Healed: :latest re-pointed at the intact a794d1a index (also the
newest commit), corrupt :dfc83fb version deleted, probe re-run clean
(0 failures / 22 repos / 63 tags / 59 indexes). DRY_RUN=true applied
live. Re-enable only with a container-aware keep-set — options in the
post-mortem.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
vzdump-vms pushes vzdump_last_{run,success}_timestamp + vzdump_last_status to
Pushgateway job vzdump-backup, but nothing alerted on them — a stopped/failing VM
backup would be silent (exactly how the nfs-mirror reaping went unnoticed until I
re-verified). Add the trio to the 3-2-1 group in prometheus_chart_values.tpl,
mirroring the LVM/pfSense/nfs-mirror alerts. Stale = >~50h since last success.
NOT [ci]-applied: this is a Terraform stack change — arms on the next
`scripts/tg apply` of the monitoring stack (metrics already flow, so it arms
immediately once applied). Admin-gated apply per org policy.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
setup-devvm.sh now needs golang-go (builds t3-dispatch in section 9) and uses unzip
(kubelogin extraction); neither was in the manifest, so a fresh box would skip the
t3-dispatch build. Also add build-essential (cgo / npm native modules) + core tools
that were manually-installed but uncaptured (rsync, wget, tree, shellcheck). Noted
gh as non-apt (GitHub's own repo). All verified to resolve in apt.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The t3 system units (t3-serve@, t3-autoupdate, t3-backup-state, t3-provision-users,
t3-dispatch) + the t3-dispatch Go binary + t3-mint + the sudoers grant were all
hand-scp'd and would NOT survive a fresh devvm. setup-devvm.sh now installs + enables
them: build-if-absent for the Go binary, visudo-validated sudoers (a malformed
/etc/sudoers.d file breaks all sudo), timers self-heal, t3-dispatch system account
created if absent. t3-serve@ stays a per-user template enabled by the provisioner;
the ttyd terminal-lobby chain ships from its own repo (viktor/terminal-lobby).
Verified: shellcheck clean, go build compiles, visudo parses the sudoers, units parse.
NOT run live (would re-assert apt/npm on the shared host) — exercised on next rebuild.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
nfs-mirror does `rsync -rlt --delete /srv/nfs/ -> /mnt/backup/`; any /mnt/backup
dir with no /srv/nfs counterpart is an orphan and gets --delete'd. vzdump-vms
(added yesterday) writes /mnt/backup/vzdump/, which wasn't excluded — so the
02:00 nfs-mirror run silently deleted both successful 40G devvm images
(verified: dir gone, 40G freed, despite status=0 success logs). Add
--exclude='/vzdump/' alongside the existing pvc-data/pfsense/pve-config/
sqlite-backup excludes that exist for exactly this reason. TDD-proven with an
isolated rsync --delete -n -v. backup-dr.md notes the dependency.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Round 3 of the forgejo-pull hairpin fix (per Viktor: no per-node
customization — split-brain lives in the DNS infra):
- pfSense Unbound domain override viktorbarzin.me -> Technitium
10.0.20.201 (applied via php write_config, backup on-box). Every
Unbound client on every VLAN now gets the internal split-horizon
answers (live Traefik IP via apex CNAME) with zero per-host config.
- CoreDNS carve-out (TF, applied): dedicated viktorbarzin.me:53 block —
forgejo pinned to Traefik ClusterIP via data source (pods cannot reach
the ETP=Local LB IP pfSense now returns), all other .me names kept on
public resolvers (pods' pre-existing behavior). Replaces the .:53
forgejo rewrite.
- Removed the same-day resolved routing-domain drop-ins from all 7 nodes;
node5/6 link DNS repointed Technitium -> pfSense (netplan + qm 205/206)
for fleet parity; cloud-init no longer writes any DNS drop-ins.
- Docs: dns.md, pfsense-unbound runbook (override + rollback), registry
bullet, post-mortem final-architecture addendum.
Verified: nodes resolve forgejo -> .203 via pfSense, crictl pull OK,
pods resolve forgejo -> ClusterIP / others -> public, mail record works,
.lan zone unaffected.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Supersedes this morning's per-node /etc/hosts pin (no hardcoded service
IPs on nodes, per Viktor). Technitium's split-horizon zone already
resolves forgejo.viktorbarzin.me -> CNAME apex -> live Traefik LB IP
(ingress-dns-sync auto-CNAMEs every ingress host; apex drift probe
alerts) -- the nodes just never queried it. Rolled the devvm's
systemd-resolved routing-domain pattern (~viktorbarzin.me ->
10.0.20.201) to all 7 nodes, removed the pins, verified getent +
crictl pull via pure DNS.
Also demoted node5/6's cloud-init global-dns.conf (DNS=8.8.8.8 1.1.1.1)
to FallbackDNS-only: public servers in the global set race the routing
domain. Its justification ("Technitium NXDOMAINs forgejo") was obsolete
-- exactly the stale comment that pointed new nodes at the hairpin.
hosts.toml mirror kept but documented as vestigial (Traefik 404s
bare-IP requests; registry auth realm is an absolute URL).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
tuya-bridge was down 7.5h (ImagePullBackOff on k8s-node3): fresh kubelet
pulls of forgejo.viktorbarzin.me images depended on the intermittently
broken public-IP hairpin. The containerd hosts.toml mirror cannot keep
pulls internal on its own — Traefik 404s its bare-IP requests (no
Host/SNI match) and the registry Bearer realm is an absolute public URL
fetched outside the mirror. Third incident of this class (buildkit
06-04, tripit/devvm 06-09).
Fix: /etc/hosts pin 10.0.20.203 forgejo.viktorbarzin.me on every node —
covers resolve + token + blob legs with correct SNI and valid cert.
Applied live to all 7 nodes; persisted in the cloud-init bootstrap and
the existing-node rollout script. Docs updated (registry bullet, dns.md
hairpin scope + stale .200 literals, runbook) + post-mortem.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- claude-code: detect via `npm ls -g` not `command -v claude` — the admin's
personal ~/.local/bin/claude shadowed the PATH check, so the system-wide
install never ran (/usr/lib/node_modules/@anthropic-ai empty, no /usr/bin/claude;
fresh non-admins had no claude). Found during the devvm reproducibility audit.
- kubelogin: pin v1.36.2 instead of releases/latest/download, so two fresh boxes
built weeks apart are byte-identical.
- /etc/t3-serve: mkdir before the token writes (install -m doesn't create the
parent — section 8 would fail on a fresh box).
- codex shared auth: stage /opt/codex-shared/auth.json from Vault
secret/workstation.codex_shared_auth_json (key already existed but nothing
consumed it — was a manual step lost on rebuild), mirroring the Claude token.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
First live run produced a valid 40G dump and logged status=0, but the service
exited 1/FAILURE: cleanup() used `[ -n "$KILLED" ] && push_metrics 2 0`, and a
bash EXIT trap whose LAST command returns non-zero overrides the script's
`exit 0`. With KILLED empty the && short-circuits -> returns 1 -> a successful
backup is marked failed (would trip a vzdump staleness/failure alert). Switch to
daily-backup's `if…fi` idiom (returns 0 when not killed). Bug reproduced + fix
verified locally; redeployed to PVE + reset-failed.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The hand-managed Linux VMs (not in Terraform) were never imaged: the
PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost
devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has
no remote).
vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of
VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the
monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent
enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd.
Pushgateway job vzdump-backup.
Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image
layer + protection matrix), infra CLAUDE.md, AGENTS.md.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Completes the 0.0.26 adoption prepared in fcb84ce0 (version-agnostic
dispatch browser-session/bootstrap fallback + Gate-2 real pairing
health-check + per-user state.sqlite backup). 0.0.26 verified
end-to-end on the devvm: emo + ancamilea auto-pair via t3-dispatch
(302 + Set-Cookie t3_session) after migrating state.sqlite 30->32;
pre-cutover backups in /var/backups/t3-state. Brings claude-fable-5
into the t3 model picker.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a
future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):
- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
/api/auth/bootstrap on 404 — one binary pairs across both versions and any
rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
before, green after). Built, deployed, verified live on 0.0.24 (all three
users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie
handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad
build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
state.sqlite (was the only copy, unbacked) -> the one-way forward schema
migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The t3-autoupdate timer (re-enabled by the provisioner's step 5b with
`--now`, which fires the missed daily job immediately on a Persistent
timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema
migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions
role->scopes, +proof_key_thumbprint) AND changed the bootstrap API,
breaking t3-mint/pairing for ALL devvm users (pair prompt, no session).
- t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a
nightly tracker -- re-asserts the pin (a no-op when correct).
- t3-provision-users.sh step 5b: drop `--now` (it triggered the
immediate missed-job run that pulled the bad build).
- setup-devvm.sh: install pinned t3@0.0.24 at machine setup.
- unit Descriptions + service-catalog reflect the pin.
- post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md.
Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled
the (now-pinned) enforcer, reset the 2 new users' disposable DBs,
surgically reverted wizard's auth tables to level-30 (96 threads + live
session preserved). All users verified 302 + t3_session.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09
auth-schema rollback wiped all server-side sessions, browsers kept dead
30-day t3_session cookies; the dispatcher proxied them straight through
and t3 rendered its pair page ("all users must pair again").
Now a present cookie on a top-level document navigation is validated via
the instance's /api/auth/session and re-paired on authenticated:false.
Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html)
so XHR/asset/WebSocket sub-requests are never answered with a 302; fails
open (proxy through) on any validation error. Unit + handler tests added.
[ci skip]
Co-Authored-By: Claude <noreply@anthropic.com>
The port-write used '>' (overwrite), wiping the token injected earlier in the same run for a NEW user (existing users like anca survived only because their .env already had the T3_PORT line). New env_set() does update-or-append per key, preserving others. Verified end-to-end: throwaway t3probe provisioned from scratch -> .env has both T3_PORT + CLAUDE_CODE_OAUTH_TOKEN -> claude -p AUTHOK. So all new non-admins now authenticate automatically. NOT pushed (shared-tree divergence hold).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Non-admins without their own ~/.claude login get the shared long-lived sk-ant-oat01 token injected into their t3-serve env, so their agent authenticates against the admin's subscription. setup-devvm.sh stages it from Vault secret/workstation.claude_oauth_token (root-readable); the provisioner's install_user_claude_token injects per-user, if-absent (never clobbers emo's own login). Live-fixed anca (verified AUTHOK); this codifies it for reproducibility + future users. NOT pushed (shared-tree divergence hold).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible
/v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on
8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/.
Option A off-peak control (no VRAM isolation on the time-sliced T4 — see
post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London
CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00
ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from
gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields
the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at
06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a
skipped/aborted window backfills next time. SA+Role+RoleBinding grant the
CronJobs deployments/scale (nextcloud-watchdog pattern).
Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the
`tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox
keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure
— never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses
the base exclude list (tts untouched there).
tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to
local.app_env (no token — ClusterIP only). No tripit code change.
Image build is documented in stacks/tts/README.md (devnen cu128 target ->
forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline
(large CUDA image + needs the upstream repo). NOT APPLIED — review branch only.
Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the
measured chatterbox-multilingual T4 peak during the first bake.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The per-user launcher hardcoded --model claude-opus-4-8; an explicit --model flag overrides the managed default in /etc/claude-code/managed-settings.json (claude-fable-5). Dropping it lets emo and all new accounts inherit the org default (per-session /model still works). Deployed to /etc/skel and emo live copy in the same change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds a model key (claude-fable-5) to the machine-wide managed-settings.json (installed to /etc/claude-code/ by setup-devvm.sh). Sets the default model for every Claude Code session on the devvm (CLI + t3 web) at top settings precedence; per-session /model and explicit --model flags still override. The org claudeMd block is unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Heavy in-cluster builds (e.g. tripit buildkit) were taking Forgejo down via
two vectors. Fixes both, without moving Forgejo off the sdc HDD (code-oflt
deferred):
- Memory 1Gi -> 3Gi (requests=limits). Forgejo was OOMKilled (exit 137) under
registry-push load; VPA upperBound ~1.5Gi was suppressed by the 1Gi cap it
kept OOMing against. Size for the push spike.
- Activate registry retention (DRY_RUN false). Verified the delete list
against all running viktor/* images first: 0 running images affected.
Pruned 478 -> 161 package versions; PVC was at its 50Gi autoresize ceiling.
- FIX broken retention auth: the cleanup PAT was ci-pusher's, but Forgejo
scopes container packages per-user, so DELETE on viktor/* returned 403 (the
dry-run only did GETs, hiding it). Repointed forgejo_cleanup_token to
viktor's write:package PAT. Retention had never actually worked.
- Protect buildkit *cache* tags from retention (cleanup.sh keep-set) so the
gentler-builds layer cache survives daily pruning.
[ci skip] — already applied via scripts/tg.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
First live run produced a valid 40G dump and logged status=0, but the service
exited 1/FAILURE: cleanup() used `[ -n "$KILLED" ] && push_metrics 2 0`, and a
bash EXIT trap whose LAST command returns non-zero overrides the script's
`exit 0`. With KILLED empty the && short-circuits -> returns 1 -> a successful
backup is marked failed (would trip a vzdump staleness/failure alert). Switch to
daily-backup's `if…fi` idiom (returns 0 when not killed). Bug reproduced + fix
verified locally; redeployed to PVE + reset-failed.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The hand-managed Linux VMs (not in Terraform) were never imaged: the
PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost
devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has
no remote).
vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of
VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the
monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent
enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd.
Pushgateway job vzdump-backup.
Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image
layer + protection matrix), infra CLAUDE.md, AGENTS.md.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Completes the 0.0.26 adoption prepared in fcb84ce0 (version-agnostic
dispatch browser-session/bootstrap fallback + Gate-2 real pairing
health-check + per-user state.sqlite backup). 0.0.26 verified
end-to-end on the devvm: emo + ancamilea auto-pair via t3-dispatch
(302 + Set-Cookie t3_session) after migrating state.sqlite 30->32;
pre-cutover backups in /var/backups/t3-state. Brings claude-fable-5
into the t3 model picker.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a
future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):
- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
/api/auth/bootstrap on 404 — one binary pairs across both versions and any
rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
before, green after). Built, deployed, verified live on 0.0.24 (all three
users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie
handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad
build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
state.sqlite (was the only copy, unbacked) -> the one-way forward schema
migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The t3-autoupdate timer (re-enabled by the provisioner's step 5b with
`--now`, which fires the missed daily job immediately on a Persistent
timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema
migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions
role->scopes, +proof_key_thumbprint) AND changed the bootstrap API,
breaking t3-mint/pairing for ALL devvm users (pair prompt, no session).
- t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a
nightly tracker -- re-asserts the pin (a no-op when correct).
- t3-provision-users.sh step 5b: drop `--now` (it triggered the
immediate missed-job run that pulled the bad build).
- setup-devvm.sh: install pinned t3@0.0.24 at machine setup.
- unit Descriptions + service-catalog reflect the pin.
- post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md.
Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled
the (now-pinned) enforcer, reset the 2 new users' disposable DBs,
surgically reverted wizard's auth tables to level-30 (96 threads + live
session preserved). All users verified 302 + t3_session.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09
auth-schema rollback wiped all server-side sessions, browsers kept dead
30-day t3_session cookies; the dispatcher proxied them straight through
and t3 rendered its pair page ("all users must pair again").
Now a present cookie on a top-level document navigation is validated via
the instance's /api/auth/session and re-paired on authenticated:false.
Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html)
so XHR/asset/WebSocket sub-requests are never answered with a 302; fails
open (proxy through) on any validation error. Unit + handler tests added.
[ci skip]
Co-Authored-By: Claude <noreply@anthropic.com>
The port-write used '>' (overwrite), wiping the token injected earlier in the same run for a NEW user (existing users like anca survived only because their .env already had the T3_PORT line). New env_set() does update-or-append per key, preserving others. Verified end-to-end: throwaway t3probe provisioned from scratch -> .env has both T3_PORT + CLAUDE_CODE_OAUTH_TOKEN -> claude -p AUTHOK. So all new non-admins now authenticate automatically. NOT pushed (shared-tree divergence hold).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Non-admins without their own ~/.claude login get the shared long-lived sk-ant-oat01 token injected into their t3-serve env, so their agent authenticates against the admin's subscription. setup-devvm.sh stages it from Vault secret/workstation.claude_oauth_token (root-readable); the provisioner's install_user_claude_token injects per-user, if-absent (never clobbers emo's own login). Live-fixed anca (verified AUTHOK); this codifies it for reproducibility + future users. NOT pushed (shared-tree divergence hold).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible
/v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on
8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/.
Option A off-peak control (no VRAM isolation on the time-sliced T4 — see
post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London
CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00
ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from
gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields
the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at
06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a
skipped/aborted window backfills next time. SA+Role+RoleBinding grant the
CronJobs deployments/scale (nextcloud-watchdog pattern).
Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the
`tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox
keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure
— never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses
the base exclude list (tts untouched there).
tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to
local.app_env (no token — ClusterIP only). No tripit code change.
Image build is documented in stacks/tts/README.md (devnen cu128 target ->
forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline
(large CUDA image + needs the upstream repo). NOT APPLIED — review branch only.
Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the
measured chatterbox-multilingual T4 peak during the first bake.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The per-user launcher hardcoded --model claude-opus-4-8; an explicit --model flag overrides the managed default in /etc/claude-code/managed-settings.json (claude-fable-5). Dropping it lets emo and all new accounts inherit the org default (per-session /model still works). Deployed to /etc/skel and emo live copy in the same change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds a model key (claude-fable-5) to the machine-wide managed-settings.json (installed to /etc/claude-code/ by setup-devvm.sh). Sets the default model for every Claude Code session on the devvm (CLI + t3 web) at top settings precedence; per-session /model and explicit --model flags still override. The org claudeMd block is unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>