The Actual web app boots with ~70 near-parallel requests (55
/data/migrations/*.sql + statics, all served cache-control max-age=0 so
every page load re-validates them). The shared rate-limit middleware
(average 10, burst 50) 429s the tail of that storm, so every cold boot
shows 'Server returned an error while checking its status' and every
load stalls in retry backoff — measured up to 5min stalls when two
loads from one IP overlap. Viktor asked to relax the limit after the
anca slow-load investigation (beads code-7zv).
Same pattern as immich: dedicated actualbudget-rate-limit middleware in
the traefik stack, budget-* ingresses opt out of the default via
skip_default_rate_limit + extra_middlewares.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Emo's Claude agent was given root SSH to the Proxmox host (`ssh pve`, dedicated
shared-root key emo-pve-agent@devvm) so he can manage the host — e.g. the R730
fan daemon — through his agent. To keep an audit trail of what that agent does,
and to feed the long-pending Wave-1 S1 security rule, the PVE host now ships its
systemd journal to cluster Loki:
- snoopy logs every execve() to journald (identifier=snoopy), enabled via
/etc/ld.so.preload; config scripts/pve-snoopy.ini.
- promtail v3.5.1 (amd64) ships /var/log/journal to Loki as {job="pve-journal"}
(full host journal; filter identifier="snoopy" for the command audit), and
relabels sshd auth to {job="sshd-pve"} — which ACTIVATES S1 (it was PENDING
only for lack of this shipper). Config/unit: scripts/pve-promtail.{yaml,service}.
S1 won't false-fire on legitimate access: the devvm SNATs through pfSense to
192.168.1.2, which is already in the S1 source-IP allowlist.
Loki is reached via an /etc/hosts pin (10.0.20.203 loki.viktorbarzin.lan);
follow-up noted to register a Technitium CNAME so it auto-tracks LB renumbers.
Host pieces are hand-managed (not Terraform), like fan-control and the rpi-sofia
promtail — these files are the source of truth. Docs updated: security.md
(S1 LIVE) and monitoring.md ("External host: pve").
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked to verify the book-plotting push->build->deploy chain.
The chain itself is healthy, but the Terraform baseline image said
ancamilea/book-plotter:latest while CI (GHA on
PassionProjectsAnca/Plotting-Your-Dream-Book) builds and deploys
viktorbarzin/book-plotter:<sha8> + :latest — a from-scratch apply
would have resurrected a stale March image. Baseline now
viktorbarzin/book-plotter:latest. No live change: the running tag is
CI-owned via ignore_changes, plan confirms the image attr is ignored.
[ci skip] deliberately: plan shows UNRELATED pre-existing drift on
this stack (live ns labels managed-by=vault-user-onboarding +
resource-governance/custom-quota=true would be stripped; deployment
keel.sh/policy=patch annotations removed) — auto-applying that needs
its own reviewed pass.
Viktor asked to log Anca into her GitHub account so she can develop on
the devvm and deploy her apps through the existing CI/CD. Her GitHub
repos (Plotting-Your-Dream-Book, travel, My-Wardrobe — now cloned into
her ~/code workspace) default to main, and the launcher freshen only
fast-forwarded master, silently skipping them. ff the current branch's
upstream instead — same safety gates (on a branch, clean tree, upstream
configured, ff-only). Single-layout infra clones behave identically.
[ci skip]
The roundtrip probe kept failing after the SPF/MX fix: rspamd's actual
junk-score driver was R_DKIM_PERMFAIL(+4.5) on selector brevo2 — Brevo
signs with brevo1/brevo2._domainkey, which are CNAMEs to
b{1,2}.viktorbarzin-me.dkim.brevo.com in public DNS and were absent
from the internal zone (the earlier existence check used ANY queries,
which Cloudflare refuses per RFC 8482 — false negative). The DKIM
permfail also cascaded into DMARC_POLICY_SOFTFAIL(+1.5), totalling the
6.09/6.0 junk threshold; sieve filed probes into \Junk where the INBOX
poll never finds them.
ingress-dns-sync now maintains both selector CNAMEs. Ops notes: rspamd
caches DNS (restart to flush after zone fixes); CoreDNS denial cache
holds NXDOMAINs up to 300s. Verified: roundtrip SUCCESS in 20.5s.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked to restructure Anca's setup: her ~/code WAS the infra clone
itself; he wants ~/code to be the directory where all her project repos
(tripit etc.) live side by side, with infra moved to a subdirectory.
- roster.yaml gains per-user 'code_layout: single|workspace' + 'repos',
validated + derived by roster_engine.py (12 new tests, 40 total).
- t3-provision-users reconcile: auto-migrates a single-layout ~/code to
~/code/infra (running processes follow the moved inode), hoists nested
project clones to the workspace root, clones roster repos from Forgejo
AS the user (their PAT makes private repos work), and wires the
documented forgejo remote + forgejo/master upstream into clones that
predate that contract.
- Fixed a latent TSV bug: empty jq @tsv fields collapse under tab-IFS
read, shifting later fields left (groups was only safe by being the
last field) — emit '-' sentinels instead.
- start-claude.sh session freshen is layout-aware (freshens each repo
under ~/code for workspace users).
- managed claudeMd + AGENTS.md non-admin recipe + multi-tenancy.md
updated in the same change.
Applied live: ancamilea = workspace (infra at ~/code/infra, her existing
tripit clone hoisted to ~/code/tripit, master upstream switched to
forgejo/master); emo stays single layout, untouched. [ci skip]
Viktor asked that every feature task be developed in its own git worktree
and merged into master when done, enabling multiple agents to work the
same project concurrently. Encode the org rule in the managed claudeMd
(self-deploys to /etc via the hourly reconcile), add the worktree-first
paragraph to the AGENTS.md non-admin landing recipe, and gitignore
.worktrees/ so per-feature worktrees can live at the repo root. Full
lifecycle: ~/.claude/rules/execution.md §3.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two fixes from the post-DNS-internalization health sweep:
1. The internal viktorbarzin.me zone served only ingress A/CNAME records.
Since the mailserver pods now resolve the domain through it (CoreDNS
viktorbarzin.me:53 -> Technitium, 59a531b8), rspamd's SPF checks on
inbound @viktorbarzin.me mail saw SPF=none and quarantined it — the
Brevo email-roundtrip probe failed from the 16:20 run onward
(EmailRoundtripFailing/Stale). The ingress-dns-sync CronJob now also
maintains the static mail-auth records (SPF, brevo-code TXT, MX;
DMARC + DKIM were already present), idempotently. Principle: the
internal zone must be a SUPERSET of the public zone for every record
type internal clients consume. Verified in-pod: all four types
resolve; roundtrip re-probe green.
2. cluster_healthcheck #30 queried instant `up`, which goes stale for
~5 of every 10 minutes on the deliberate 10m redfish-idrac remnant
job -> intermittent false "redfish-idrac=missing". Now uses
last_over_time(up[15m]) — same answers for fast jobs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor's correction: this feature is about the tmux web-terminal
sessions, not t3 — t3 auto-saves its own threads (~/.t3 state +
daily t3-backup-state). Renamed t3-tmux-sessions -> tmux-persist
(units tmux-persist-save.timer / tmux-persist-restore.service, state
/var/lib/tmux-persist), header rescoped to say exactly that. Same
mechanism, correct taxonomy. Old units removed, state migrated,
re-verified live (5 emo + 3 wizard sessions snapshotted).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor: emo's open web-terminal sessions must persist across reboots.
Claude conversations were already durable on disk; the volatile part
was the tmux wiring (which named session runs which conversation).
t3-tmux-sessions save (5-min timer) snapshots every roster user's
sessions to /var/lib/t3-tmux-state/<user>.tsv — conversation uuid
taken from argv --resume (self-sustaining once restored) or the
newest transcript in the cwd-slug project dir created after process
start (fresh launcher sessions; claude does NOT hold its transcript
fd open, so fd-sniffing was a dead end). t3-tmux-sessions restore
(boot oneshot, also safe after partial loss) recreates missing
sessions with claude --resume <uuid>. Reconciler self-heals both
units' enablement.
Verified live: emo's 5 sessions snapshotted with correct uuids;
killed R730-cooling -> restore brought it back resuming the same
conversation (context meter identical); other sessions untouched.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Forward the viktorbarzin.me:53 pod block to the Technitium ClusterIP
(10.96.0.53, same as the .lan block) instead of 8.8.8.8/1.1.1.1. Pods
become ordinary internal clients (CNAME -> apex -> live Traefik LB;
mail -> 10.0.20.1), fixing the 27 non-proxied [External] uptime-kuma
monitors that rode the TP-Link NAT loopback (hard-down since 06-09;
loopback refuses flows whose source equals the reflection target, which
all pfSense-SNAT'd cluster traffic does).
Enabled by re-testing a stale premise: on k8s 1.34 pods DO reach the
ETP=Local Traefik LB IP (kube-proxy short-circuits in-cluster traffic
to LB IPs; verified from pods on three non-Traefik nodes) — re-verify
after major k8s upgrades; canary = [External] fleet going red. The
NAT-layer alternatives (pfSense rdr, SNAT-drop) were rejected: both
fight return-path asymmetry and deepen TP-Link dependency.
Verified in-pod: immich -> .203 + HTTPS 200, mail -> 10.0.20.1,
forgejo -> Traefik ClusterIP (pin kept for Technitium-outage
resilience). Proxied [External] monitors now test the internal path —
true edge fidelity moves to the external vantage (ha-london, next fix).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor's claudeMd edits must keep reaching every user now that emo is
out of the shared tree. Two reconciler additions:
- sync_managed_config: installs scripts/workstation/managed-settings.json
to /etc/claude-code whenever the repo copy changes — editing the
org claudeMd is now edit + commit, no manual install step
- refresh_codex_mirror: regenerates each user's ~/.codex/AGENTS.md
(static mirror of the claudeMd; header-guarded so user-customized
files are never clobbered)
Verified live: corrupted emo's mirror -> reconcile restored it;
wizard's stale mirror refreshed; in-sync managed config no-ops.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Local checkout carried the 2026-06-10 DNS/registry architecture series
(pfSense forward-zone, CoreDNS viktorbarzin.me:53 carve-out, nodes
stock) + vzdump/nfs-mirror/workstation-rebuild commits that never
reached the canonical remote, while forgejo master received the
emo-access series via isolated worktrees. Viktor asked to merge.
Conflict resolutions (newest iteration wins in each file):
- stacks/forgejo/cleanup.tf: LOCAL — dry_run=true (2026-06-10 revert
after live retention orphaned OCI indexes; remote had 06-09 enable)
- .claude/CLAUDE.md, docs/architecture/backup-dr.md: LOCAL — final
registry/DNS architecture + implemented vzdump alerts
- scripts/workstation/setup-devvm.sh: LOCAL — pinned-version,
reproducible-rebuild refactor (kubelogin pin, restructured staging)
- scripts/workstation/managed-settings.json: FORGEJO — the
allow-then-audit claudeMd (matches /etc deployment byte-for-byte)
- scripts/t3-provision-users.sh: FORGEJO comment; refresh_locked_clone
intact
[ci skip]: all stack changes in the local lineage were applied live
this morning — CI would re-walk 100+ stacks via the modules/ fallback
for zero state change.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Emo's first direct pushes surfaced three latent CI issues, all fixed
out-of-band today and recorded here: webhook deliveries to
ci.viktorbarzin.me timing out on the public-IP hairpin (hook now
targets the in-cluster woodpecker-server service), repo 82 registered
without the repo-scoped secret set (cloned from repo 1 in the DB), and
empty commits compiling every workflow so missing secrets hard-error.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Repo-82 Woodpecker secrets were missing (repo-1 set cloned over) and
the webhook now targets the in-cluster service. This push should run
the full pipeline: Slack audit ping + no-op apply.
Second verification: the Forgejo->Woodpecker webhook was timing out on
the public-IP hairpin (first test push fired no pipeline), so it now
targets the in-cluster Woodpecker service. This push should produce a
pipeline with the notify-nonadmin-push Slack step.
Viktor granted emo direct push to master on 2026-06-10 — any change
allowed, tracked via commit messages + the Slack audit feed. This
empty commit verifies the whitelist and exercises the new
notify-nonadmin-push CI step end-to-end.
Viktor: emo may make any change; what matters is tracking what changed
and why. ebarzin added to master push+merge whitelists (force-push
stays disabled — append-only history). Tracking enforced three ways:
- agent instructions (managed claudeMd + AGENTS.md): commit body MUST
carry the user's plain-language intent; commits land on master
directly; [ci skip] forbidden for non-admins
- new notify-nonadmin-push step in .woodpecker/default.yml: Slack
message for every non-admin master push (admin pushes silent)
- PR flow remains the fallback for non-whitelisted users
Accepted consequence (informed): emo's pushes auto-apply changed
stacks via CI. Offboard runbook gains whitelist-removal step.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
emo can't use git — his agent must do all VCS mechanics invisibly.
Managed claudeMd (every session, top precedence) now instructs agents:
commit -> push <os-user>/<topic> branch -> open PR via Forgejo API
(user's PAT from ~/.git-credentials) -> back to clean master -> tell
the user in plain words it's submitted for review. AGENTS.md carries
the full recipe with the curl call.
Verified live as emo: PR #1 opened (HTTP 201, write:repository scope
suffices) and closed via his PAT. Deployed to
/etc/claude-code/managed-settings.json; codex AGENTS.md mirrors for
emo + ancamilea regenerated from the new claudeMd.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Non-admins (emo) need current master without manual pulls. Two layers:
- t3-provision-users reconcile gains refresh_locked_clone: fetch all
remotes + ff-only master, guarded (on master, clean tree, upstream
set); dirty/diverged clones are left alone with a WARN.
- start-claude.sh freshens ~/code at session launch, 15s-capped so an
offline remote never delays the session.
Verified live on emo's clone: stale clone ff'd to tip by the
reconciler; launcher snippet ff's when clean and refuses while a
dirty file exists. Deployed to /usr/local/bin/t3-provision-users,
/etc/skel/start-claude.sh, and emo's launcher.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ADR-0004's premise was wrong: pushing master fires the Woodpecker apply
pipeline (require_approval=forks only), so master pushes ARE deploys.
Added Forgejo branch protection on master (push/merge whitelist=viktor,
deploy keys allowed); non-admins contribute via branches + PRs.
emo (ebarzin): write collaborator on viktor/infra, PAT in
~/.git-credentials, forgejo remote + upstream in his locked clone.
Phase-5 finished: code-shared removed; ~/.claude symlinks kept (they
ARE the skel shared-base mechanism — plan step 4c obsolete).
Offboard runbook: revoke PAT + collaborator + group steps added.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The keep-set (newest 10 versions + latest + *cache* tags) treats
multi-arch/attestation index CHILDREN — separate untagged sha256
versions — as deletable: for images not rebuilt recently they sort
outside the newest-10 window and were pruned while their kept parent
index survived. kms-website :latest and :dfc83fb children 404'd
(RegistryManifestIntegrityFailure, caught by forgejo-integrity-probe
within hours; deployed tag a794d1a unaffected).
Healed: :latest re-pointed at the intact a794d1a index (also the
newest commit), corrupt :dfc83fb version deleted, probe re-run clean
(0 failures / 22 repos / 63 tags / 59 indexes). DRY_RUN=true applied
live. Re-enable only with a container-aware keep-set — options in the
post-mortem.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
vzdump-vms pushes vzdump_last_{run,success}_timestamp + vzdump_last_status to
Pushgateway job vzdump-backup, but nothing alerted on them — a stopped/failing VM
backup would be silent (exactly how the nfs-mirror reaping went unnoticed until I
re-verified). Add the trio to the 3-2-1 group in prometheus_chart_values.tpl,
mirroring the LVM/pfSense/nfs-mirror alerts. Stale = >~50h since last success.
NOT [ci]-applied: this is a Terraform stack change — arms on the next
`scripts/tg apply` of the monitoring stack (metrics already flow, so it arms
immediately once applied). Admin-gated apply per org policy.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
setup-devvm.sh now needs golang-go (builds t3-dispatch in section 9) and uses unzip
(kubelogin extraction); neither was in the manifest, so a fresh box would skip the
t3-dispatch build. Also add build-essential (cgo / npm native modules) + core tools
that were manually-installed but uncaptured (rsync, wget, tree, shellcheck). Noted
gh as non-apt (GitHub's own repo). All verified to resolve in apt.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The t3 system units (t3-serve@, t3-autoupdate, t3-backup-state, t3-provision-users,
t3-dispatch) + the t3-dispatch Go binary + t3-mint + the sudoers grant were all
hand-scp'd and would NOT survive a fresh devvm. setup-devvm.sh now installs + enables
them: build-if-absent for the Go binary, visudo-validated sudoers (a malformed
/etc/sudoers.d file breaks all sudo), timers self-heal, t3-dispatch system account
created if absent. t3-serve@ stays a per-user template enabled by the provisioner;
the ttyd terminal-lobby chain ships from its own repo (viktor/terminal-lobby).
Verified: shellcheck clean, go build compiles, visudo parses the sudoers, units parse.
NOT run live (would re-assert apt/npm on the shared host) — exercised on next rebuild.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
nfs-mirror does `rsync -rlt --delete /srv/nfs/ -> /mnt/backup/`; any /mnt/backup
dir with no /srv/nfs counterpart is an orphan and gets --delete'd. vzdump-vms
(added yesterday) writes /mnt/backup/vzdump/, which wasn't excluded — so the
02:00 nfs-mirror run silently deleted both successful 40G devvm images
(verified: dir gone, 40G freed, despite status=0 success logs). Add
--exclude='/vzdump/' alongside the existing pvc-data/pfsense/pve-config/
sqlite-backup excludes that exist for exactly this reason. TDD-proven with an
isolated rsync --delete -n -v. backup-dr.md notes the dependency.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Round 3 of the forgejo-pull hairpin fix (per Viktor: no per-node
customization — split-brain lives in the DNS infra):
- pfSense Unbound domain override viktorbarzin.me -> Technitium
10.0.20.201 (applied via php write_config, backup on-box). Every
Unbound client on every VLAN now gets the internal split-horizon
answers (live Traefik IP via apex CNAME) with zero per-host config.
- CoreDNS carve-out (TF, applied): dedicated viktorbarzin.me:53 block —
forgejo pinned to Traefik ClusterIP via data source (pods cannot reach
the ETP=Local LB IP pfSense now returns), all other .me names kept on
public resolvers (pods' pre-existing behavior). Replaces the .:53
forgejo rewrite.
- Removed the same-day resolved routing-domain drop-ins from all 7 nodes;
node5/6 link DNS repointed Technitium -> pfSense (netplan + qm 205/206)
for fleet parity; cloud-init no longer writes any DNS drop-ins.
- Docs: dns.md, pfsense-unbound runbook (override + rollback), registry
bullet, post-mortem final-architecture addendum.
Verified: nodes resolve forgejo -> .203 via pfSense, crictl pull OK,
pods resolve forgejo -> ClusterIP / others -> public, mail record works,
.lan zone unaffected.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Supersedes this morning's per-node /etc/hosts pin (no hardcoded service
IPs on nodes, per Viktor). Technitium's split-horizon zone already
resolves forgejo.viktorbarzin.me -> CNAME apex -> live Traefik LB IP
(ingress-dns-sync auto-CNAMEs every ingress host; apex drift probe
alerts) -- the nodes just never queried it. Rolled the devvm's
systemd-resolved routing-domain pattern (~viktorbarzin.me ->
10.0.20.201) to all 7 nodes, removed the pins, verified getent +
crictl pull via pure DNS.
Also demoted node5/6's cloud-init global-dns.conf (DNS=8.8.8.8 1.1.1.1)
to FallbackDNS-only: public servers in the global set race the routing
domain. Its justification ("Technitium NXDOMAINs forgejo") was obsolete
-- exactly the stale comment that pointed new nodes at the hairpin.
hosts.toml mirror kept but documented as vestigial (Traefik 404s
bare-IP requests; registry auth realm is an absolute URL).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
tuya-bridge was down 7.5h (ImagePullBackOff on k8s-node3): fresh kubelet
pulls of forgejo.viktorbarzin.me images depended on the intermittently
broken public-IP hairpin. The containerd hosts.toml mirror cannot keep
pulls internal on its own — Traefik 404s its bare-IP requests (no
Host/SNI match) and the registry Bearer realm is an absolute public URL
fetched outside the mirror. Third incident of this class (buildkit
06-04, tripit/devvm 06-09).
Fix: /etc/hosts pin 10.0.20.203 forgejo.viktorbarzin.me on every node —
covers resolve + token + blob legs with correct SNI and valid cert.
Applied live to all 7 nodes; persisted in the cloud-init bootstrap and
the existing-node rollout script. Docs updated (registry bullet, dns.md
hairpin scope + stale .200 literals, runbook) + post-mortem.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- claude-code: detect via `npm ls -g` not `command -v claude` — the admin's
personal ~/.local/bin/claude shadowed the PATH check, so the system-wide
install never ran (/usr/lib/node_modules/@anthropic-ai empty, no /usr/bin/claude;
fresh non-admins had no claude). Found during the devvm reproducibility audit.
- kubelogin: pin v1.36.2 instead of releases/latest/download, so two fresh boxes
built weeks apart are byte-identical.
- /etc/t3-serve: mkdir before the token writes (install -m doesn't create the
parent — section 8 would fail on a fresh box).
- codex shared auth: stage /opt/codex-shared/auth.json from Vault
secret/workstation.codex_shared_auth_json (key already existed but nothing
consumed it — was a manual step lost on rebuild), mirroring the Claude token.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
First live run produced a valid 40G dump and logged status=0, but the service
exited 1/FAILURE: cleanup() used `[ -n "$KILLED" ] && push_metrics 2 0`, and a
bash EXIT trap whose LAST command returns non-zero overrides the script's
`exit 0`. With KILLED empty the && short-circuits -> returns 1 -> a successful
backup is marked failed (would trip a vzdump staleness/failure alert). Switch to
daily-backup's `if…fi` idiom (returns 0 when not killed). Bug reproduced + fix
verified locally; redeployed to PVE + reset-failed.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The hand-managed Linux VMs (not in Terraform) were never imaged: the
PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost
devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has
no remote).
vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of
VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the
monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent
enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd.
Pushgateway job vzdump-backup.
Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image
layer + protection matrix), infra CLAUDE.md, AGENTS.md.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Completes the 0.0.26 adoption prepared in fcb84ce0 (version-agnostic
dispatch browser-session/bootstrap fallback + Gate-2 real pairing
health-check + per-user state.sqlite backup). 0.0.26 verified
end-to-end on the devvm: emo + ancamilea auto-pair via t3-dispatch
(302 + Set-Cookie t3_session) after migrating state.sqlite 30->32;
pre-cutover backups in /var/backups/t3-state. Brings claude-fable-5
into the t3 model picker.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a
future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):
- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
/api/auth/bootstrap on 404 — one binary pairs across both versions and any
rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
before, green after). Built, deployed, verified live on 0.0.24 (all three
users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie
handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad
build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
state.sqlite (was the only copy, unbacked) -> the one-way forward schema
migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The t3-autoupdate timer (re-enabled by the provisioner's step 5b with
`--now`, which fires the missed daily job immediately on a Persistent
timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema
migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions
role->scopes, +proof_key_thumbprint) AND changed the bootstrap API,
breaking t3-mint/pairing for ALL devvm users (pair prompt, no session).
- t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a
nightly tracker -- re-asserts the pin (a no-op when correct).
- t3-provision-users.sh step 5b: drop `--now` (it triggered the
immediate missed-job run that pulled the bad build).
- setup-devvm.sh: install pinned t3@0.0.24 at machine setup.
- unit Descriptions + service-catalog reflect the pin.
- post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md.
Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled
the (now-pinned) enforcer, reset the 2 new users' disposable DBs,
surgically reverted wizard's auth tables to level-30 (96 threads + live
session preserved). All users verified 302 + t3_session.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09
auth-schema rollback wiped all server-side sessions, browsers kept dead
30-day t3_session cookies; the dispatcher proxied them straight through
and t3 rendered its pair page ("all users must pair again").
Now a present cookie on a top-level document navigation is validated via
the instance's /api/auth/session and re-paired on authenticated:false.
Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html)
so XHR/asset/WebSocket sub-requests are never answered with a 302; fails
open (proxy through) on any validation error. Unit + handler tests added.
[ci skip]
Co-Authored-By: Claude <noreply@anthropic.com>
The port-write used '>' (overwrite), wiping the token injected earlier in the same run for a NEW user (existing users like anca survived only because their .env already had the T3_PORT line). New env_set() does update-or-append per key, preserving others. Verified end-to-end: throwaway t3probe provisioned from scratch -> .env has both T3_PORT + CLAUDE_CODE_OAUTH_TOKEN -> claude -p AUTHOK. So all new non-admins now authenticate automatically. NOT pushed (shared-tree divergence hold).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Non-admins without their own ~/.claude login get the shared long-lived sk-ant-oat01 token injected into their t3-serve env, so their agent authenticates against the admin's subscription. setup-devvm.sh stages it from Vault secret/workstation.claude_oauth_token (root-readable); the provisioner's install_user_claude_token injects per-user, if-absent (never clobbers emo's own login). Live-fixed anca (verified AUTHOK); this codifies it for reproducibility + future users. NOT pushed (shared-tree divergence hold).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible
/v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on
8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/.
Option A off-peak control (no VRAM isolation on the time-sliced T4 — see
post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London
CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00
ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from
gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields
the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at
06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a
skipped/aborted window backfills next time. SA+Role+RoleBinding grant the
CronJobs deployments/scale (nextcloud-watchdog pattern).
Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the
`tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox
keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure
— never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses
the base exclude list (tts untouched there).
tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to
local.app_env (no token — ClusterIP only). No tripit code change.
Image build is documented in stacks/tts/README.md (devnen cu128 target ->
forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline
(large CUDA image + needs the upstream repo). NOT APPLIED — review branch only.
Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the
measured chatterbox-multilingual T4 peak during the first bake.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The per-user launcher hardcoded --model claude-opus-4-8; an explicit --model flag overrides the managed default in /etc/claude-code/managed-settings.json (claude-fable-5). Dropping it lets emo and all new accounts inherit the org default (per-session /model still works). Deployed to /etc/skel and emo live copy in the same change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds a model key (claude-fable-5) to the machine-wide managed-settings.json (installed to /etc/claude-code/ by setup-devvm.sh). Sets the default model for every Claude Code session on the devvm (CLI + t3 web) at top settings precedence; per-session /model and explicit --model flags still override. The org claudeMd block is unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Heavy in-cluster builds (e.g. tripit buildkit) were taking Forgejo down via
two vectors. Fixes both, without moving Forgejo off the sdc HDD (code-oflt
deferred):
- Memory 1Gi -> 3Gi (requests=limits). Forgejo was OOMKilled (exit 137) under
registry-push load; VPA upperBound ~1.5Gi was suppressed by the 1Gi cap it
kept OOMing against. Size for the push spike.
- Activate registry retention (DRY_RUN false). Verified the delete list
against all running viktor/* images first: 0 running images affected.
Pruned 478 -> 161 package versions; PVC was at its 50Gi autoresize ceiling.
- FIX broken retention auth: the cleanup PAT was ci-pusher's, but Forgejo
scopes container packages per-user, so DELETE on viktor/* returned 403 (the
dry-run only did GETs, hiding it). Repointed forgejo_cleanup_token to
viktor's write:package PAT. Retention had never actually worked.
- Protect buildkit *cache* tags from retention (cleanup.sh keep-set) so the
gentler-builds layer cache survives daily pruning.
[ci skip] — already applied via scripts/tg.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
First live run produced a valid 40G dump and logged status=0, but the service
exited 1/FAILURE: cleanup() used `[ -n "$KILLED" ] && push_metrics 2 0`, and a
bash EXIT trap whose LAST command returns non-zero overrides the script's
`exit 0`. With KILLED empty the && short-circuits -> returns 1 -> a successful
backup is marked failed (would trip a vzdump staleness/failure alert). Switch to
daily-backup's `if…fi` idiom (returns 0 when not killed). Bug reproduced + fix
verified locally; redeployed to PVE + reset-failed.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The hand-managed Linux VMs (not in Terraform) were never imaged: the
PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost
devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has
no remote).
vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of
VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the
monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent
enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd.
Pushgateway job vzdump-backup.
Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image
layer + protection matrix), infra CLAUDE.md, AGENTS.md.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Completes the 0.0.26 adoption prepared in fcb84ce0 (version-agnostic
dispatch browser-session/bootstrap fallback + Gate-2 real pairing
health-check + per-user state.sqlite backup). 0.0.26 verified
end-to-end on the devvm: emo + ancamilea auto-pair via t3-dispatch
(302 + Set-Cookie t3_session) after migrating state.sqlite 30->32;
pre-cutover backups in /var/backups/t3-state. Brings claude-fable-5
into the t3 model picker.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>