infra

Author	SHA1	Message	Date
Viktor Barzin	00bc1e052d	technitium: mirror mail-auth records into internal zone; fix redfish check [ci skip] Two fixes from the post-DNS-internalization health sweep: 1. The internal viktorbarzin.me zone served only ingress A/CNAME records. Since the mailserver pods now resolve the domain through it (CoreDNS viktorbarzin.me:53 -> Technitium, `59a531b8`), rspamd's SPF checks on inbound @viktorbarzin.me mail saw SPF=none and quarantined it — the Brevo email-roundtrip probe failed from the 16:20 run onward (EmailRoundtripFailing/Stale). The ingress-dns-sync CronJob now also maintains the static mail-auth records (SPF, brevo-code TXT, MX; DMARC + DKIM were already present), idempotently. Principle: the internal zone must be a SUPERSET of the public zone for every record type internal clients consume. Verified in-pod: all four types resolve; roundtrip re-probe green. 2. cluster_healthcheck #30 queried instant `up`, which goes stale for ~5 of every 10 minutes on the deliberate 10m redfish-idrac remnant job -> intermittent false "redfish-idrac=missing". Now uses last_over_time(up[15m]) — same answers for fast jobs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:46:37 +00:00
Viktor Barzin	e7fbf986fb	workstation: rename tmux persistence out of the t3 namespace [ci skip] Viktor's correction: this feature is about the tmux web-terminal sessions, not t3 — t3 auto-saves its own threads (~/.t3 state + daily t3-backup-state). Renamed t3-tmux-sessions -> tmux-persist (units tmux-persist-save.timer / tmux-persist-restore.service, state /var/lib/tmux-persist), header rescoped to say exactly that. Same mechanism, correct taxonomy. Old units removed, state migrated, re-verified live (5 emo + 3 wizard sessions snapshotted). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:42:52 +00:00
Viktor Barzin	2e4f48f3fc	workstation: tmux sessions survive devvm reboots (save timer + boot restore) All checks were successful ci/woodpecker/push/postmortem-todos Pipeline was successful Details ci/woodpecker/push/default Pipeline was successful Details ci/woodpecker/push/build-cli Pipeline was successful Details Viktor: emo's open web-terminal sessions must persist across reboots. Claude conversations were already durable on disk; the volatile part was the tmux wiring (which named session runs which conversation). t3-tmux-sessions save (5-min timer) snapshots every roster user's sessions to /var/lib/t3-tmux-state/<user>.tsv — conversation uuid taken from argv --resume (self-sustaining once restored) or the newest transcript in the cwd-slug project dir created after process start (fresh launcher sessions; claude does NOT hold its transcript fd open, so fd-sniffing was a dead end). t3-tmux-sessions restore (boot oneshot, also safe after partial loss) recreates missing sessions with claude --resume <uuid>. Reconciler self-heals both units' enablement. Verified live: emo's 5 sessions snapshotted with correct uuids; killed R730-cooling -> restore brought it back resuming the same conversation (context meter identical); other sessions untouched. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:39:32 +00:00
Viktor Barzin	35c89fa90c	workstation: managed Claude config self-deploys from the repo [ci skip] Viktor's claudeMd edits must keep reaching every user now that emo is out of the shared tree. Two reconciler additions: - sync_managed_config: installs scripts/workstation/managed-settings.json to /etc/claude-code whenever the repo copy changes — editing the org claudeMd is now edit + commit, no manual install step - refresh_codex_mirror: regenerates each user's ~/.codex/AGENTS.md (static mirror of the claudeMd; header-guarded so user-customized files are never clobbered) Verified live: corrupted emo's mirror -> reconcile restored it; wizard's stale mirror refreshed; in-sync managed config no-ops. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 16:03:24 +00:00
Viktor Barzin	8cfd0e5e5c	Merge forgejo/master: reconcile diverged lineages [ci skip] Local checkout carried the 2026-06-10 DNS/registry architecture series (pfSense forward-zone, CoreDNS viktorbarzin.me:53 carve-out, nodes stock) + vzdump/nfs-mirror/workstation-rebuild commits that never reached the canonical remote, while forgejo master received the emo-access series via isolated worktrees. Viktor asked to merge. Conflict resolutions (newest iteration wins in each file): - stacks/forgejo/cleanup.tf: LOCAL — dry_run=true (2026-06-10 revert after live retention orphaned OCI indexes; remote had 06-09 enable) - .claude/CLAUDE.md, docs/architecture/backup-dr.md: LOCAL — final registry/DNS architecture + implemented vzdump alerts - scripts/workstation/setup-devvm.sh: LOCAL — pinned-version, reproducible-rebuild refactor (kubelogin pin, restructured staging) - scripts/workstation/managed-settings.json: FORGEJO — the allow-then-audit claudeMd (matches /etc deployment byte-for-byte) - scripts/t3-provision-users.sh: FORGEJO comment; refresh_locked_clone intact [ci skip]: all stack changes in the local lineage were applied live this morning — CI would re-walk 100+ stacks via the modules/ fallback for zero state change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:21:50 +00:00
Viktor Barzin	a49d1eadf6	workstation: emo direct master push — allow-then-audit [ci skip] Viktor: emo may make any change; what matters is tracking what changed and why. ebarzin added to master push+merge whitelists (force-push stays disabled — append-only history). Tracking enforced three ways: - agent instructions (managed claudeMd + AGENTS.md): commit body MUST carry the user's plain-language intent; commits land on master directly; [ci skip] forbidden for non-admins - new notify-nonadmin-push step in .woodpecker/default.yml: Slack message for every non-admin master push (admin pushes silent) - PR flow remains the fallback for non-whitelisted users Accepted consequence (informed): emo's pushes auto-apply changed stacks via CI. Offboard runbook gains whitelist-removal step. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 14:53:43 +00:00
Viktor Barzin	6d8773573c	workstation: agent-driven contribute flow for non-technical users [ci skip] emo can't use git — his agent must do all VCS mechanics invisibly. Managed claudeMd (every session, top precedence) now instructs agents: commit -> push <os-user>/<topic> branch -> open PR via Forgejo API (user's PAT from ~/.git-credentials) -> back to clean master -> tell the user in plain words it's submitted for review. AGENTS.md carries the full recipe with the curl call. Verified live as emo: PR #1 opened (HTTP 201, write:repository scope suffices) and closed via his PAT. Deployed to /etc/claude-code/managed-settings.json; codex AGENTS.md mirrors for emo + ancamilea regenerated from the new claudeMd. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 10:12:26 +00:00
Viktor Barzin	2e5af5dc0e	workstation: keep non-admin infra clones fresh (hourly + at launch) [ci skip] Non-admins (emo) need current master without manual pulls. Two layers: - t3-provision-users reconcile gains refresh_locked_clone: fetch all remotes + ff-only master, guarded (on master, clean tree, upstream set); dirty/diverged clones are left alone with a WARN. - start-claude.sh freshens ~/code at session launch, 15s-capped so an offline remote never delays the session. Verified live on emo's clone: stale clone ff'd to tip by the reconciler; launcher snippet ff's when clean and refuses while a dirty file exists. Deployed to /usr/local/bin/t3-provision-users, /etc/skel/start-claude.sh, and emo's launcher. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 09:41:38 +00:00
Viktor Barzin	05f928931f	workstation: packages.txt — add provisioner build deps + uncaptured core tools setup-devvm.sh now needs golang-go (builds t3-dispatch in section 9) and uses unzip (kubelogin extraction); neither was in the manifest, so a fresh box would skip the t3-dispatch build. Also add build-essential (cgo / npm native modules) + core tools that were manually-installed but uncaptured (rsync, wget, tree, shellcheck). Noted gh as non-apt (GitHub's own repo). All verified to resolve in apt. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:08:53 +00:00
Viktor Barzin	312c418a9a	workstation: setup-devvm.sh installs the systemd service layer (reproducible rebuild) The t3 system units (t3-serve@, t3-autoupdate, t3-backup-state, t3-provision-users, t3-dispatch) + the t3-dispatch Go binary + t3-mint + the sudoers grant were all hand-scp'd and would NOT survive a fresh devvm. setup-devvm.sh now installs + enables them: build-if-absent for the Go binary, visudo-validated sudoers (a malformed /etc/sudoers.d file breaks all sudo), timers self-heal, t3-dispatch system account created if absent. t3-serve@ stays a per-user template enabled by the provisioner; the ttyd terminal-lobby chain ships from its own repo (viktor/terminal-lobby). Verified: shellcheck clean, go build compiles, visudo parses the sudoers, units parse. NOT run live (would re-assert apt/npm on the shared host) — exercised on next rebuild. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:07:20 +00:00
Viktor Barzin	d9ea7812f5	nfs-mirror: exclude /vzdump/ — it was reaping the new VM-image backups nightly nfs-mirror does `rsync -rlt --delete /srv/nfs/ -> /mnt/backup/`; any /mnt/backup dir with no /srv/nfs counterpart is an orphan and gets --delete'd. vzdump-vms (added yesterday) writes /mnt/backup/vzdump/, which wasn't excluded — so the 02:00 nfs-mirror run silently deleted both successful 40G devvm images (verified: dir gone, 40G freed, despite status=0 success logs). Add --exclude='/vzdump/' alongside the existing pvc-data/pfsense/pve-config/ sqlite-backup excludes that exist for exactly this reason. TDD-proven with an isolated rsync --delete -n -v. backup-dr.md notes the dependency. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:04:57 +00:00
Viktor Barzin	2b8c0def30	dns: pfSense forward-zone for viktorbarzin.me, nodes fully stock [ci skip] Round 3 of the forgejo-pull hairpin fix (per Viktor: no per-node customization — split-brain lives in the DNS infra): - pfSense Unbound domain override viktorbarzin.me -> Technitium 10.0.20.201 (applied via php write_config, backup on-box). Every Unbound client on every VLAN now gets the internal split-horizon answers (live Traefik IP via apex CNAME) with zero per-host config. - CoreDNS carve-out (TF, applied): dedicated viktorbarzin.me:53 block — forgejo pinned to Traefik ClusterIP via data source (pods cannot reach the ETP=Local LB IP pfSense now returns), all other .me names kept on public resolvers (pods' pre-existing behavior). Replaces the .:53 forgejo rewrite. - Removed the same-day resolved routing-domain drop-ins from all 7 nodes; node5/6 link DNS repointed Technitium -> pfSense (netplan + qm 205/206) for fleet parity; cloud-init no longer writes any DNS drop-ins. - Docs: dns.md, pfsense-unbound runbook (override + rollback), registry bullet, post-mortem final-architecture addendum. Verified: nodes resolve forgejo -> .203 via pfSense, crictl pull OK, pods resolve forgejo -> ClusterIP / others -> public, mail record works, .lan zone unaffected. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 08:32:34 +00:00
Viktor Barzin	1ee1bf0817	forgejo pulls: route *.viktorbarzin.me to Technitium, drop /etc/hosts pins [ci skip] Supersedes this morning's per-node /etc/hosts pin (no hardcoded service IPs on nodes, per Viktor). Technitium's split-horizon zone already resolves forgejo.viktorbarzin.me -> CNAME apex -> live Traefik LB IP (ingress-dns-sync auto-CNAMEs every ingress host; apex drift probe alerts) -- the nodes just never queried it. Rolled the devvm's systemd-resolved routing-domain pattern (~viktorbarzin.me -> 10.0.20.201) to all 7 nodes, removed the pins, verified getent + crictl pull via pure DNS. Also demoted node5/6's cloud-init global-dns.conf (DNS=8.8.8.8 1.1.1.1) to FallbackDNS-only: public servers in the global set race the routing domain. Its justification ("Technitium NXDOMAINs forgejo") was obsolete -- exactly the stale comment that pointed new nodes at the hairpin. hosts.toml mirror kept but documented as vestigial (Traefik 404s bare-IP requests; registry auth realm is an absolute URL). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 07:56:31 +00:00
Viktor Barzin	b6976ce014	forgejo pulls: pin registry name to internal Traefik in node /etc/hosts [ci skip] tuya-bridge was down 7.5h (ImagePullBackOff on k8s-node3): fresh kubelet pulls of forgejo.viktorbarzin.me images depended on the intermittently broken public-IP hairpin. The containerd hosts.toml mirror cannot keep pulls internal on its own — Traefik 404s its bare-IP requests (no Host/SNI match) and the registry Bearer realm is an absolute public URL fetched outside the mirror. Third incident of this class (buildkit 06-04, tripit/devvm 06-09). Fix: /etc/hosts pin 10.0.20.203 forgejo.viktorbarzin.me on every node — covers resolve + token + blob legs with correct SNI and valid cert. Applied live to all 7 nodes; persisted in the cloud-init bootstrap and the existing-node rollout script. Docs updated (registry bullet, dns.md hairpin scope + stale .200 literals, runbook) + post-mortem. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 07:15:24 +00:00
Viktor Barzin	eb8695743b	workstation: fix setup-devvm.sh provisioner correctness (claude detect, kubelogin pin, codex auth, t3-serve dir) - claude-code: detect via `npm ls -g` not `command -v claude` — the admin's personal ~/.local/bin/claude shadowed the PATH check, so the system-wide install never ran (/usr/lib/node_modules/@anthropic-ai empty, no /usr/bin/claude; fresh non-admins had no claude). Found during the devvm reproducibility audit. - kubelogin: pin v1.36.2 instead of releases/latest/download, so two fresh boxes built weeks apart are byte-identical. - /etc/t3-serve: mkdir before the token writes (install -m doesn't create the parent — section 8 would fail on a fresh box). - codex shared auth: stage /opt/codex-shared/auth.json from Vault secret/workstation.codex_shared_auth_json (key already existed but nothing consumed it — was a manual step lost on rebuild), mirroring the Claude token. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:54 +00:00
Viktor Barzin	8886ac7763	backup: fix vzdump-vms exit code — EXIT-trap && short-circuit falsely failed OK runs First live run produced a valid 40G dump and logged status=0, but the service exited 1/FAILURE: cleanup() used `[ -n "$KILLED" ] && push_metrics 2 0`, and a bash EXIT trap whose LAST command returns non-zero overrides the script's `exit 0`. With KILLED empty the && short-circuits -> returns 1 -> a successful backup is marked failed (would trip a vzdump staleness/failure alert). Switch to daily-backup's `if…fi` idiom (returns 0 when not killed). Bug reproduced + fix verified locally; redeployed to PVE + reset-failed. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:54 +00:00
Viktor Barzin	7330cb6a0b	backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap The hand-managed Linux VMs (not in Terraform) were never imaged: the PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has no remote). vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd. Pushgateway job vzdump-backup. Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image layer + protection matrix), infra CLAUDE.md, AGENTS.md. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:54 +00:00
Viktor Barzin	3e7093947d	t3: bump pin 0.0.24 -> 0.0.26 (fable-5) [ci skip] Completes the 0.0.26 adoption prepared in fcb84ce0 (version-agnostic dispatch browser-session/bootstrap fallback + Gate-2 real pairing health-check + per-user state.sqlite backup). 0.0.26 verified end-to-end on the devvm: emo + ancamilea auto-pair via t3-dispatch (302 + Set-Cookie t3_session) after migrating state.sqlite 30->32; pre-cutover backups in /var/backups/t3-state. Brings claude-fable-5 into the t3 model picker. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	dacd9d2d8a	t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip] Investigated the 0.0.25 break: it is ONLY an endpoint rename (/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing contract (credential payload, t3_session cookie, /api/auth/session) is byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep): - t3-dispatch: autoPair tries /api/auth/browser-session, falls back to /api/auth/bootstrap on 404 — one binary pairs across both versions and any rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25 before, green after). Built, deployed, verified live on 0.0.24 (all three users still 302 + t3_session via the fallback). - t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad build now auto-rolls-back. Validated against both versions. - t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3 state.sqlite (was the only copy, unbacked) -> the one-way forward schema migration becomes a restore, not sqlite surgery. timeout-guarded. - runbooks/t3-version-bump.md: the reversible cutover checklist. - post-mortem #5 (health-check) DONE + #6 added; service-catalog updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	baac46415f	t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip] The t3-autoupdate timer (re-enabled by the provisioner's step 5b with `--now`, which fires the missed daily job immediately on a Persistent timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions role->scopes, +proof_key_thumbprint) AND changed the bootstrap API, breaking t3-mint/pairing for ALL devvm users (pair prompt, no session). - t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a nightly tracker -- re-asserts the pin (a no-op when correct). - t3-provision-users.sh step 5b: drop `--now` (it triggered the immediate missed-job run that pulled the bad build). - setup-devvm.sh: install pinned t3@0.0.24 at machine setup. - unit Descriptions + service-catalog reflect the pin. - post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md. Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled the (now-pinned) enforcer, reset the 2 new users' disposable DBs, surgically reverted wizard's auth tables to level-30 (96 threads + live session preserved). All users verified 302 + t3_session. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	41c11216da	t3-dispatch: re-pair on present-but-invalid t3_session cookie The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09 auth-schema rollback wiped all server-side sessions, browsers kept dead 30-day t3_session cookies; the dispatcher proxied them straight through and t3 rendered its pair page ("all users must pair again"). Now a present cookie on a top-level document navigation is validated via the instance's /api/auth/session and re-paired on authenticated:false. Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html) so XHR/asset/WebSocket sub-requests are never answered with a 302; fails open (proxy through) on any validation error. Unit + handler tests added. [ci skip] Co-Authored-By: Claude <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	e0ab621cb2	workstation: fix new-user .env clobber — env_set preserves CLAUDE_CODE_OAUTH_TOKEN The port-write used '>' (overwrite), wiping the token injected earlier in the same run for a NEW user (existing users like anca survived only because their .env already had the T3_PORT line). New env_set() does update-or-append per key, preserving others. Verified end-to-end: throwaway t3probe provisioned from scratch -> .env has both T3_PORT + CLAUDE_CODE_OAUTH_TOKEN -> claude -p AUTHOK. So all new non-admins now authenticate automatically. NOT pushed (shared-tree divergence hold). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	39e35ca8c9	workstation: share admin Claude subscription with non-admins (CLAUDE_CODE_OAUTH_TOKEN) Non-admins without their own ~/.claude login get the shared long-lived sk-ant-oat01 token injected into their t3-serve env, so their agent authenticates against the admin's subscription. setup-devvm.sh stages it from Vault secret/workstation.claude_oauth_token (root-readable); the provisioner's install_user_claude_token injects per-user, if-absent (never clobbers emo's own login). Live-fixed anca (verified AUTHOK); this codifies it for reproducibility + future users. NOT pushed (shared-tree divergence hold). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	4b44db36da	workstation: skel start-claude.sh inherits managed default model (drop hardcoded --model) The per-user launcher hardcoded --model claude-opus-4-8; an explicit --model flag overrides the managed default in /etc/claude-code/managed-settings.json (claude-fable-5). Dropping it lets emo and all new accounts inherit the org default (per-session /model still works). Deployed to /etc/skel and emo live copy in the same change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	64413c76ce	workstation: default Claude model = claude-fable-5 for all devvm users Adds a model key (claude-fable-5) to the machine-wide managed-settings.json (installed to /etc/claude-code/ by setup-devvm.sh). Sets the default model for every Claude Code session on the devvm (CLI + t3 web) at top settings precedence; per-session /model and explicit --model flags still override. The org claudeMd block is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	bc37b16815	backup: fix vzdump-vms exit code — EXIT-trap && short-circuit falsely failed OK runs First live run produced a valid 40G dump and logged status=0, but the service exited 1/FAILURE: cleanup() used `[ -n "$KILLED" ] && push_metrics 2 0`, and a bash EXIT trap whose LAST command returns non-zero overrides the script's `exit 0`. With KILLED empty the && short-circuits -> returns 1 -> a successful backup is marked failed (would trip a vzdump staleness/failure alert). Switch to daily-backup's `if…fi` idiom (returns 0 when not killed). Bug reproduced + fix verified locally; redeployed to PVE + reset-failed. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:30:19 +00:00
Viktor Barzin	83f418159a	backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap The hand-managed Linux VMs (not in Terraform) were never imaged: the PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has no remote). vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd. Pushgateway job vzdump-backup. Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image layer + protection matrix), infra CLAUDE.md, AGENTS.md. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:30:19 +00:00
Viktor Barzin	7fc4caefe3	t3: bump pin 0.0.24 -> 0.0.26 (fable-5) [ci skip] Completes the 0.0.26 adoption prepared in fcb84ce0 (version-agnostic dispatch browser-session/bootstrap fallback + Gate-2 real pairing health-check + per-user state.sqlite backup). 0.0.26 verified end-to-end on the devvm: emo + ancamilea auto-pair via t3-dispatch (302 + Set-Cookie t3_session) after migrating state.sqlite 30->32; pre-cutover backups in /var/backups/t3-state. Brings claude-fable-5 into the t3 model picker. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	bccaa08d8e	t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip] Investigated the 0.0.25 break: it is ONLY an endpoint rename (/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing contract (credential payload, t3_session cookie, /api/auth/session) is byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep): - t3-dispatch: autoPair tries /api/auth/browser-session, falls back to /api/auth/bootstrap on 404 — one binary pairs across both versions and any rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25 before, green after). Built, deployed, verified live on 0.0.24 (all three users still 302 + t3_session via the fallback). - t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad build now auto-rolls-back. Validated against both versions. - t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3 state.sqlite (was the only copy, unbacked) -> the one-way forward schema migration becomes a restore, not sqlite surgery. timeout-guarded. - runbooks/t3-version-bump.md: the reversible cutover checklist. - post-mortem #5 (health-check) DONE + #6 added; service-catalog updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	5ea238c707	t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip] The t3-autoupdate timer (re-enabled by the provisioner's step 5b with `--now`, which fires the missed daily job immediately on a Persistent timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions role->scopes, +proof_key_thumbprint) AND changed the bootstrap API, breaking t3-mint/pairing for ALL devvm users (pair prompt, no session). - t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a nightly tracker -- re-asserts the pin (a no-op when correct). - t3-provision-users.sh step 5b: drop `--now` (it triggered the immediate missed-job run that pulled the bad build). - setup-devvm.sh: install pinned t3@0.0.24 at machine setup. - unit Descriptions + service-catalog reflect the pin. - post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md. Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled the (now-pinned) enforcer, reset the 2 new users' disposable DBs, surgically reverted wizard's auth tables to level-30 (96 threads + live session preserved). All users verified 302 + t3_session. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	2125651aaa	t3-dispatch: re-pair on present-but-invalid t3_session cookie The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09 auth-schema rollback wiped all server-side sessions, browsers kept dead 30-day t3_session cookies; the dispatcher proxied them straight through and t3 rendered its pair page ("all users must pair again"). Now a present cookie on a top-level document navigation is validated via the instance's /api/auth/session and re-paired on authenticated:false. Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html) so XHR/asset/WebSocket sub-requests are never answered with a 302; fails open (proxy through) on any validation error. Unit + handler tests added. [ci skip] Co-Authored-By: Claude <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	fad10a8707	workstation: fix new-user .env clobber — env_set preserves CLAUDE_CODE_OAUTH_TOKEN The port-write used '>' (overwrite), wiping the token injected earlier in the same run for a NEW user (existing users like anca survived only because their .env already had the T3_PORT line). New env_set() does update-or-append per key, preserving others. Verified end-to-end: throwaway t3probe provisioned from scratch -> .env has both T3_PORT + CLAUDE_CODE_OAUTH_TOKEN -> claude -p AUTHOK. So all new non-admins now authenticate automatically. NOT pushed (shared-tree divergence hold). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	eeadf0f85d	workstation: share admin Claude subscription with non-admins (CLAUDE_CODE_OAUTH_TOKEN) Non-admins without their own ~/.claude login get the shared long-lived sk-ant-oat01 token injected into their t3-serve env, so their agent authenticates against the admin's subscription. setup-devvm.sh stages it from Vault secret/workstation.claude_oauth_token (root-readable); the provisioner's install_user_claude_token injects per-user, if-absent (never clobbers emo's own login). Live-fixed anca (verified AUTHOK); this codifies it for reproducibility + future users. NOT pushed (shared-tree divergence hold). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	68a237faf7	workstation: skel start-claude.sh inherits managed default model (drop hardcoded --model) The per-user launcher hardcoded --model claude-opus-4-8; an explicit --model flag overrides the managed default in /etc/claude-code/managed-settings.json (claude-fable-5). Dropping it lets emo and all new accounts inherit the org default (per-session /model still works). Deployed to /etc/skel and emo live copy in the same change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 19:35:29 +00:00
Viktor Barzin	64f405db36	workstation: default Claude model = claude-fable-5 for all devvm users Adds a model key (claude-fable-5) to the machine-wide managed-settings.json (installed to /etc/claude-code/ by setup-devvm.sh). Sets the default model for every Claude Code session on the devvm (CLI + t3 web) at top settings precedence; per-session /model and explicit --model flags still override. The org claudeMd block is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 18:31:27 +00:00
Viktor Barzin	1e6e5c4ee9	t3code: enable t3-autoupdate.timer from the hourly provisioner The unit files (t3-autoupdate.{timer,service,sh}) were committed but nothing ever enabled the timer, so it sat `disabled` and every t3-serve@ instance silently froze on an old t3 build (all users were on v0.0.24 while nightly was 0.0.25-nightly.20260608). Enable it from the hourly reconciler (not the once-at-provision setup-devvm.sh) so it self-heals if ever disabled again. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 14:09:55 +00:00
Viktor Barzin	fd0f4a0365	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip] `6d224861` came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:45:33 +00:00
Viktor Barzin	6d224861c4	stem95su: scheduled Drive->site sync CronJob (every 10m) CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard + --max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault secret/stem95su. Requires the GCP OAuth app published to Production or the refresh token expires ~weekly. Lands the gdrive-sync stack on master (it had landed on a feature branch by accident on the shared devvm checkout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:42:26 +00:00
Viktor Barzin	06f5c12476	workstation: setup-devvm.sh hardens the admin's unlocked tree (o-rx, not world-readable) All checks were successful ci/woodpecker/push/default Pipeline was successful Details ci/woodpecker/push/build-cli Pipeline was successful Details Codifies the leak fix found during the emo cutover: /home/wizard/code is git-crypt-DECRYPTED in the admin's working tree, but was mode 0775 (o+rx) — so any devvm user (even outside code-shared) could read decrypted secrets by path (verified: emo read certificate.pfx as plaintext DER). setup-devvm.sh now chmod o-rx the admin tree so a rebuild keeps it. Live fix already applied (now drwxrws---). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 18:08:52 +00:00
Viktor Barzin	173b1fc116	workstation: per-user OIDC kubectl — power-user-readonly RBAC + kubeconfig (Phase 2.2) All checks were successful ci/woodpecker/push/default Pipeline was successful Details ci/woodpecker/push/build-cli Pipeline was successful Details New oidc-power-user-readonly ClusterRole (cluster-wide get/list/watch, NO secrets/exec/write); the power-user binding re-pointed to it (the existing read+write+secrets oidc-power-user role is retained but UNBOUND per ADR-0005). Applied to the rbac stack (2 add, 1 change, 0 destroy). emo added to Vault k8s_users (secret/platform) as power-user, email emil.barzin@gmail.com — the OIDC email IS the Authentik username (verified live). Verified via impersonation: emo gets cluster-wide read, NO secrets/write/exec/delete; anca unchanged. Provisioner: install_user_kubeconfig writes a per-user OIDC kubeconfig (kubelogin/PKCE — the kubernetes Authentik client is public, no secret; server+CA copied from the admin kubeconfig) if-absent. Written for emo + ancamilea (0600). End-to-end login is interactive (browser OIDC); verified config validity + RBAC, not the live browser flow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 17:47:00 +00:00
Viktor Barzin	08bf1e0a3a	workstation: per-user writable git-crypt-locked infra clone (Phase 3.1) All checks were successful ci/woodpecker/push/default Pipeline was successful Details ci/woodpecker/push/build-cli Pipeline was successful Details install_locked_clone: non-admins get their OWN ~/code = a keyless clone of the public infra repo (the monorepo has no remote, so the locked clone is of infra). filter.git-crypt=cat + --no-checkout ⇒ code/docs plaintext, secret files (.tfvars/.tfstate/secrets/**) stay \0GITCRYPT\0 ciphertext. Writable + ungated (push != apply). Skip-if-exists ⇒ never touches emo's existing ~/code symlink (gated cutover handles that). Verified live on ancamilea: secrets ciphertext, code plaintext, commit works, emo untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 14:23:57 +00:00
Viktor Barzin	2c1865eabb	workstation: roster-driven provisioner (SSoT reconcile, additive-only) All checks were successful ci/woodpecker/push/default Pipeline was successful Details ci/woodpecker/push/build-cli Pipeline was successful Details t3-provision-users.sh now consumes roster_engine.py: derives accounts + per-tier groups + sticky ports + /etc/ttyd-user-map + dispatch.json from roster.yaml and applies them. ADDITIVE-ONLY for existing users (never strips a group, replaces a home, or re-locks an account) so the hourly timer is always safe. Best-effort tier validation vs live k8s_users: warns on a net-new absent user (emo), aborts only on a real tier conflict, skips when root has no Vault token. DRY_RUN mode for safe testing. Verified on the live host: reproduces dispatch.json content exactly, emo/anca groups + all t3-serve instances unchanged, idempotent, shellcheck-clean; deployed to /usr/local/bin (hourly timer target). Engine: validate_tiers now returns ValidationIssue(severity) — error=conflict (abort) vs warn=absent (grant pending) — + has_blocking_errors(); 28 pytest cases. setup-devvm.sh redeploys the provisioner for reproducibility. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 14:18:12 +00:00
Viktor Barzin	1757cb59e7	workstation: machine-wide config inheritance (managed claudeMd + setup-devvm.sh + skel) All checks were successful ci/woodpecker/push/default Pipeline was successful Details ci/woodpecker/push/build-cli Pipeline was successful Details Spike confirmed (claude 2.1.168): /etc/claude-code/managed-settings.json claudeMd reaches a session (sentinel echoed). Hybrid inheritance = enforced org claudeMd machine-wide (top precedence, non-overridable) + per-user ~/.claude/{skills,rules,...} symlinks to the config base (live, the proven emo pattern) seeded via /etc/skel. setup-devvm.sh is idempotent: apt toolset, node>=18 + claude-code, system-wide kubelogin (NOT the Azure apt pkg), the managed config, and /etc/skel (launcher that cd's $HOME/code, tmux UX, inheritance symlinks). Verified: emo unchanged (groups/symlinks/live sessions intact), emo can read the managed config, idempotent re-run clean. Security fix (host state): /home/wizard/.claude/settings.json was 0664, exposing MEMORY_API_KEY to all devvm users -> chmod 0600. chezmoi source needs a private_ prefix + the key templated out to persist this (dotfiles-repo follow-up). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 14:07:04 +00:00
Viktor Barzin	3033e2c355	workstation: roster source-of-truth + host package manifest [ci skip] roster.yaml is the single source of truth for the devvm Workstation lifecycle (os_user -> authentik_user/k8s_user/tier/namespaces); wizard listed as admin so the regenerated ttyd-map/dispatch never drops his instance. packages.txt is the declarative apt toolset (non-apt tools — node/claude-code/kubectl/vault/kubelogin — noted with their real install paths; the apt pkg named 'kubelogin' is the wrong Azure tool). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 13:38:20 +00:00
Viktor Barzin	7ab4c1e1e2	workstation: tested roster derivation + offboarding-diff engine [ci skip] Pure functional core (PRD ViktorBarzin/infra#9 modules #1 roster engine + #5 offboarding diff) that the bash provisioner will consume as JSON: roster parse/validate, fail-loud tier-vs-k8s_users check, sticky-port + ttyd-map + dispatch derivation, additive-only group reconcile, and the staged offboarding diff (reversible cut vs gated userdel, never auto). 27 pytest cases, ruff-clean; no host I/O in the tested path. Verified to reproduce the live dispatch.json byte-for-byte from the real roster. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 13:38:06 +00:00
Viktor Barzin	d4ec5768b2	vault-token-renew: version the devvm renewer + user units in the repo The devvm periodic Vault admin token (token-devvm-wizard, period=768h, policies default+sops-admin+vault-admin) is kept alive by a systemd user timer, but the renewer script + units lived only under ~/.local/bin and ~/.config/systemd/user — lost on a devvm rebuild. Move them into the repo as the source of truth so a rebuild can restore them. (version-only scope: behavior unchanged; no canonical-file/self-heal added.) - scripts/vault-token-renew.{sh,service,timer}: renewer + user units, refactored into pure drift-guard functions + a guarded main (behavior identical; deployed live and verified still renewing with full write access). - scripts/test-vault-token-renew.sh: unit-tests the drift guard + lookup-JSON parsing, incl. the 2026-06-05 woodpecker-clobber case (17 assertions). - docs/runbooks/vault-token-renew-devvm.md: deploy, mint/re-mint, health-check, drift recovery. - docs/architecture/secrets.md: correct the stale '~/.vault-token = OIDC token' description for devvm. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 22:10:06 +00:00
Viktor Barzin	551412488b	apiserver: enable audit logging (low-write Metadata) + ship to Loki Some checks failed ci/woodpecker/push/default Pipeline failed Details ci/woodpecker/push/build-cli Pipeline was successful Details Resource changes/deletions are now attributable (the novelapp deletion this week was untraceable because apiserver audit was off). Low-write policy: drops reads/noise, Metadata level on mutations, omitStages RequestReceived. Wired into the kube-apiserver static-pod manifest + kubeadm-config (v1beta4 extraArgs/extraVolumes -> survives kubeadm upgrade) on k8s-master; Alloy tails /var/log/kubernetes/audit/audit.log -> Loki {job=kubernetes-audit}. Root cause that had silently blocked this AND OIDC for weeks: a stray kube-apiserver.yaml.bak inside /etc/kubernetes/manifests/ was a duplicate static-pod manifest kubelet ran instead of the real one, dropping every flag added to the real manifest. Removed it. Runbook added. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	51456a96f6	fan-control: estimate + expose fan power (fan_watts_est) The iDRAC reports only total DCMI watts + RPM (no per-fan power), so add a cube-law fan-power estimate: fan_W ~= 0.0205*(RPM/1000)^3, calibrated to the 2026-06-05 sweep (fits within ~3W; ~2W floor -> ~99W full). The daemon reads live RPM each loop and pushes pve_fan_control_fan_rpm + _fan_watts_est. Surfaced in HA as sensor.r730_fan_power_est + a "Fan Power (est)" card on the dashboard-it Server view, next to total power. 46 bash tests green; verified live (9120rpm -> ~15W est). [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 11:10:27 +00:00
Viktor Barzin	324f2dc3bf	fan-control: continuous linear curve (replaces discrete step-bands) Replace the step-band fan curve with a continuous linear ramp — the bands flapped at edges (e.g. 45<->65%). Web-researched: linear + 2-3C hysteresis is the homelab standard; PID is overkill for this slow thermal loop. fan% now interpolates between env-tunable anchors: COOL 50C/30% -> 83C/100% (~2.1%/C; ~51% at the ~60C equilibrium) QUIET 68C/20% -> 83C/100% (near-silent until ~70C) Both reach 100% at the 83C ceiling. Anti-oscillation: asymmetric hysteresis (fc_decide) + a MIN_STEP (3%) min-change threshold. 41 bash tests green; deployed + verified live (59C -> 49%, smooth). [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 10:29:35 +00:00
Viktor Barzin	8beca1dfc7	fan-control: read HA mode/manual-% setpoint (HA fan control) The host daemon now polls input_select.r730_fan_mode (auto/cool/quiet/ manual) + input_number.r730_fan_manual_pct from ha-sofia each loop and routes through fc_resolve: manual holds a fixed %, cool/quiet force that curve, auto keeps the garage-presence behaviour. CEILING still overrides. Ships HA control now on the running host daemon (no Vault); the cluster CronJob migration stays the eventual Terraform home (same logic). HA side (on ha-sofia, auto-git-tracked there): two helpers, an auto- revert-to-auto automation (60min), mode + %-slider control tiles on the dashboard-it Server view. Verified end-to-end: HA manual 70% -> fans 12720rpm; revert to auto -> presence curve 50%. 10 new pure-function tests (fc_resolve/fc_clamp); 46 total green. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:26:22 +00:00

1 2 3 4

165 commits