Commit graph

4154 commits

Author SHA1 Message Date
Viktor Barzin
70442ccdc6 t3-probe: fix aiohttp 3.9 compat (ClientWSTimeout is 3.10+)
Bound connection establishment via session ClientTimeout(total=None,
connect=15) instead — works on 3.9 through current; total must stay None
or the session timeout would kill the long-lived probe WS. Verified by a
local 14s smoke run: cloudflare + internal legs both connect.
2026-06-10 21:26:09 +00:00
Viktor Barzin
a734155fb5 Merge remote-tracking branch 'forgejo/master' into wizard/t3-disconnect-fixes
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-10 21:11:30 +00:00
Viktor Barzin
9b55d53be0 t3: differential drop-attribution probe + devvm metrics
Closes the loop on Viktor's ask to find the t3 disconnect root cause and
definitively rule infra in or out. Server logs alone cannot separate
'client network broke' from 'Cloudflare/tunnel broke' from 't3-serve
stalled' — every cause collapses into the same 20s-watchdog reconnect.

The t3-probe (stacks/t3code) holds three permanent legs that differ only
in path segment: 'cloudflare' (WS via DoH-resolved public DNS -> WAN ->
CF edge -> tunnel -> Traefik -> dispatch), 'internal' (same WS pinned to
the Traefik LB, no Cloudflare), 't3serve' (HTTP straight to the serve
process). Whichever leg drops convicts its segment; all legs clean while
a user drops exonerates infra with data. Dispatch gains an
unauthenticated /probe/ws echo + /probe/healthz (gorilla/websocket,
test-first) behind an auth=none path carve-out, guarded by the
authentik-walloff probe.

Also starts scraping devvm's node_exporter (job 'devvm') — it ran
unscraped, so the box whose memory/IO stalls cause the drops had zero
pressure history. Alerts T3ProbeLegDown + T3ProbeDropBurst; runbook
docs/runbooks/t3-drop-attribution.md.
2026-06-10 21:11:29 +00:00
Viktor Barzin
ecef09ab87 tmux-persist: add single-user restore mode (restore [user])
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
The web-terminal will get a "Restore sessions" button (common ask after an
OOM kills a user's tmux server without a reboot, which the boot-only restore
service doesn't catch). The button needs to restore ONE user's saved sessions
on demand, so teach `restore` an optional <user> argument: with no arg it
restores every terminal user (unchanged — the boot service path), with a
<user> arg it validates the name against /etc/ttyd-user-map and restores only
that user. Reuses the existing restore loop (single source of restore truth).

The terminal-lobby tmux-api will invoke this as root via a validated
tmux-restore-user sudo wrapper. Verified: bad user exits 2 (won't fall back to
restoring everyone), no-arg path unchanged, shellcheck clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 21:08:57 +00:00
Viktor Barzin
b5c6639272 t3-serve@: contain agent memory storms; survive child OOM kills
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Same t3-disconnect root-cause work: a runaway claude agent child grew to
10.8G anon RSS inside t3-serve@wizard's cgroup, swap-thrashed devvm off
its spinning disk (system-wide multi-10s freezes = every t3 client's 20s
watchdog firing = the 'frequent disconnects that self-recover'), then
the global OOM at 2026-06-10 19:56 took the whole unit down for 8.5min
because the default OOMPolicy=stop fails the unit when ANY cgroup child
is OOM-killed. Cap the cgroup (MemoryHigh=12G, MemoryMax=16G), forbid
swap so stalls can't smear into minute-long freezes, and OOMPolicy=continue
so a runaway agent dies alone while the WS server keeps serving.
2026-06-10 21:00:06 +00:00
Viktor Barzin
d5fdc7ffe9 cloudflared: disable in-place autoupdate (--no-autoupdate)
Viktor asked to root-cause the frequent t3 code disconnects and rule
infra in or out. The tunnel pods ran bare 'cloudflared tunnel run':
every Cloudflare release made the binary self-update and exit (code 11),
restarting all 3 pods and severing every WebSocket riding the tunnel —
one of the confirmed infra-side drop causes (pods cycled 2026-06-09
20:55/21:00 and 2026-06-10 02:31). Updates belong to pod image rollouts,
not in-place binary swaps.
2026-06-10 21:00:05 +00:00
Viktor Barzin
ac6f19dd3b tmux-persist: never let an empty snapshot clobber a saved manifest
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
emo's 5 web-terminal tmux sessions were OOM-killed (the server died, no
reboot), and the 5-minute save tick then overwrote his session manifest with
0 bytes — wiping the record that restore needs. Root cause: the save guard
only checked that the tmux socket *file* existed, but an OOM-killed server
leaves a stale /tmp/tmux-<uid>/default behind; list-panes then returns
nothing and that empty capture was installed over the good manifest. Because
the restore service only runs at boot, an OOM (not a reboot) skips restore
entirely, so the clobbered manifest was the only record left — and it was
already gone.

Fix: only overwrite <user>.tsv when the snapshot captured >=1 live session;
otherwise keep the last good manifest (now covers no-server AND
stale-socket/dead-server). Verified by reproducing the 0-byte clobber on the
old script and confirming the new one preserves the manifest, plus a live
save that still captures every active session.

emo's 5 sessions were recovered from their transcripts and are back; this
keeps the next OOM from destroying the manifest again.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 20:38:59 +00:00
Viktor Barzin
9fff77cbea Merge branch 'wizard/budget-rate-limit'
Some checks failed
ci/woodpecker/push/default Pipeline failed
ci/woodpecker/push/build-cli Pipeline was successful
2026-06-10 19:42:19 +00:00
Viktor Barzin
acb847b858 actualbudget: dedicated traefik rate-limit (50/300) for budget ingresses
The Actual web app boots with ~70 near-parallel requests (55
/data/migrations/*.sql + statics, all served cache-control max-age=0 so
every page load re-validates them). The shared rate-limit middleware
(average 10, burst 50) 429s the tail of that storm, so every cold boot
shows 'Server returned an error while checking its status' and every
load stalls in retry backoff — measured up to 5min stalls when two
loads from one IP overlap. Viktor asked to relax the limit after the
anca slow-load investigation (beads code-7zv).

Same pattern as immich: dedicated actualbudget-rate-limit middleware in
the traefik stack, budget-* ingresses opt out of the default via
skip_default_rate_limit + extra_middlewares.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 19:36:42 +00:00
Viktor Barzin
8304ef0f70 Merge origin/master (pfsense SNI-routed internal 443) into forgejo/master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Reconciles the two live infra remotes after the pve-host logging change landed
on forgejo (which was a commit behind origin). Non-destructive merge — keeps both
eae35c51 (pfsense webmail SNI routing) and aac807fb (pve-host Loki shipping).
2026-06-10 19:35:55 +00:00
Viktor Barzin
aac807fb3a pve-host: ship journal to Loki (snoopy command audit + sshd-pve) for emo's root SSH
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Emo's Claude agent was given root SSH to the Proxmox host (`ssh pve`, dedicated
shared-root key emo-pve-agent@devvm) so he can manage the host — e.g. the R730
fan daemon — through his agent. To keep an audit trail of what that agent does,
and to feed the long-pending Wave-1 S1 security rule, the PVE host now ships its
systemd journal to cluster Loki:

- snoopy logs every execve() to journald (identifier=snoopy), enabled via
  /etc/ld.so.preload; config scripts/pve-snoopy.ini.
- promtail v3.5.1 (amd64) ships /var/log/journal to Loki as {job="pve-journal"}
  (full host journal; filter identifier="snoopy" for the command audit), and
  relabels sshd auth to {job="sshd-pve"} — which ACTIVATES S1 (it was PENDING
  only for lack of this shipper). Config/unit: scripts/pve-promtail.{yaml,service}.

S1 won't false-fire on legitimate access: the devvm SNATs through pfSense to
192.168.1.2, which is already in the S1 source-IP allowlist.

Loki is reached via an /etc/hosts pin (10.0.20.203 loki.viktorbarzin.lan);
follow-up noted to register a Technitium CNAME so it auto-tracks LB renumbers.

Host pieces are hand-managed (not Terraform), like fan-control and the rpi-sofia
promtail — these files are the source of truth. Docs updated: security.md
(S1 LIVE) and monitoring.md ("External host: pve").

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 19:31:45 +00:00
Viktor Barzin
eae35c511a pfsense: SNI-routed internal 443 — mail.viktorbarzin.me serves webmail everywhere
Completes the internal port table of the mail front door (10.0.20.1):
443 was squatted by the pfSense webGUI (self-signed cert expired 2022),
so internal webmail and the kuma [External] mail probe hit the firewall
login instead of Roundcube — the last leg of the mail split-brain name.

Design (Viktor): route by what the client asked for. New HAProxy
frontend internal_https_443 (binds 10.0.20.1+10.0.10.1 :443, mode tcp):
SNI present -> Traefik .203 with send-proxy-v2 (trusted, IPv6-bridge
pattern, no health check per the PROXY-probe gotcha); SNI of
pfsense.viktorbarzin.{lan,me} or NO SNI (bare-IP admin access) -> webGUI,
which moved to :8443 (invisible to habits — https://10.0.20.1 still
lands on the login page; :8443 doubles as direct fallback). The
reverse-proxy pfsense ingress now targets :8443 directly.

Declared idempotently in pfsense-haproxy-bootstrap.php; config.xml
backed up on-box (config.xml.bak-2026-06-10-pre-sni443). Verified:
bare IP -> GUI login; pfsense.viktorbarzin.lan -> GUI;
pfsense.viktorbarzin.me -> 302 via ingress; mail.viktorbarzin.me ->
Roundcube with STRICT cert validation; :993 IMAPS untouched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 18:41:07 +00:00
Viktor Barzin
176a65d3d2 plotting-book: TF baseline image follows what CI actually builds
Viktor asked to verify the book-plotting push->build->deploy chain.
The chain itself is healthy, but the Terraform baseline image said
ancamilea/book-plotter:latest while CI (GHA on
PassionProjectsAnca/Plotting-Your-Dream-Book) builds and deploys
viktorbarzin/book-plotter:<sha8> + :latest — a from-scratch apply
would have resurrected a stale March image. Baseline now
viktorbarzin/book-plotter:latest. No live change: the running tag is
CI-owned via ignore_changes, plan confirms the image attr is ignored.

[ci skip] deliberately: plan shows UNRELATED pre-existing drift on
this stack (live ns labels managed-by=vault-user-onboarding +
resource-governance/custom-quota=true would be stripped; deployment
keel.sh/policy=patch annotations removed) — auto-applying that needs
its own reviewed pass.
2026-06-10 18:37:14 +00:00
Viktor Barzin
5f7c2964ac workstation: session-launch freshen follows the checked-out branch (not just master)
Viktor asked to log Anca into her GitHub account so she can develop on
the devvm and deploy her apps through the existing CI/CD. Her GitHub
repos (Plotting-Your-Dream-Book, travel, My-Wardrobe — now cloned into
her ~/code workspace) default to main, and the launcher freshen only
fast-forwarded master, silently skipping them. ff the current branch's
upstream instead — same safety gates (on a branch, clean tree, upstream
configured, ff-only). Single-layout infra clones behave identically.
[ci skip]
2026-06-10 18:20:59 +00:00
Viktor Barzin
de1d8b7bf3 technitium: add Brevo DKIM selector CNAMEs to internal zone [ci skip]
The roundtrip probe kept failing after the SPF/MX fix: rspamd's actual
junk-score driver was R_DKIM_PERMFAIL(+4.5) on selector brevo2 — Brevo
signs with brevo1/brevo2._domainkey, which are CNAMEs to
b{1,2}.viktorbarzin-me.dkim.brevo.com in public DNS and were absent
from the internal zone (the earlier existence check used ANY queries,
which Cloudflare refuses per RFC 8482 — false negative). The DKIM
permfail also cascaded into DMARC_POLICY_SOFTFAIL(+1.5), totalling the
6.09/6.0 junk threshold; sieve filed probes into \Junk where the INBOX
poll never finds them.

ingress-dns-sync now maintains both selector CNAMEs. Ops notes: rspamd
caches DNS (restart to flush after zone fixes); CoreDNS denial cache
holds NXDOMAINs up to 300s. Verified: roundtrip SUCCESS in 20.5s.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 18:07:38 +00:00
Viktor Barzin
2825cb1703 workstation: per-user code_layout — workspace puts project repos under ~/code (ancamilea + tripit)
Viktor asked to restructure Anca's setup: her ~/code WAS the infra clone
itself; he wants ~/code to be the directory where all her project repos
(tripit etc.) live side by side, with infra moved to a subdirectory.

- roster.yaml gains per-user 'code_layout: single|workspace' + 'repos',
  validated + derived by roster_engine.py (12 new tests, 40 total).
- t3-provision-users reconcile: auto-migrates a single-layout ~/code to
  ~/code/infra (running processes follow the moved inode), hoists nested
  project clones to the workspace root, clones roster repos from Forgejo
  AS the user (their PAT makes private repos work), and wires the
  documented forgejo remote + forgejo/master upstream into clones that
  predate that contract.
- Fixed a latent TSV bug: empty jq @tsv fields collapse under tab-IFS
  read, shifting later fields left (groups was only safe by being the
  last field) — emit '-' sentinels instead.
- start-claude.sh session freshen is layout-aware (freshens each repo
  under ~/code for workspace users).
- managed claudeMd + AGENTS.md non-admin recipe + multi-tenancy.md
  updated in the same change.

Applied live: ancamilea = workspace (infra at ~/code/infra, her existing
tripit clone hoisted to ~/code/tripit, master upstream switched to
forgejo/master); emo stays single layout, untouched. [ci skip]
2026-06-10 18:05:31 +00:00
Viktor Barzin
3b6a5c6737 workstation: worktree-first feature work for all agents [ci skip]
Viktor asked that every feature task be developed in its own git worktree
and merged into master when done, enabling multiple agents to work the
same project concurrently. Encode the org rule in the managed claudeMd
(self-deploys to /etc via the hourly reconcile), add the worktree-first
paragraph to the AGENTS.md non-admin landing recipe, and gitignore
.worktrees/ so per-feature worktrees can live at the repo root. Full
lifecycle: ~/.claude/rules/execution.md §3.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:49:43 +00:00
Viktor Barzin
daddafd279 docs: superset rule for the internal viktorbarzin.me zone (mail-auth records) [ci skip]
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:47:31 +00:00
Viktor Barzin
00bc1e052d technitium: mirror mail-auth records into internal zone; fix redfish check [ci skip]
Two fixes from the post-DNS-internalization health sweep:

1. The internal viktorbarzin.me zone served only ingress A/CNAME records.
   Since the mailserver pods now resolve the domain through it (CoreDNS
   viktorbarzin.me:53 -> Technitium, 59a531b8), rspamd's SPF checks on
   inbound @viktorbarzin.me mail saw SPF=none and quarantined it — the
   Brevo email-roundtrip probe failed from the 16:20 run onward
   (EmailRoundtripFailing/Stale). The ingress-dns-sync CronJob now also
   maintains the static mail-auth records (SPF, brevo-code TXT, MX;
   DMARC + DKIM were already present), idempotently. Principle: the
   internal zone must be a SUPERSET of the public zone for every record
   type internal clients consume. Verified in-pod: all four types
   resolve; roundtrip re-probe green.

2. cluster_healthcheck #30 queried instant `up`, which goes stale for
   ~5 of every 10 minutes on the deliberate 10m redfish-idrac remnant
   job -> intermittent false "redfish-idrac=missing". Now uses
   last_over_time(up[15m]) — same answers for fast jobs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:46:37 +00:00
Viktor Barzin
e7fbf986fb workstation: rename tmux persistence out of the t3 namespace [ci skip]
Viktor's correction: this feature is about the tmux web-terminal
sessions, not t3 — t3 auto-saves its own threads (~/.t3 state +
daily t3-backup-state). Renamed t3-tmux-sessions -> tmux-persist
(units tmux-persist-save.timer / tmux-persist-restore.service, state
/var/lib/tmux-persist), header rescoped to say exactly that. Same
mechanism, correct taxonomy. Old units removed, state migrated,
re-verified live (5 emo + 3 wizard sessions snapshotted).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:42:52 +00:00
Viktor Barzin
2e4f48f3fc workstation: tmux sessions survive devvm reboots (save timer + boot restore)
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Viktor: emo's open web-terminal sessions must persist across reboots.
Claude conversations were already durable on disk; the volatile part
was the tmux wiring (which named session runs which conversation).

t3-tmux-sessions save (5-min timer) snapshots every roster user's
sessions to /var/lib/t3-tmux-state/<user>.tsv — conversation uuid
taken from argv --resume (self-sustaining once restored) or the
newest transcript in the cwd-slug project dir created after process
start (fresh launcher sessions; claude does NOT hold its transcript
fd open, so fd-sniffing was a dead end). t3-tmux-sessions restore
(boot oneshot, also safe after partial loss) recreates missing
sessions with claude --resume <uuid>. Reconciler self-heals both
units' enablement.

Verified live: emo's 5 sessions snapshotted with correct uuids;
killed R730-cooling -> restore brought it back resuming the same
conversation (context meter identical); other sessions untouched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:39:32 +00:00
Viktor Barzin
59a531b8e0 coredns: pods get internal split-horizon answers for viktorbarzin.me [ci skip]
Forward the viktorbarzin.me:53 pod block to the Technitium ClusterIP
(10.96.0.53, same as the .lan block) instead of 8.8.8.8/1.1.1.1. Pods
become ordinary internal clients (CNAME -> apex -> live Traefik LB;
mail -> 10.0.20.1), fixing the 27 non-proxied [External] uptime-kuma
monitors that rode the TP-Link NAT loopback (hard-down since 06-09;
loopback refuses flows whose source equals the reflection target, which
all pfSense-SNAT'd cluster traffic does).

Enabled by re-testing a stale premise: on k8s 1.34 pods DO reach the
ETP=Local Traefik LB IP (kube-proxy short-circuits in-cluster traffic
to LB IPs; verified from pods on three non-Traefik nodes) — re-verify
after major k8s upgrades; canary = [External] fleet going red. The
NAT-layer alternatives (pfSense rdr, SNAT-drop) were rejected: both
fight return-path asymmetry and deepen TP-Link dependency.

Verified in-pod: immich -> .203 + HTTPS 200, mail -> 10.0.20.1,
forgejo -> Traefik ClusterIP (pin kept for Technitium-outage
resilience). Proxied [External] monitors now test the internal path —
true edge fidelity moves to the external vantage (ha-london, next fix).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 16:21:34 +00:00
Viktor Barzin
35c89fa90c workstation: managed Claude config self-deploys from the repo [ci skip]
Viktor's claudeMd edits must keep reaching every user now that emo is
out of the shared tree. Two reconciler additions:
- sync_managed_config: installs scripts/workstation/managed-settings.json
  to /etc/claude-code whenever the repo copy changes — editing the
  org claudeMd is now edit + commit, no manual install step
- refresh_codex_mirror: regenerates each user's ~/.codex/AGENTS.md
  (static mirror of the claudeMd; header-guarded so user-customized
  files are never clobbered)

Verified live: corrupted emo's mirror -> reconcile restored it;
wizard's stale mirror refreshed; in-sync managed config no-ops.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 16:03:24 +00:00
Viktor Barzin
8cfd0e5e5c Merge forgejo/master: reconcile diverged lineages [ci skip]
Local checkout carried the 2026-06-10 DNS/registry architecture series
(pfSense forward-zone, CoreDNS viktorbarzin.me:53 carve-out, nodes
stock) + vzdump/nfs-mirror/workstation-rebuild commits that never
reached the canonical remote, while forgejo master received the
emo-access series via isolated worktrees. Viktor asked to merge.

Conflict resolutions (newest iteration wins in each file):
- stacks/forgejo/cleanup.tf: LOCAL — dry_run=true (2026-06-10 revert
  after live retention orphaned OCI indexes; remote had 06-09 enable)
- .claude/CLAUDE.md, docs/architecture/backup-dr.md: LOCAL — final
  registry/DNS architecture + implemented vzdump alerts
- scripts/workstation/setup-devvm.sh: LOCAL — pinned-version,
  reproducible-rebuild refactor (kubelogin pin, restructured staging)
- scripts/workstation/managed-settings.json: FORGEJO — the
  allow-then-audit claudeMd (matches /etc deployment byte-for-byte)
- scripts/t3-provision-users.sh: FORGEJO comment; refresh_locked_clone
  intact

[ci skip]: all stack changes in the local lineage were applied live
this morning — CI would re-walk 100+ stacks via the modules/ fallback
for zero state change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:21:50 +00:00
Viktor Barzin
a34f9ff3b8 docs: infra Woodpecker repo-82 ops — in-cluster webhook, secret parity, empty-commit gotcha [ci skip]
Emo's first direct pushes surfaced three latent CI issues, all fixed
out-of-band today and recorded here: webhook deliveries to
ci.viktorbarzin.me timing out on the public-IP hairpin (hook now
targets the in-cluster woodpecker-server service), repo 82 registered
without the repo-scoped secret set (cloned from repo 1 in the DB), and
empty commits compiling every workflow so missing secrets hard-error.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:09:17 +00:00
63161ef3a5 test: final audit-pipeline verification
Some checks failed
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/k8s-portal Pipeline failed
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/pve-nfs-exports-sync Pipeline was successful
ci/woodpecker/push/registry-config-sync Pipeline was successful
ci/woodpecker/push/build-ci-image Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Repo-82 Woodpecker secrets were missing (repo-1 set cloned over) and
the webhook now targets the in-cluster service. This push should run
the full pipeline: Slack audit ping + no-op apply.
2026-06-10 15:07:15 +00:00
619b7608fa test: verify audit pipeline fires on emo push
Second verification: the Forgejo->Woodpecker webhook was timing out on
the public-IP hairpin (first test push fired no pipeline), so it now
targets the in-cluster Woodpecker service. This push should produce a
pipeline with the notify-nonadmin-push Slack step.
2026-06-10 15:03:48 +00:00
0f45585b53 test: verify emo direct master push (allow-then-audit)
Viktor granted emo direct push to master on 2026-06-10 — any change
allowed, tracked via commit messages + the Slack audit feed. This
empty commit verifies the whitelist and exercises the new
notify-nonadmin-push CI step end-to-end.
2026-06-10 14:54:04 +00:00
Viktor Barzin
a49d1eadf6 workstation: emo direct master push — allow-then-audit [ci skip]
Viktor: emo may make any change; what matters is tracking what changed
and why. ebarzin added to master push+merge whitelists (force-push
stays disabled — append-only history). Tracking enforced three ways:
- agent instructions (managed claudeMd + AGENTS.md): commit body MUST
  carry the user's plain-language intent; commits land on master
  directly; [ci skip] forbidden for non-admins
- new notify-nonadmin-push step in .woodpecker/default.yml: Slack
  message for every non-admin master push (admin pushes silent)
- PR flow remains the fallback for non-whitelisted users

Accepted consequence (informed): emo's pushes auto-apply changed
stacks via CI. Offboard runbook gains whitelist-removal step.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:53:43 +00:00
Viktor Barzin
6d8773573c workstation: agent-driven contribute flow for non-technical users [ci skip]
emo can't use git — his agent must do all VCS mechanics invisibly.
Managed claudeMd (every session, top precedence) now instructs agents:
commit -> push <os-user>/<topic> branch -> open PR via Forgejo API
(user's PAT from ~/.git-credentials) -> back to clean master -> tell
the user in plain words it's submitted for review. AGENTS.md carries
the full recipe with the curl call.

Verified live as emo: PR #1 opened (HTTP 201, write:repository scope
suffices) and closed via his PAT. Deployed to
/etc/claude-code/managed-settings.json; codex AGENTS.md mirrors for
emo + ancamilea regenerated from the new claudeMd.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 10:12:26 +00:00
Viktor Barzin
2e5af5dc0e workstation: keep non-admin infra clones fresh (hourly + at launch) [ci skip]
Non-admins (emo) need current master without manual pulls. Two layers:
- t3-provision-users reconcile gains refresh_locked_clone: fetch all
  remotes + ff-only master, guarded (on master, clean tree, upstream
  set); dirty/diverged clones are left alone with a WARN.
- start-claude.sh freshens ~/code at session launch, 15s-capped so an
  offline remote never delays the session.

Verified live on emo's clone: stale clone ff'd to tip by the
reconciler; launcher snippet ff's when clean and refuses while a
dirty file exists. Deployed to /usr/local/bin/t3-provision-users,
/etc/skel/start-claude.sh, and emo's launcher.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 09:41:38 +00:00
Viktor Barzin
5d9417fbaa workstation: emo contribute access + Phase-5 cutover done; gate master (push=apply) [ci skip]
ADR-0004's premise was wrong: pushing master fires the Woodpecker apply
pipeline (require_approval=forks only), so master pushes ARE deploys.
Added Forgejo branch protection on master (push/merge whitelist=viktor,
deploy keys allowed); non-admins contribute via branches + PRs.

emo (ebarzin): write collaborator on viktor/infra, PAT in
~/.git-credentials, forgejo remote + upstream in his locked clone.
Phase-5 finished: code-shared removed; ~/.claude symlinks kept (they
ARE the skel shared-base mechanism — plan step 4c obsolete).
Offboard runbook: revoke PAT + collaborator + group steps added.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 09:30:41 +00:00
Viktor Barzin
a1b7b0ca53 forgejo retention: revert to DRY_RUN — first live run orphaned OCI indexes [ci skip]
The keep-set (newest 10 versions + latest + *cache* tags) treats
multi-arch/attestation index CHILDREN — separate untagged sha256
versions — as deletable: for images not rebuilt recently they sort
outside the newest-10 window and were pruned while their kept parent
index survived. kms-website :latest and :dfc83fb children 404'd
(RegistryManifestIntegrityFailure, caught by forgejo-integrity-probe
within hours; deployed tag a794d1a unaffected).

Healed: :latest re-pointed at the intact a794d1a index (also the
newest commit), corrupt :dfc83fb version deleted, probe re-run clean
(0 failures / 22 repos / 63 tags / 59 indexes). DRY_RUN=true applied
live. Re-enable only with a container-aware keep-set — options in the
post-mortem.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 09:22:47 +00:00
Viktor Barzin
e49c91e60c monitoring: VzdumpBackup{Stale,NeverRun,Failing} alerts for the new VM-image backup
vzdump-vms pushes vzdump_last_{run,success}_timestamp + vzdump_last_status to
Pushgateway job vzdump-backup, but nothing alerted on them — a stopped/failing VM
backup would be silent (exactly how the nfs-mirror reaping went unnoticed until I
re-verified). Add the trio to the 3-2-1 group in prometheus_chart_values.tpl,
mirroring the LVM/pfSense/nfs-mirror alerts. Stale = >~50h since last success.

NOT [ci]-applied: this is a Terraform stack change — arms on the next
`scripts/tg apply` of the monitoring stack (metrics already flow, so it arms
immediately once applied). Admin-gated apply per org policy.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 09:10:46 +00:00
Viktor Barzin
05f928931f workstation: packages.txt — add provisioner build deps + uncaptured core tools
setup-devvm.sh now needs golang-go (builds t3-dispatch in section 9) and uses unzip
(kubelogin extraction); neither was in the manifest, so a fresh box would skip the
t3-dispatch build. Also add build-essential (cgo / npm native modules) + core tools
that were manually-installed but uncaptured (rsync, wget, tree, shellcheck). Noted
gh as non-apt (GitHub's own repo). All verified to resolve in apt.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 09:08:53 +00:00
Viktor Barzin
312c418a9a workstation: setup-devvm.sh installs the systemd service layer (reproducible rebuild)
The t3 system units (t3-serve@, t3-autoupdate, t3-backup-state, t3-provision-users,
t3-dispatch) + the t3-dispatch Go binary + t3-mint + the sudoers grant were all
hand-scp'd and would NOT survive a fresh devvm. setup-devvm.sh now installs + enables
them: build-if-absent for the Go binary, visudo-validated sudoers (a malformed
/etc/sudoers.d file breaks all sudo), timers self-heal, t3-dispatch system account
created if absent. t3-serve@ stays a per-user template enabled by the provisioner;
the ttyd terminal-lobby chain ships from its own repo (viktor/terminal-lobby).

Verified: shellcheck clean, go build compiles, visudo parses the sudoers, units parse.
NOT run live (would re-assert apt/npm on the shared host) — exercised on next rebuild.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 09:07:20 +00:00
Viktor Barzin
d9ea7812f5 nfs-mirror: exclude /vzdump/ — it was reaping the new VM-image backups nightly
nfs-mirror does `rsync -rlt --delete /srv/nfs/ -> /mnt/backup/`; any /mnt/backup
dir with no /srv/nfs counterpart is an orphan and gets --delete'd. vzdump-vms
(added yesterday) writes /mnt/backup/vzdump/, which wasn't excluded — so the
02:00 nfs-mirror run silently deleted both successful 40G devvm images
(verified: dir gone, 40G freed, despite status=0 success logs). Add
--exclude='/vzdump/' alongside the existing pvc-data/pfsense/pve-config/
sqlite-backup excludes that exist for exactly this reason. TDD-proven with an
isolated rsync --delete -n -v. backup-dr.md notes the dependency.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 09:04:57 +00:00
Viktor Barzin
2b8c0def30 dns: pfSense forward-zone for viktorbarzin.me, nodes fully stock [ci skip]
Round 3 of the forgejo-pull hairpin fix (per Viktor: no per-node
customization — split-brain lives in the DNS infra):

- pfSense Unbound domain override viktorbarzin.me -> Technitium
  10.0.20.201 (applied via php write_config, backup on-box). Every
  Unbound client on every VLAN now gets the internal split-horizon
  answers (live Traefik IP via apex CNAME) with zero per-host config.
- CoreDNS carve-out (TF, applied): dedicated viktorbarzin.me:53 block —
  forgejo pinned to Traefik ClusterIP via data source (pods cannot reach
  the ETP=Local LB IP pfSense now returns), all other .me names kept on
  public resolvers (pods' pre-existing behavior). Replaces the .:53
  forgejo rewrite.
- Removed the same-day resolved routing-domain drop-ins from all 7 nodes;
  node5/6 link DNS repointed Technitium -> pfSense (netplan + qm 205/206)
  for fleet parity; cloud-init no longer writes any DNS drop-ins.
- Docs: dns.md, pfsense-unbound runbook (override + rollback), registry
  bullet, post-mortem final-architecture addendum.

Verified: nodes resolve forgejo -> .203 via pfSense, crictl pull OK,
pods resolve forgejo -> ClusterIP / others -> public, mail record works,
.lan zone unaffected.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 08:32:34 +00:00
Viktor Barzin
1ee1bf0817 forgejo pulls: route *.viktorbarzin.me to Technitium, drop /etc/hosts pins [ci skip]
Supersedes this morning's per-node /etc/hosts pin (no hardcoded service
IPs on nodes, per Viktor). Technitium's split-horizon zone already
resolves forgejo.viktorbarzin.me -> CNAME apex -> live Traefik LB IP
(ingress-dns-sync auto-CNAMEs every ingress host; apex drift probe
alerts) -- the nodes just never queried it. Rolled the devvm's
systemd-resolved routing-domain pattern (~viktorbarzin.me ->
10.0.20.201) to all 7 nodes, removed the pins, verified getent +
crictl pull via pure DNS.

Also demoted node5/6's cloud-init global-dns.conf (DNS=8.8.8.8 1.1.1.1)
to FallbackDNS-only: public servers in the global set race the routing
domain. Its justification ("Technitium NXDOMAINs forgejo") was obsolete
-- exactly the stale comment that pointed new nodes at the hairpin.

hosts.toml mirror kept but documented as vestigial (Traefik 404s
bare-IP requests; registry auth realm is an absolute URL).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 07:56:31 +00:00
Viktor Barzin
b6976ce014 forgejo pulls: pin registry name to internal Traefik in node /etc/hosts [ci skip]
tuya-bridge was down 7.5h (ImagePullBackOff on k8s-node3): fresh kubelet
pulls of forgejo.viktorbarzin.me images depended on the intermittently
broken public-IP hairpin. The containerd hosts.toml mirror cannot keep
pulls internal on its own — Traefik 404s its bare-IP requests (no
Host/SNI match) and the registry Bearer realm is an absolute public URL
fetched outside the mirror. Third incident of this class (buildkit
06-04, tripit/devvm 06-09).

Fix: /etc/hosts pin 10.0.20.203 forgejo.viktorbarzin.me on every node —
covers resolve + token + blob legs with correct SNI and valid cert.
Applied live to all 7 nodes; persisted in the cloud-init bootstrap and
the existing-node rollout script. Docs updated (registry bullet, dns.md
hairpin scope + stale .200 literals, runbook) + post-mortem.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 07:15:24 +00:00
Viktor Barzin
eb8695743b workstation: fix setup-devvm.sh provisioner correctness (claude detect, kubelogin pin, codex auth, t3-serve dir)
- claude-code: detect via `npm ls -g` not `command -v claude` — the admin's
  personal ~/.local/bin/claude shadowed the PATH check, so the system-wide
  install never ran (/usr/lib/node_modules/@anthropic-ai empty, no /usr/bin/claude;
  fresh non-admins had no claude). Found during the devvm reproducibility audit.
- kubelogin: pin v1.36.2 instead of releases/latest/download, so two fresh boxes
  built weeks apart are byte-identical.
- /etc/t3-serve: mkdir before the token writes (install -m doesn't create the
  parent — section 8 would fail on a fresh box).
- codex shared auth: stage /opt/codex-shared/auth.json from Vault
  secret/workstation.codex_shared_auth_json (key already existed but nothing
  consumed it — was a manual step lost on rebuild), mirroring the Claude token.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:54 +00:00
Viktor Barzin
8886ac7763 backup: fix vzdump-vms exit code — EXIT-trap && short-circuit falsely failed OK runs
First live run produced a valid 40G dump and logged status=0, but the service
exited 1/FAILURE: cleanup() used `[ -n "$KILLED" ] && push_metrics 2 0`, and a
bash EXIT trap whose LAST command returns non-zero overrides the script's
`exit 0`. With KILLED empty the && short-circuits -> returns 1 -> a successful
backup is marked failed (would trip a vzdump staleness/failure alert). Switch to
daily-backup's `if…fi` idiom (returns 0 when not killed). Bug reproduced + fix
verified locally; redeployed to PVE + reset-failed.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:54 +00:00
Viktor Barzin
7330cb6a0b backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap
The hand-managed Linux VMs (not in Terraform) were never imaged: the
PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost
devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has
no remote).

vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of
VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the
monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent
enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd.
Pushgateway job vzdump-backup.

Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image
layer + protection matrix), infra CLAUDE.md, AGENTS.md.

[ci skip]

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:54 +00:00
Viktor Barzin
3e7093947d t3: bump pin 0.0.24 -> 0.0.26 (fable-5) [ci skip]
Completes the 0.0.26 adoption prepared in fcb84ce0 (version-agnostic
dispatch browser-session/bootstrap fallback + Gate-2 real pairing
health-check + per-user state.sqlite backup). 0.0.26 verified
end-to-end on the devvm: emo + ancamilea auto-pair via t3-dispatch
(302 + Set-Cookie t3_session) after migrating state.sqlite 30->32;
pre-cutover backups in /var/backups/t3-state. Brings claude-fable-5
into the t3 model picker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
dacd9d2d8a t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip]
Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a
future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):

- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
  /api/auth/bootstrap on 404 — one binary pairs across both versions and any
  rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
  before, green after). Built, deployed, verified live on 0.0.24 (all three
  users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie
  handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad
  build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
  state.sqlite (was the only copy, unbacked) -> the one-way forward schema
  migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
baac46415f t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip]
The t3-autoupdate timer (re-enabled by the provisioner's step 5b with
`--now`, which fires the missed daily job immediately on a Persistent
timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema
migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions
role->scopes, +proof_key_thumbprint) AND changed the bootstrap API,
breaking t3-mint/pairing for ALL devvm users (pair prompt, no session).

- t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a
  nightly tracker -- re-asserts the pin (a no-op when correct).
- t3-provision-users.sh step 5b: drop `--now` (it triggered the
  immediate missed-job run that pulled the bad build).
- setup-devvm.sh: install pinned t3@0.0.24 at machine setup.
- unit Descriptions + service-catalog reflect the pin.
- post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md.

Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled
the (now-pinned) enforcer, reset the 2 new users' disposable DBs,
surgically reverted wizard's auth tables to level-30 (96 threads + live
session preserved). All users verified 302 + t3_session.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
41c11216da t3-dispatch: re-pair on present-but-invalid t3_session cookie
The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09
auth-schema rollback wiped all server-side sessions, browsers kept dead
30-day t3_session cookies; the dispatcher proxied them straight through
and t3 rendered its pair page ("all users must pair again").

Now a present cookie on a top-level document navigation is validated via
the instance's /api/auth/session and re-paired on authenticated:false.
Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html)
so XHR/asset/WebSocket sub-requests are never answered with a 302; fails
open (proxy through) on any validation error. Unit + handler tests added.

[ci skip]

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
e0ab621cb2 workstation: fix new-user .env clobber — env_set preserves CLAUDE_CODE_OAUTH_TOKEN
The port-write used '>' (overwrite), wiping the token injected earlier in the same run for a NEW user (existing users like anca survived only because their .env already had the T3_PORT line). New env_set() does update-or-append per key, preserving others. Verified end-to-end: throwaway t3probe provisioned from scratch -> .env has both T3_PORT + CLAUDE_CODE_OAUTH_TOKEN -> claude -p AUTHOK. So all new non-admins now authenticate automatically. NOT pushed (shared-tree divergence hold).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
39e35ca8c9 workstation: share admin Claude subscription with non-admins (CLAUDE_CODE_OAUTH_TOKEN)
Non-admins without their own ~/.claude login get the shared long-lived sk-ant-oat01 token injected into their t3-serve env, so their agent authenticates against the admin's subscription. setup-devvm.sh stages it from Vault secret/workstation.claude_oauth_token (root-readable); the provisioner's install_user_claude_token injects per-user, if-absent (never clobbers emo's own login). Live-fixed anca (verified AUTHOK); this codifies it for reproducibility + future users. NOT pushed (shared-tree divergence hold).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00
Viktor Barzin
1edccedb1f workstation: v2 membership implementation plan [ci skip]
8 tasks: engine derive_os_user + roster_from_members (TDD); read-only Authentik token (TF); setup-devvm.sh stages it; provisioner sources T3 Users members from the Authentik API (replaces roster.yaml); Authentik-managed membership + legacy os_user attributes; retire roster.yaml; e2e add/remove smoke. Pairs with the 2026-06-09 design doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:41:53 +00:00