infra

Author	SHA1	Message	Date
Viktor Barzin	acb847b858	actualbudget: dedicated traefik rate-limit (50/300) for budget ingresses The Actual web app boots with ~70 near-parallel requests (55 /data/migrations/.sql + statics, all served cache-control max-age=0 so every page load re-validates them). The shared rate-limit middleware (average 10, burst 50) 429s the tail of that storm, so every cold boot shows 'Server returned an error while checking its status' and every load stalls in retry backoff — measured up to 5min stalls when two loads from one IP overlap. Viktor asked to relax the limit after the anca slow-load investigation (beads code-7zv). Same pattern as immich: dedicated actualbudget-rate-limit middleware in the traefik stack, budget- ingresses opt out of the default via skip_default_rate_limit + extra_middlewares. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 19:36:42 +00:00
Viktor Barzin	aac807fb3a	pve-host: ship journal to Loki (snoopy command audit + sshd-pve) for emo's root SSH All checks were successful ci/woodpecker/push/default Pipeline was successful Details ci/woodpecker/push/build-cli Pipeline was successful Details Emo's Claude agent was given root SSH to the Proxmox host (`ssh pve`, dedicated shared-root key emo-pve-agent@devvm) so he can manage the host — e.g. the R730 fan daemon — through his agent. To keep an audit trail of what that agent does, and to feed the long-pending Wave-1 S1 security rule, the PVE host now ships its systemd journal to cluster Loki: - snoopy logs every execve() to journald (identifier=snoopy), enabled via /etc/ld.so.preload; config scripts/pve-snoopy.ini. - promtail v3.5.1 (amd64) ships /var/log/journal to Loki as {job="pve-journal"} (full host journal; filter identifier="snoopy" for the command audit), and relabels sshd auth to {job="sshd-pve"} — which ACTIVATES S1 (it was PENDING only for lack of this shipper). Config/unit: scripts/pve-promtail.{yaml,service}. S1 won't false-fire on legitimate access: the devvm SNATs through pfSense to 192.168.1.2, which is already in the S1 source-IP allowlist. Loki is reached via an /etc/hosts pin (10.0.20.203 loki.viktorbarzin.lan); follow-up noted to register a Technitium CNAME so it auto-tracks LB renumbers. Host pieces are hand-managed (not Terraform), like fan-control and the rpi-sofia promtail — these files are the source of truth. Docs updated: security.md (S1 LIVE) and monitoring.md ("External host: pve"). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 19:31:45 +00:00
Viktor Barzin	2825cb1703	workstation: per-user code_layout — workspace puts project repos under ~/code (ancamilea + tripit) Viktor asked to restructure Anca's setup: her ~/code WAS the infra clone itself; he wants ~/code to be the directory where all her project repos (tripit etc.) live side by side, with infra moved to a subdirectory. - roster.yaml gains per-user 'code_layout: single\|workspace' + 'repos', validated + derived by roster_engine.py (12 new tests, 40 total). - t3-provision-users reconcile: auto-migrates a single-layout ~/code to ~/code/infra (running processes follow the moved inode), hoists nested project clones to the workspace root, clones roster repos from Forgejo AS the user (their PAT makes private repos work), and wires the documented forgejo remote + forgejo/master upstream into clones that predate that contract. - Fixed a latent TSV bug: empty jq @tsv fields collapse under tab-IFS read, shifting later fields left (groups was only safe by being the last field) — emit '-' sentinels instead. - start-claude.sh session freshen is layout-aware (freshens each repo under ~/code for workspace users). - managed claudeMd + AGENTS.md non-admin recipe + multi-tenancy.md updated in the same change. Applied live: ancamilea = workspace (infra at ~/code/infra, her existing tripit clone hoisted to ~/code/tripit, master upstream switched to forgejo/master); emo stays single layout, untouched. [ci skip]	2026-06-10 18:05:31 +00:00
Viktor Barzin	daddafd279	docs: superset rule for the internal viktorbarzin.me zone (mail-auth records) [ci skip] Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:47:31 +00:00
Viktor Barzin	e7fbf986fb	workstation: rename tmux persistence out of the t3 namespace [ci skip] Viktor's correction: this feature is about the tmux web-terminal sessions, not t3 — t3 auto-saves its own threads (~/.t3 state + daily t3-backup-state). Renamed t3-tmux-sessions -> tmux-persist (units tmux-persist-save.timer / tmux-persist-restore.service, state /var/lib/tmux-persist), header rescoped to say exactly that. Same mechanism, correct taxonomy. Old units removed, state migrated, re-verified live (5 emo + 3 wizard sessions snapshotted). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:42:52 +00:00
Viktor Barzin	2e4f48f3fc	workstation: tmux sessions survive devvm reboots (save timer + boot restore) All checks were successful ci/woodpecker/push/postmortem-todos Pipeline was successful Details ci/woodpecker/push/default Pipeline was successful Details ci/woodpecker/push/build-cli Pipeline was successful Details Viktor: emo's open web-terminal sessions must persist across reboots. Claude conversations were already durable on disk; the volatile part was the tmux wiring (which named session runs which conversation). t3-tmux-sessions save (5-min timer) snapshots every roster user's sessions to /var/lib/t3-tmux-state/<user>.tsv — conversation uuid taken from argv --resume (self-sustaining once restored) or the newest transcript in the cwd-slug project dir created after process start (fresh launcher sessions; claude does NOT hold its transcript fd open, so fd-sniffing was a dead end). t3-tmux-sessions restore (boot oneshot, also safe after partial loss) recreates missing sessions with claude --resume <uuid>. Reconciler self-heals both units' enablement. Verified live: emo's 5 sessions snapshotted with correct uuids; killed R730-cooling -> restore brought it back resuming the same conversation (context meter identical); other sessions untouched. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:39:32 +00:00
Viktor Barzin	59a531b8e0	coredns: pods get internal split-horizon answers for viktorbarzin.me [ci skip] Forward the viktorbarzin.me:53 pod block to the Technitium ClusterIP (10.96.0.53, same as the .lan block) instead of 8.8.8.8/1.1.1.1. Pods become ordinary internal clients (CNAME -> apex -> live Traefik LB; mail -> 10.0.20.1), fixing the 27 non-proxied [External] uptime-kuma monitors that rode the TP-Link NAT loopback (hard-down since 06-09; loopback refuses flows whose source equals the reflection target, which all pfSense-SNAT'd cluster traffic does). Enabled by re-testing a stale premise: on k8s 1.34 pods DO reach the ETP=Local Traefik LB IP (kube-proxy short-circuits in-cluster traffic to LB IPs; verified from pods on three non-Traefik nodes) — re-verify after major k8s upgrades; canary = [External] fleet going red. The NAT-layer alternatives (pfSense rdr, SNAT-drop) were rejected: both fight return-path asymmetry and deepen TP-Link dependency. Verified in-pod: immich -> .203 + HTTPS 200, mail -> 10.0.20.1, forgejo -> Traefik ClusterIP (pin kept for Technitium-outage resilience). Proxied [External] monitors now test the internal path — true edge fidelity moves to the external vantage (ha-london, next fix). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 16:21:34 +00:00
Viktor Barzin	35c89fa90c	workstation: managed Claude config self-deploys from the repo [ci skip] Viktor's claudeMd edits must keep reaching every user now that emo is out of the shared tree. Two reconciler additions: - sync_managed_config: installs scripts/workstation/managed-settings.json to /etc/claude-code whenever the repo copy changes — editing the org claudeMd is now edit + commit, no manual install step - refresh_codex_mirror: regenerates each user's ~/.codex/AGENTS.md (static mirror of the claudeMd; header-guarded so user-customized files are never clobbered) Verified live: corrupted emo's mirror -> reconcile restored it; wizard's stale mirror refreshed; in-sync managed config no-ops. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 16:03:24 +00:00
Viktor Barzin	8cfd0e5e5c	Merge forgejo/master: reconcile diverged lineages [ci skip] Local checkout carried the 2026-06-10 DNS/registry architecture series (pfSense forward-zone, CoreDNS viktorbarzin.me:53 carve-out, nodes stock) + vzdump/nfs-mirror/workstation-rebuild commits that never reached the canonical remote, while forgejo master received the emo-access series via isolated worktrees. Viktor asked to merge. Conflict resolutions (newest iteration wins in each file): - stacks/forgejo/cleanup.tf: LOCAL — dry_run=true (2026-06-10 revert after live retention orphaned OCI indexes; remote had 06-09 enable) - .claude/CLAUDE.md, docs/architecture/backup-dr.md: LOCAL — final registry/DNS architecture + implemented vzdump alerts - scripts/workstation/setup-devvm.sh: LOCAL — pinned-version, reproducible-rebuild refactor (kubelogin pin, restructured staging) - scripts/workstation/managed-settings.json: FORGEJO — the allow-then-audit claudeMd (matches /etc deployment byte-for-byte) - scripts/t3-provision-users.sh: FORGEJO comment; refresh_locked_clone intact [ci skip]: all stack changes in the local lineage were applied live this morning — CI would re-walk 100+ stacks via the modules/ fallback for zero state change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:21:50 +00:00
Viktor Barzin	a34f9ff3b8	docs: infra Woodpecker repo-82 ops — in-cluster webhook, secret parity, empty-commit gotcha [ci skip] Emo's first direct pushes surfaced three latent CI issues, all fixed out-of-band today and recorded here: webhook deliveries to ci.viktorbarzin.me timing out on the public-IP hairpin (hook now targets the in-cluster woodpecker-server service), repo 82 registered without the repo-scoped secret set (cloned from repo 1 in the DB), and empty commits compiling every workflow so missing secrets hard-error. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:09:17 +00:00
Viktor Barzin	a49d1eadf6	workstation: emo direct master push — allow-then-audit [ci skip] Viktor: emo may make any change; what matters is tracking what changed and why. ebarzin added to master push+merge whitelists (force-push stays disabled — append-only history). Tracking enforced three ways: - agent instructions (managed claudeMd + AGENTS.md): commit body MUST carry the user's plain-language intent; commits land on master directly; [ci skip] forbidden for non-admins - new notify-nonadmin-push step in .woodpecker/default.yml: Slack message for every non-admin master push (admin pushes silent) - PR flow remains the fallback for non-whitelisted users Accepted consequence (informed): emo's pushes auto-apply changed stacks via CI. Offboard runbook gains whitelist-removal step. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 14:53:43 +00:00
Viktor Barzin	2e5af5dc0e	workstation: keep non-admin infra clones fresh (hourly + at launch) [ci skip] Non-admins (emo) need current master without manual pulls. Two layers: - t3-provision-users reconcile gains refresh_locked_clone: fetch all remotes + ff-only master, guarded (on master, clean tree, upstream set); dirty/diverged clones are left alone with a WARN. - start-claude.sh freshens ~/code at session launch, 15s-capped so an offline remote never delays the session. Verified live on emo's clone: stale clone ff'd to tip by the reconciler; launcher snippet ff's when clean and refuses while a dirty file exists. Deployed to /usr/local/bin/t3-provision-users, /etc/skel/start-claude.sh, and emo's launcher. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 09:41:38 +00:00
Viktor Barzin	5d9417fbaa	workstation: emo contribute access + Phase-5 cutover done; gate master (push=apply) [ci skip] ADR-0004's premise was wrong: pushing master fires the Woodpecker apply pipeline (require_approval=forks only), so master pushes ARE deploys. Added Forgejo branch protection on master (push/merge whitelist=viktor, deploy keys allowed); non-admins contribute via branches + PRs. emo (ebarzin): write collaborator on viktor/infra, PAT in ~/.git-credentials, forgejo remote + upstream in his locked clone. Phase-5 finished: code-shared removed; ~/.claude symlinks kept (they ARE the skel shared-base mechanism — plan step 4c obsolete). Offboard runbook: revoke PAT + collaborator + group steps added. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 09:30:41 +00:00
Viktor Barzin	a1b7b0ca53	forgejo retention: revert to DRY_RUN — first live run orphaned OCI indexes [ci skip] The keep-set (newest 10 versions + latest + cache tags) treats multi-arch/attestation index CHILDREN — separate untagged sha256 versions — as deletable: for images not rebuilt recently they sort outside the newest-10 window and were pruned while their kept parent index survived. kms-website :latest and :dfc83fb children 404'd (RegistryManifestIntegrityFailure, caught by forgejo-integrity-probe within hours; deployed tag a794d1a unaffected). Healed: :latest re-pointed at the intact a794d1a index (also the newest commit), corrupt :dfc83fb version deleted, probe re-run clean (0 failures / 22 repos / 63 tags / 59 indexes). DRY_RUN=true applied live. Re-enable only with a container-aware keep-set — options in the post-mortem. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 09:22:47 +00:00
Viktor Barzin	e49c91e60c	monitoring: VzdumpBackup{Stale,NeverRun,Failing} alerts for the new VM-image backup vzdump-vms pushes vzdump_last_{run,success}_timestamp + vzdump_last_status to Pushgateway job vzdump-backup, but nothing alerted on them — a stopped/failing VM backup would be silent (exactly how the nfs-mirror reaping went unnoticed until I re-verified). Add the trio to the 3-2-1 group in prometheus_chart_values.tpl, mirroring the LVM/pfSense/nfs-mirror alerts. Stale = >~50h since last success. NOT [ci]-applied: this is a Terraform stack change — arms on the next `scripts/tg apply` of the monitoring stack (metrics already flow, so it arms immediately once applied). Admin-gated apply per org policy. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:10:46 +00:00
Viktor Barzin	d9ea7812f5	nfs-mirror: exclude /vzdump/ — it was reaping the new VM-image backups nightly nfs-mirror does `rsync -rlt --delete /srv/nfs/ -> /mnt/backup/`; any /mnt/backup dir with no /srv/nfs counterpart is an orphan and gets --delete'd. vzdump-vms (added yesterday) writes /mnt/backup/vzdump/, which wasn't excluded — so the 02:00 nfs-mirror run silently deleted both successful 40G devvm images (verified: dir gone, 40G freed, despite status=0 success logs). Add --exclude='/vzdump/' alongside the existing pvc-data/pfsense/pve-config/ sqlite-backup excludes that exist for exactly this reason. TDD-proven with an isolated rsync --delete -n -v. backup-dr.md notes the dependency. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:04:57 +00:00
Viktor Barzin	2b8c0def30	dns: pfSense forward-zone for viktorbarzin.me, nodes fully stock [ci skip] Round 3 of the forgejo-pull hairpin fix (per Viktor: no per-node customization — split-brain lives in the DNS infra): - pfSense Unbound domain override viktorbarzin.me -> Technitium 10.0.20.201 (applied via php write_config, backup on-box). Every Unbound client on every VLAN now gets the internal split-horizon answers (live Traefik IP via apex CNAME) with zero per-host config. - CoreDNS carve-out (TF, applied): dedicated viktorbarzin.me:53 block — forgejo pinned to Traefik ClusterIP via data source (pods cannot reach the ETP=Local LB IP pfSense now returns), all other .me names kept on public resolvers (pods' pre-existing behavior). Replaces the .:53 forgejo rewrite. - Removed the same-day resolved routing-domain drop-ins from all 7 nodes; node5/6 link DNS repointed Technitium -> pfSense (netplan + qm 205/206) for fleet parity; cloud-init no longer writes any DNS drop-ins. - Docs: dns.md, pfsense-unbound runbook (override + rollback), registry bullet, post-mortem final-architecture addendum. Verified: nodes resolve forgejo -> .203 via pfSense, crictl pull OK, pods resolve forgejo -> ClusterIP / others -> public, mail record works, .lan zone unaffected. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 08:32:34 +00:00
Viktor Barzin	1ee1bf0817	forgejo pulls: route *.viktorbarzin.me to Technitium, drop /etc/hosts pins [ci skip] Supersedes this morning's per-node /etc/hosts pin (no hardcoded service IPs on nodes, per Viktor). Technitium's split-horizon zone already resolves forgejo.viktorbarzin.me -> CNAME apex -> live Traefik LB IP (ingress-dns-sync auto-CNAMEs every ingress host; apex drift probe alerts) -- the nodes just never queried it. Rolled the devvm's systemd-resolved routing-domain pattern (~viktorbarzin.me -> 10.0.20.201) to all 7 nodes, removed the pins, verified getent + crictl pull via pure DNS. Also demoted node5/6's cloud-init global-dns.conf (DNS=8.8.8.8 1.1.1.1) to FallbackDNS-only: public servers in the global set race the routing domain. Its justification ("Technitium NXDOMAINs forgejo") was obsolete -- exactly the stale comment that pointed new nodes at the hairpin. hosts.toml mirror kept but documented as vestigial (Traefik 404s bare-IP requests; registry auth realm is an absolute URL). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 07:56:31 +00:00
Viktor Barzin	b6976ce014	forgejo pulls: pin registry name to internal Traefik in node /etc/hosts [ci skip] tuya-bridge was down 7.5h (ImagePullBackOff on k8s-node3): fresh kubelet pulls of forgejo.viktorbarzin.me images depended on the intermittently broken public-IP hairpin. The containerd hosts.toml mirror cannot keep pulls internal on its own — Traefik 404s its bare-IP requests (no Host/SNI match) and the registry Bearer realm is an absolute public URL fetched outside the mirror. Third incident of this class (buildkit 06-04, tripit/devvm 06-09). Fix: /etc/hosts pin 10.0.20.203 forgejo.viktorbarzin.me on every node — covers resolve + token + blob legs with correct SNI and valid cert. Applied live to all 7 nodes; persisted in the cloud-init bootstrap and the existing-node rollout script. Docs updated (registry bullet, dns.md hairpin scope + stale .200 literals, runbook) + post-mortem. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 07:15:24 +00:00
Viktor Barzin	7330cb6a0b	backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap The hand-managed Linux VMs (not in Terraform) were never imaged: the PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has no remote). vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd. Pushgateway job vzdump-backup. Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image layer + protection matrix), infra CLAUDE.md, AGENTS.md. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:54 +00:00
Viktor Barzin	dacd9d2d8a	t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip] Investigated the 0.0.25 break: it is ONLY an endpoint rename (/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing contract (credential payload, t3_session cookie, /api/auth/session) is byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep): - t3-dispatch: autoPair tries /api/auth/browser-session, falls back to /api/auth/bootstrap on 404 — one binary pairs across both versions and any rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25 before, green after). Built, deployed, verified live on 0.0.24 (all three users still 302 + t3_session via the fallback). - t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad build now auto-rolls-back. Validated against both versions. - t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3 state.sqlite (was the only copy, unbacked) -> the one-way forward schema migration becomes a restore, not sqlite surgery. timeout-guarded. - runbooks/t3-version-bump.md: the reversible cutover checklist. - post-mortem #5 (health-check) DONE + #6 added; service-catalog updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	baac46415f	t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip] The t3-autoupdate timer (re-enabled by the provisioner's step 5b with `--now`, which fires the missed daily job immediately on a Persistent timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions role->scopes, +proof_key_thumbprint) AND changed the bootstrap API, breaking t3-mint/pairing for ALL devvm users (pair prompt, no session). - t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a nightly tracker -- re-asserts the pin (a no-op when correct). - t3-provision-users.sh step 5b: drop `--now` (it triggered the immediate missed-job run that pulled the bad build). - setup-devvm.sh: install pinned t3@0.0.24 at machine setup. - unit Descriptions + service-catalog reflect the pin. - post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md. Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled the (now-pinned) enforcer, reset the 2 new users' disposable DBs, surgically reverted wizard's auth tables to level-30 (96 threads + live session preserved). All users verified 302 + t3_session. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	41c11216da	t3-dispatch: re-pair on present-but-invalid t3_session cookie The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09 auth-schema rollback wiped all server-side sessions, browsers kept dead 30-day t3_session cookies; the dispatcher proxied them straight through and t3 rendered its pair page ("all users must pair again"). Now a present cookie on a top-level document navigation is validated via the instance's /api/auth/session and re-paired on authenticated:false. Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html) so XHR/asset/WebSocket sub-requests are never answered with a 302; fails open (proxy through) on any validation error. Unit + handler tests added. [ci skip] Co-Authored-By: Claude <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	1edccedb1f	workstation: v2 membership implementation plan [ci skip] 8 tasks: engine derive_os_user + roster_from_members (TDD); read-only Authentik token (TF); setup-devvm.sh stages it; provisioner sources T3 Users members from the Authentik API (replaces roster.yaml); Authentik-managed membership + legacy os_user attributes; retire roster.yaml; e2e add/remove smoke. Pairs with the 2026-06-09 design doc. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:53 +00:00
Viktor Barzin	83f418159a	backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap The hand-managed Linux VMs (not in Terraform) were never imaged: the PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has no remote). vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd. Pushgateway job vzdump-backup. Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image layer + protection matrix), infra CLAUDE.md, AGENTS.md. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:30:19 +00:00
Viktor Barzin	bccaa08d8e	t3: prepare to adopt 0.0.25 — version-agnostic dispatch + real pairing health-check + state backup [ci skip] Investigated the 0.0.25 break: it is ONLY an endpoint rename (/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing contract (credential payload, t3_session cookie, /api/auth/session) is byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep): - t3-dispatch: autoPair tries /api/auth/browser-session, falls back to /api/auth/bootstrap on 404 — one binary pairs across both versions and any rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25 before, green after). Built, deployed, verified live on 0.0.24 (all three users still 302 + t3_session via the fallback). - t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad build now auto-rolls-back. Validated against both versions. - t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3 state.sqlite (was the only copy, unbacked) -> the one-way forward schema migration becomes a restore, not sqlite surgery. timeout-guarded. - runbooks/t3-version-bump.md: the reversible cutover checklist. - post-mortem #5 (health-check) DONE + #6 added; service-catalog updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	5ea238c707	t3: pin t3@0.0.24 + stop nightly auto-update (auth-outage fix) [ci skip] The t3-autoupdate timer (re-enabled by the provisioner's step 5b with `--now`, which fires the missed daily job immediately on a Persistent timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions role->scopes, +proof_key_thumbprint) AND changed the bootstrap API, breaking t3-mint/pairing for ALL devvm users (pair prompt, no session). - t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a nightly tracker -- re-asserts the pin (a no-op when correct). - t3-provision-users.sh step 5b: drop `--now` (it triggered the immediate missed-job run that pulled the bad build). - setup-devvm.sh: install pinned t3@0.0.24 at machine setup. - unit Descriptions + service-catalog reflect the pin. - post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md. Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled the (now-pinned) enforcer, reset the 2 new users' disposable DBs, surgically reverted wizard's auth tables to level-30 (96 threads + live session preserved). All users verified 302 + t3_session. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	2125651aaa	t3-dispatch: re-pair on present-but-invalid t3_session cookie The dispatcher only re-paired on an ABSENT cookie. After the 2026-06-09 auth-schema rollback wiped all server-side sessions, browsers kept dead 30-day t3_session cookies; the dispatcher proxied them straight through and t3 rendered its pair page ("all users must pair again"). Now a present cookie on a top-level document navigation is validated via the instance's /api/auth/session and re-paired on authenticated:false. Gated to document navs (Sec-Fetch-Dest: document, else Accept: text/html) so XHR/asset/WebSocket sub-requests are never answered with a 302; fails open (proxy through) on any validation error. Unit + handler tests added. [ci skip] Co-Authored-By: Claude <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	fbcc330214	workstation: v2 membership implementation plan [ci skip] 8 tasks: engine derive_os_user + roster_from_members (TDD); read-only Authentik token (TF); setup-devvm.sh stages it; provisioner sources T3 Users members from the Authentik API (replaces roster.yaml); Authentik-managed membership + legacy os_user attributes; retire roster.yaml; e2e add/remove smoke. Pairs with the 2026-06-09 design doc. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:21:39 +00:00
Viktor Barzin	fd0f4a0365	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip] `6d224861` came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:45:33 +00:00
Viktor Barzin	6d224861c4	stem95su: scheduled Drive->site sync CronJob (every 10m) CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard + --max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault secret/stem95su. Requires the GCP OAuth app published to Production or the refresh token expires ~weekly. Lands the gdrive-sync stack on master (it had landed on a feature branch by accident on the shared devvm checkout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:42:26 +00:00
Viktor Barzin	05b50d2b96	workstation: v2 membership design — Authentik-group-driven, email-identified [ci skip] Supersedes the MEMBERSHIP model of the 2026-06-07 design (roster.yaml SSoT). Key principle: workstation access (T3 Users group membership) is decoupled from cluster authorization (k8s_users + kubernetes-* groups, untouched). A user is defined once in Authentik: email + T3 Users membership + optional os_user attribute. Provisioner reconciles accounts from the Authentik API; roster.yaml retires. v1 foundation (config inheritance, locked clone, kubeconfig, swap, hardening, emo cutover) unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 07:19:25 +00:00
Viktor Barzin	37626cb89b	workstation: docs — mark RBAC + Authentik gate applied [ci skip] multi-tenancy.md + service-catalog.md status: per-user OIDC kubeconfig, oidc-power-user-readonly ClusterRole, emo k8s_users entry, and the Authentik T3 Users edge gate are now applied + verified. Remaining: emo cutover (Phase 5, held), offboarding apply-side (Phase 7), per-user MCP injection, roster-reconciled group membership. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 17:51:44 +00:00
Viktor Barzin	c611ecf84d	workstation: docs — multi-tenancy Workstation section + offboard runbook + service-catalog fix [ci skip] multi-tenancy.md: new DevVM Workstation section (roster SSoT, tiers, config inheritance, locked clone, built-vs-gated status). service-catalog.md t3code row: corrected the stale 'source of truth = /etc/ttyd-user-map' (now roster.yaml; the map/dispatch are GENERATED). offboard-user.md: written (was a referenced-but-missing dead link) — staged reversible-cut-then-gated-destructive for both cluster + workstation surfaces. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 14:27:17 +00:00
Viktor Barzin	3feb69e379	workstation: pin verified config-inheritance mechanism in design §4 [ci skip] Spike GO (claude 2.1.168): managed claudeMd reaches a session; no managed-skills key exists so skills/rules inherit via per-user ~/.claude symlinks to the base (seeded in /etc/skel). Records the settings.json 0664->0600 leak fix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 14:09:13 +00:00
Viktor Barzin	6504911a77	matrix: open (tokenless) registration + bot mitigations + #security alert User-chosen fully-open registration on tuwunel (no CAPTCHA support; browser challenges break native clients). Bot defense is layered instead: - Traefik rate-limit Middleware on a path-scoped /register ingress carve-out, keyed on request Host (GLOBAL /register cap) not source IP — the host is reachable via both Cloudflare-IPv4 (CF-Connecting-IP) and IPv6-direct (HE tunnel, no CF header), so a per-source key let IPv6 bots bypass. 10/min, burst 20, per replica; CrowdSec is the hard backstop on both paths. - Loki ruler rule MatrixNewUserRegistered -> lane=security -> existing #security Slack receiver (matches "registered on this server", never the rejection line). tuwunel's admin bot also posts signups to the admin room. Dropped the REGISTRATION_TOKEN env (secret/matrix + ESO kept for revert). Applied via scripts/tg (matrix tier-1 + targeted monitoring configmap), so [ci skip] to avoid CI full-applying monitoring (unrelated grafana-acl drift). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 13:27:02 +00:00
Viktor Barzin	bb7bcf803b	multi-user-workstation: design + phased implementation plan devvm multi-user Claude Code workstation: role-driven profiles (admin/power-user/namespace-owner) off one git roster (single source of truth, full onboard->offboard lifecycle); config inheritance via Claude's native machine-wide managed layer; per-user writable git-crypt-locked infra clone (ungated, apply-time is the boundary); per-tier OIDC kubectl; per-user secrets/auth (memory isolation deferred); incremental, emo-safe migration; capacity prereqs. Folds in gap-analysis findings verified live 2026-06-08. Designed, not yet implemented. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 12:58:29 +00:00
Viktor Barzin	3d6c5b8bc7	matrix/authentik: remove orphaned Matrix OAuth2 app + provider (post-tuwunel) The migration left a UI-managed (not TF) Authentik OIDC app orphaned — tuwunel uses native password auth, so nothing consumed it. Deleted application `matrix` + OAuth2 provider pk=6 via the Authentik API (user-confirmed). Drop the stale Matrix rows from the SSO reference tables and update the plan's residual list. Doc-only [ci skip]. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 12:32:49 +00:00
Viktor Barzin	23602f393e	matrix: migrate Synapse -> tuwunel (Rust homeserver, fresh start, federated) Replace the cramped Synapse deployment with tuwunel v1.7.1: embedded RocksDB drops the CNPG dependency (both init-containers, the db ESO, the Reloader annotation all gone), env-var config, fsGroup-owned encrypted PVC, federation on, tuwunel-served well-known delegation to :443. server_name unchanged (matrix.viktorbarzin.me); fresh start (no Synapse->RocksDB migration path). Registered @viktor admin then disabled registration (403). Cleanup: removed the orphaned pg-matrix Vault static role and dropped the matrix Postgres DB/role; updated service-catalog, upgrade-config, CLAUDE.md PG-rotation list, and the Matrix OIDC->orphaned auth notes. Design+plan in docs/plans/2026-06-08-matrix-synapse-to-tuwunel-*. Already applied via scripts/tg (matrix tier-1 + targeted vault tier-0), so [ci skip] to avoid CI reconciling an unrelated pre-existing vault OIDC tune-TTL drift. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 11:58:17 +00:00
Viktor Barzin	d4ec5768b2	vault-token-renew: version the devvm renewer + user units in the repo The devvm periodic Vault admin token (token-devvm-wizard, period=768h, policies default+sops-admin+vault-admin) is kept alive by a systemd user timer, but the renewer script + units lived only under ~/.local/bin and ~/.config/systemd/user — lost on a devvm rebuild. Move them into the repo as the source of truth so a rebuild can restore them. (version-only scope: behavior unchanged; no canonical-file/self-heal added.) - scripts/vault-token-renew.{sh,service,timer}: renewer + user units, refactored into pure drift-guard functions + a guarded main (behavior identical; deployed live and verified still renewing with full write access). - scripts/test-vault-token-renew.sh: unit-tests the drift guard + lookup-JSON parsing, incl. the 2026-06-05 woodpecker-clobber case (17 assertions). - docs/runbooks/vault-token-renew-devvm.md: deploy, mint/re-mint, health-check, drift recovery. - docs/architecture/secrets.md: correct the stale '~/.vault-token = OIDC token' description for devvm. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 22:10:06 +00:00
Viktor Barzin	551412488b	apiserver: enable audit logging (low-write Metadata) + ship to Loki Some checks failed ci/woodpecker/push/default Pipeline failed Details ci/woodpecker/push/build-cli Pipeline was successful Details Resource changes/deletions are now attributable (the novelapp deletion this week was untraceable because apiserver audit was off). Low-write policy: drops reads/noise, Metadata level on mutations, omitStages RequestReceived. Wired into the kube-apiserver static-pod manifest + kubeadm-config (v1beta4 extraArgs/extraVolumes -> survives kubeadm upgrade) on k8s-master; Alloy tails /var/log/kubernetes/audit/audit.log -> Loki {job=kubernetes-audit}. Root cause that had silently blocked this AND OIDC for weeks: a stray kube-apiserver.yaml.bak inside /etc/kubernetes/manifests/ was a duplicate static-pod manifest kubelet ran instead of the real one, dropping every flag added to the real manifest. Removed it. Runbook added. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	9529eedfe0	docs(security): bot-block-proxy is a no-op while poison-fountain is at 0 [ci skip] Reflect commit b6dd23b1: bot-block-proxy short-circuits /auth to return 200 instead of proxying to the scaled-to-0 poison-fountain. - security.md Layer 1 + tarpit description + troubleshooting (fix stale stacks/platform path -> traefik stack; drop misleading restart-poison-fountain step). - .claude/CLAUDE.md: add matrix to PG rotation list; document that startup-read secret consumers need a Reloader annotation (matrix root cause, found via Loki 2026-06-05). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	d808694af4	docs(storage): record harden-half shipped (orphan cleanup + ghost-reconcile) All checks were successful ci/woodpecker/push/default Pipeline was successful Details ci/woodpecker/push/build-cli Pipeline was successful Details 2a orphan cleanup (67 Released PVs + 475 LVs removed, VG pve 997->~410) + 2b csi-ghost-reconcile CronJob done — ghost-disk doom loop closed by construction, beads code-dfjn retireable. Cap kept at 28 (lowering would reverse the 2026-05-25 eviction-cascade post-mortem fix). Phase-1: insta2spotify migrated (noted its 3.26GB image re-pull blip on node reschedule). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 21:39:36 +00:00
Viktor Barzin	63182730f9	docs(storage): record Wave-2 NFS migration + harden-proxmox-csi decision (option 1) All checks were successful ci/woodpecker/push/default Pipeline was successful Details ci/woodpecker/push/build-cli Pipeline was successful Details Document the 2026-06-05 decision to keep proxmox-csi and harden it (keep PVC mobility, no hardware) over TopoLVM (pins to node) / Longhorn (2x writes on single shared HDD). Wave-2 moved 5 non-DB workloads off block to NFS (tandoor, speedtest, hackmd, changedetection, send), freeing 5 LUN slots. - storage.md: live PVC counts, Retain-policy/orphan-LV note, Wave-2 history, updated cap-relief levers - topolvm-evaluation.md: stamped NOT ADOPTED with rationale + pointer to the decision doc Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:15:21 +00:00
Viktor Barzin	c24b4a21d8	docs(architecture): fix stale 5-node claim -> 7 nodes (k8s-node1..6) [ci skip] Cluster grew to 7 nodes (k8s-master + node1..6; node5/6 added ~10d ago) but several docs still said "5 nodes". Corrected with live specs: - overview.md: 7-node enumeration; node1 is 16c/48GB (doc wrongly said 32GB), node2-6 are 8c/32GB general workers - compute.md: "5-node" -> "7-node" cluster description - dns.md: NodeLocal DNSCache DaemonSet "5 nodes" -> "7 nodes" - mailserver.md: HAProxy backend diagram "node1..4" -> "node1..6" Illustrative "0/5 nodes available" scheduler-error examples left as-is. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:03:58 +00:00
Viktor Barzin	52f5de905d	docs(context): freshen infra glossary (modules, tiers, new concepts) [ci skip] Refresh CONTEXT.md against current repo + cluster reality (grill-with-docs): - Module taxonomy rewrite: drop fictional k8s_app/helm_app/postgres_app factory modules (never existed); name the real four (ingress_factory, nfs_volume, anubis_instance, setup_tls_secret) + the shared / Stack-local / flat distinction; flag vestigial modules/kubernetes/<app> dirs. - Rename "Ingress auth tier" -> "Ingress auth" (discrete modes, not tiers); reserve "tier" for State tier + Namespace tier only. - Add local-path entry (cluster default SC; node-local footgun warning). - Add concepts: Keel, Diun, CNPG/pg-cluster, MetalLB LB-IP split, Calico. - Add "policy" ambiguity flag (Kyverno vs Calico NetworkPolicy vs Vault/RBAC). - Fix node count 5 -> 7 (k8s-master + k8s-node1..6). Doc-sync (same commit per repo rules): - overview.md: replace fictional factory modules with the real shared modules + the flat/stack-local pattern. - .claude/CLAUDE.md: drop dead nfs-proxmox column from the storage decision table + stale cross-reference (vault migrated off it 2026-04-25). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 19:34:49 +00:00
Viktor Barzin	aa948be581	storage: migrate tandoor off proxmox-lvm to NFS (LUN-cap relief) All checks were successful ci/woodpecker/push/default Pipeline was successful Details ci/woodpecker/push/build-cli Pipeline was successful Details tandoor is PostgreSQL-backed with no embedded DB, so its media/static PVC is NFS-safe. Frees one proxmox-csi SCSI-LUN slot. Part of the 'harden proxmox-csi + NFS' plan (keeps PVC mobility, no new hardware) — see docs/plans/2026-06-05-block-storage-harden-nfs-design.md. - swap tandoor-data-proxmox -> nfs_volume module (nfs-truenas SC) - data copied + verified (HTTP 200 on NFS volume); block PVC removed - block LV retained per SC policy (orphan cleanup tracked in code-dfjn) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 19:34:47 +00:00
Viktor Barzin	bc33cd5ac4	monitoring: NodeFilesystemFull 90%->95% + Synology storage runbook The Synology offsite backup target (/mnt/synology-backup, surfaced via the PVE host NFS mount) sits at ~94% by design and was firing NodeFilesystemFull continuously. Per user request, raise the threshold to 95% (<5% free). NOTE: NodeFilesystemFull is a global node-filesystem rule, so this also loosens the warning on k8s node/system disks; BackupDiskFull (sda /mnt/backup) stays at 85%. Also adds docs/runbooks/synology-storage.md: how to assess Synology usage WITHOUT du (Storage Analyzer weekly CSVs, df/btrfs/qgroup), btrfs async/snapshot-pinned reclaim, the 2026-06-05 capacity assessment (94% full; Backup share 4.42TiB), and ~500GiB of homelab cleanup candidates (redundant gphotos Takeout, old laptop VM images, archives). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 18:18:31 +00:00
Viktor Barzin	dbe115910f	monitoring: add local-only prometheus-query.lan ingress for ha-sofia SNMP sensors ha-sofia's 7 R730 REST sensors (CPU/exhaust/inlet temp, power, 2x PSU voltage, fan) read the iDRAC via the slow on-demand Redfish exporter (scan_interval 120, ~16-22s/fetch, intermittent `unavailable` blips). Migrated them to a FAST Prometheus query of the SNMP values (instant, ~1m-fresh from the snmp-idrac scrape), scan_interval 30. This adds the enabling ingress: `prometheus-query.viktorbarzin.lan` → `prometheus-server:80`, auth=none, allow_local_access_only, path-scoped to `/api/v1/query` (read-only instant-query only — not the UI/admin/federation). ha-sofia can't use `prometheus.viktorbarzin.me` (Authentik-gated, no OIDC from a REST sensor), so this mirrors the existing local-only `.lan` exporter ingresses HA already queries. The ha-sofia REST file (`/config/rest_resources/idrac_redfish_exporter.yaml`) was edited in place (auto-version-controlled by the HA version-control add-on; pre-migration copy at `/config/idrac_redfish_exporter.bak-pre-snmp`). The Technitium CNAME `prometheus-query.viktorbarzin.lan -> ingress.viktorbarzin.lan` was added manually via the API — like the other `.lan` exporter hosts it is NOT auto-synced (the technitium-ingress-dns-sync CronJob only creates `.me` records). Follow-up (already noted for the Loki sensor): extend that sync to manage `.lan` CNAMEs too. The Redfish remnant's `sensors` collector is now vestigial (HA no longer reads it). Verified: all 7 HA sensors report correct fresh values from Prometheus (fan 10800 rpm, CPU 62.0C, power 280W, PSU 230/240V). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 17:25:06 +00:00
Viktor Barzin	b7cb74f1b5	docs(monitoring): cluster log aggregation (Alloy fix) + Cluster Logs dashboard + HA sensors [ci skip] Document the 2026-06-05 cluster-wide log observability work: the Alloy local.file_match fix (loki.source.file doesn't expand globs) + stage.cri, the new "Cluster Logs" Grafana dashboard, the ha-sofia cluster-log-health REST sensors, and the loki.viktorbarzin.lan Technitium-record follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 17:15:57 +00:00

1 2 3 4 5 ...

308 commits