infra

Author	SHA1	Message	Date
Viktor Barzin	eae35c511a	pfsense: SNI-routed internal 443 — mail.viktorbarzin.me serves webmail everywhere Completes the internal port table of the mail front door (10.0.20.1): 443 was squatted by the pfSense webGUI (self-signed cert expired 2022), so internal webmail and the kuma [External] mail probe hit the firewall login instead of Roundcube — the last leg of the mail split-brain name. Design (Viktor): route by what the client asked for. New HAProxy frontend internal_https_443 (binds 10.0.20.1+10.0.10.1 :443, mode tcp): SNI present -> Traefik .203 with send-proxy-v2 (trusted, IPv6-bridge pattern, no health check per the PROXY-probe gotcha); SNI of pfsense.viktorbarzin.{lan,me} or NO SNI (bare-IP admin access) -> webGUI, which moved to :8443 (invisible to habits — https://10.0.20.1 still lands on the login page; :8443 doubles as direct fallback). The reverse-proxy pfsense ingress now targets :8443 directly. Declared idempotently in pfsense-haproxy-bootstrap.php; config.xml backed up on-box (config.xml.bak-2026-06-10-pre-sni443). Verified: bare IP -> GUI login; pfsense.viktorbarzin.lan -> GUI; pfsense.viktorbarzin.me -> 302 via ingress; mail.viktorbarzin.me -> Roundcube with STRICT cert validation; :993 IMAPS untouched. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 18:41:07 +00:00
Viktor Barzin	2825cb1703	workstation: per-user code_layout — workspace puts project repos under ~/code (ancamilea + tripit) Viktor asked to restructure Anca's setup: her ~/code WAS the infra clone itself; he wants ~/code to be the directory where all her project repos (tripit etc.) live side by side, with infra moved to a subdirectory. - roster.yaml gains per-user 'code_layout: single|workspace' + 'repos', validated + derived by roster_engine.py (12 new tests, 40 total). - t3-provision-users reconcile: auto-migrates a single-layout ~/code to ~/code/infra (running processes follow the moved inode), hoists nested project clones to the workspace root, clones roster repos from Forgejo AS the user (their PAT makes private repos work), and wires the documented forgejo remote + forgejo/master upstream into clones that predate that contract. - Fixed a latent TSV bug: empty jq @tsv fields collapse under tab-IFS read, shifting later fields left (groups was only safe by being the last field) — emit '-' sentinels instead. - start-claude.sh session freshen is layout-aware (freshens each repo under ~/code for workspace users). - managed claudeMd + AGENTS.md non-admin recipe + multi-tenancy.md updated in the same change. Applied live: ancamilea = workspace (infra at ~/code/infra, her existing tripit clone hoisted to ~/code/tripit, master upstream switched to forgejo/master); emo stays single layout, untouched. [ci skip]	2026-06-10 18:05:31 +00:00
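The TSV bug named in 2825cb1703 can be reproduced in isolation. The following is a minimal sketch, not the code from t3-provision-users: the three-field row layout and the values `alice` and `sudo` are made-up sample data. It shows why an empty field disappears when a tab-separated row is read with a tab `IFS` (tab is an IFS whitespace character, so a run of tabs counts as one delimiter) and how a `-` sentinel keeps the columns aligned.

```shell
# Sketch only: 'alice', 'sudo' and the three-field layout are made-up sample data.
TAB=$(printf '\t')

# Row with an empty middle field, the shape jq's @tsv produces for "".
row_with_empty="alice${TAB}${TAB}sudo"
IFS="$TAB" read -r user layout groups <<EOF
$row_with_empty
EOF
collapsed="${user}|${layout}|${groups}"

# Same row with a '-' sentinel in place of the empty field.
row_with_sentinel="alice${TAB}-${TAB}sudo"
IFS="$TAB" read -r user layout groups <<EOF
$row_with_sentinel
EOF
# Map the sentinel back to an empty value after the split.
[ "$layout" = "-" ] && layout=""
fixed="${user}|${layout}|${groups}"

printf 'collapsed: %s\n' "$collapsed"
printf 'fixed:     %s\n' "$fixed"
```

In the first read, `sudo` lands in `layout` and `groups` is left empty, which matches the "shifting later fields left" description in the commit message.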
Viktor Barzin	daddafd279	docs: superset rule for the internal viktorbarzin.me zone (mail-auth records) [ci skip] Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:47:31 +00:00
Viktor Barzin	e7fbf986fb	workstation: rename tmux persistence out of the t3 namespace [ci skip] Viktor's correction: this feature is about the tmux web-terminal sessions, not t3 — t3 auto-saves its own threads (~/.t3 state + daily t3-backup-state). Renamed t3-tmux-sessions -> tmux-persist (units tmux-persist-save.timer / tmux-persist-restore.service, state /var/lib/tmux-persist), header rescoped to say exactly that. Same mechanism, correct taxonomy. Old units removed, state migrated, re-verified live (5 emo + 3 wizard sessions snapshotted). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:42:52 +00:00
Viktor Barzin	2e4f48f3fc	workstation: tmux sessions survive devvm reboots (save timer + boot restore) Viktor: emo's open web-terminal sessions must persist across reboots. Claude conversations were already durable on disk; the volatile part was the tmux wiring (which named session runs which conversation). t3-tmux-sessions save (5-min timer) snapshots every roster user's sessions to /var/lib/t3-tmux-state/<user>.tsv — conversation uuid taken from argv --resume (self-sustaining once restored) or the newest transcript in the cwd-slug project dir created after process start (fresh launcher sessions; claude does NOT hold its transcript fd open, so fd-sniffing was a dead end). t3-tmux-sessions restore (boot oneshot, also safe after partial loss) recreates missing sessions with claude --resume <uuid>. Reconciler self-heals both units' enablement. Verified live: emo's 5 sessions snapshotted with correct uuids; killed R730-cooling -> restore brought it back resuming the same conversation (context meter identical); other sessions untouched. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:39:32 +00:00
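The "conversation uuid taken from argv --resume" step in 2e4f48f3fc can be pictured with a small helper. This is a hedged sketch, not the t3-tmux-sessions implementation: the function name `resume_uuid` and the sample uuid are hypothetical, and how the real script obtains a process's argv is not shown in the log. The helper only walks an argument list and prints the word that follows `--resume`.

```shell
# Sketch only: resume_uuid and the sample uuid below are hypothetical.
# Prints the argument following --resume; returns non-zero if there is none.
resume_uuid() {
    while [ "$#" -gt 0 ]; do
        if [ "$1" = "--resume" ]; then
            # A trailing --resume with no value counts as "not found".
            [ "$#" -ge 2 ] || return 1
            printf '%s\n' "$2"
            return 0
        fi
        shift
    done
    return 1
}

# Example: a restored session's command line carries its own uuid.
found=$(resume_uuid claude --resume 123e4567-e89b-12d3-a456-426614174000)
printf 'uuid: %s\n' "$found"
```

A session started without `--resume` makes the helper return non-zero, which is the case where the commit falls back to the newest transcript in the project directory.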
Viktor Barzin	59a531b8e0	coredns: pods get internal split-horizon answers for viktorbarzin.me [ci skip] Forward the viktorbarzin.me:53 pod block to the Technitium ClusterIP (10.96.0.53, same as the .lan block) instead of 8.8.8.8/1.1.1.1. Pods become ordinary internal clients (CNAME -> apex -> live Traefik LB; mail -> 10.0.20.1), fixing the 27 non-proxied [External] uptime-kuma monitors that rode the TP-Link NAT loopback (hard-down since 06-09; loopback refuses flows whose source equals the reflection target, which all pfSense-SNAT'd cluster traffic does). Enabled by re-testing a stale premise: on k8s 1.34 pods DO reach the ETP=Local Traefik LB IP (kube-proxy short-circuits in-cluster traffic to LB IPs; verified from pods on three non-Traefik nodes) — re-verify after major k8s upgrades; canary = [External] fleet going red. The NAT-layer alternatives (pfSense rdr, SNAT-drop) were rejected: both fight return-path asymmetry and deepen TP-Link dependency. Verified in-pod: immich -> .203 + HTTPS 200, mail -> 10.0.20.1, forgejo -> Traefik ClusterIP (pin kept for Technitium-outage resilience). Proxied [External] monitors now test the internal path — true edge fidelity moves to the external vantage (ha-london, next fix). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 16:21:34 +00:00
Viktor Barzin	35c89fa90c	workstation: managed Claude config self-deploys from the repo [ci skip] Viktor's claudeMd edits must keep reaching every user now that emo is out of the shared tree. Two reconciler additions: - sync_managed_config: installs scripts/workstation/managed-settings.json to /etc/claude-code whenever the repo copy changes — editing the org claudeMd is now edit + commit, no manual install step - refresh_codex_mirror: regenerates each user's ~/.codex/AGENTS.md (static mirror of the claudeMd; header-guarded so user-customized files are never clobbered) Verified live: corrupted emo's mirror -> reconcile restored it; wizard's stale mirror refreshed; in-sync managed config no-ops. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 16:03:24 +00:00
Viktor Barzin	8cfd0e5e5c	Merge forgejo/master: reconcile diverged lineages [ci skip] Local checkout carried the 2026-06-10 DNS/registry architecture series (pfSense forward-zone, CoreDNS viktorbarzin.me:53 carve-out, nodes stock) + vzdump/nfs-mirror/workstation-rebuild commits that never reached the canonical remote, while forgejo master received the emo-access series via isolated worktrees. Viktor asked to merge. Conflict resolutions (newest iteration wins in each file): - stacks/forgejo/cleanup.tf: LOCAL — dry_run=true (2026-06-10 revert after live retention orphaned OCI indexes; remote had 06-09 enable) - .claude/CLAUDE.md, docs/architecture/backup-dr.md: LOCAL — final registry/DNS architecture + implemented vzdump alerts - scripts/workstation/setup-devvm.sh: LOCAL — pinned-version, reproducible-rebuild refactor (kubelogin pin, restructured staging) - scripts/workstation/managed-settings.json: FORGEJO — the allow-then-audit claudeMd (matches /etc deployment byte-for-byte) - scripts/t3-provision-users.sh: FORGEJO comment; refresh_locked_clone intact [ci skip]: all stack changes in the local lineage were applied live this morning — CI would re-walk 100+ stacks via the modules/ fallback for zero state change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:21:50 +00:00
Viktor Barzin	a34f9ff3b8	docs: infra Woodpecker repo-82 ops — in-cluster webhook, secret parity, empty-commit gotcha [ci skip] Emo's first direct pushes surfaced three latent CI issues, all fixed out-of-band today and recorded here: webhook deliveries to ci.viktorbarzin.me timing out on the public-IP hairpin (hook now targets the in-cluster woodpecker-server service), repo 82 registered without the repo-scoped secret set (cloned from repo 1 in the DB), and empty commits compiling every workflow so missing secrets hard-error. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:09:17 +00:00
Viktor Barzin	a49d1eadf6	workstation: emo direct master push — allow-then-audit [ci skip] Viktor: emo may make any change; what matters is tracking what changed and why. ebarzin added to master push+merge whitelists (force-push stays disabled — append-only history). Tracking enforced three ways: - agent instructions (managed claudeMd + AGENTS.md): commit body MUST carry the user's plain-language intent; commits land on master directly; [ci skip] forbidden for non-admins - new notify-nonadmin-push step in .woodpecker/default.yml: Slack message for every non-admin master push (admin pushes silent) - PR flow remains the fallback for non-whitelisted users Accepted consequence (informed): emo's pushes auto-apply changed stacks via CI. Offboard runbook gains whitelist-removal step. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 14:53:43 +00:00
Viktor Barzin	2e5af5dc0e	workstation: keep non-admin infra clones fresh (hourly + at launch) [ci skip] Non-admins (emo) need current master without manual pulls. Two layers: - t3-provision-users reconcile gains refresh_locked_clone: fetch all remotes + ff-only master, guarded (on master, clean tree, upstream set); dirty/diverged clones are left alone with a WARN. - start-claude.sh freshens ~/code at session launch, 15s-capped so an offline remote never delays the session. Verified live on emo's clone: stale clone ff'd to tip by the reconciler; launcher snippet ff's when clean and refuses while a dirty file exists. Deployed to /usr/local/bin/t3-provision-users, /etc/skel/start-claude.sh, and emo's launcher. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 09:41:38 +00:00
Viktor Barzin	5d9417fbaa	workstation: emo contribute access + Phase-5 cutover done; gate master (push=apply) [ci skip] ADR-0004's premise was wrong: pushing master fires the Woodpecker apply pipeline (require_approval=forks only), so master pushes ARE deploys. Added Forgejo branch protection on master (push/merge whitelist=viktor, deploy keys allowed); non-admins contribute via branches + PRs. emo (ebarzin): write collaborator on viktor/infra, PAT in ~/.git-credentials, forgejo remote + upstream in his locked clone. Phase-5 finished: code-shared removed; ~/.claude symlinks kept (they ARE the skel shared-base mechanism — plan step 4c obsolete). Offboard runbook: revoke PAT + collaborator + group steps added. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 09:30:41 +00:00
Viktor Barzin	e49c91e60c	monitoring: VzdumpBackup{Stale,NeverRun,Failing} alerts for the new VM-image backup vzdump-vms pushes vzdump_last_{run,success}_timestamp + vzdump_last_status to Pushgateway job vzdump-backup, but nothing alerted on them — a stopped/failing VM backup would be silent (exactly how the nfs-mirror reaping went unnoticed until I re-verified). Add the trio to the 3-2-1 group in prometheus_chart_values.tpl, mirroring the LVM/pfSense/nfs-mirror alerts. Stale = >~50h since last success. NOT [ci]-applied: this is a Terraform stack change — arms on the next `scripts/tg apply` of the monitoring stack (metrics already flow, so it arms immediately once applied). Admin-gated apply per org policy. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:10:46 +00:00
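The "Stale = >~50h since last success" condition in e49c91e60c comes down to a timestamp difference. Per the commit, the actual alert is a Prometheus rule in prometheus_chart_values.tpl. The shell below is only an illustrative sketch of the threshold arithmetic, with a hypothetical function name `is_stale` and sample epoch values.

```shell
# Sketch only: 50h is the commit's approximate threshold; timestamps are sample values.
STALE_AFTER_SECONDS=$((50 * 3600))

# Succeeds (exit 0) when the last success is older than the threshold.
is_stale() {
    last_success=$1
    now=$2
    [ $((now - last_success)) -gt "$STALE_AFTER_SECONDS" ]
}

# One second past the 50h window counts as stale.
if is_stale 1000000 1180001; then status=stale; else status=fresh; fi
printf '%s\n' "$status"
```

A roughly 50-hour window on a daily job lets one run slip by a couple of hours without alerting, while a fully missed day still trips the alert.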
Viktor Barzin	d9ea7812f5	nfs-mirror: exclude /vzdump/ — it was reaping the new VM-image backups nightly nfs-mirror does `rsync -rlt --delete /srv/nfs/ -> /mnt/backup/`; any /mnt/backup dir with no /srv/nfs counterpart is an orphan and gets --delete'd. vzdump-vms (added yesterday) writes /mnt/backup/vzdump/, which wasn't excluded — so the 02:00 nfs-mirror run silently deleted both successful 40G devvm images (verified: dir gone, 40G freed, despite status=0 success logs). Add --exclude='/vzdump/' alongside the existing pvc-data/pfsense/pve-config/ sqlite-backup excludes that exist for exactly this reason. TDD-proven with an isolated rsync --delete -n -v. backup-dr.md notes the dependency. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:04:57 +00:00
Viktor Barzin	2b8c0def30	dns: pfSense forward-zone for viktorbarzin.me, nodes fully stock [ci skip] Round 3 of the forgejo-pull hairpin fix (per Viktor: no per-node customization — split-brain lives in the DNS infra): - pfSense Unbound domain override viktorbarzin.me -> Technitium 10.0.20.201 (applied via php write_config, backup on-box). Every Unbound client on every VLAN now gets the internal split-horizon answers (live Traefik IP via apex CNAME) with zero per-host config. - CoreDNS carve-out (TF, applied): dedicated viktorbarzin.me:53 block — forgejo pinned to Traefik ClusterIP via data source (pods cannot reach the ETP=Local LB IP pfSense now returns), all other .me names kept on public resolvers (pods' pre-existing behavior). Replaces the .:53 forgejo rewrite. - Removed the same-day resolved routing-domain drop-ins from all 7 nodes; node5/6 link DNS repointed Technitium -> pfSense (netplan + qm 205/206) for fleet parity; cloud-init no longer writes any DNS drop-ins. - Docs: dns.md, pfsense-unbound runbook (override + rollback), registry bullet, post-mortem final-architecture addendum. Verified: nodes resolve forgejo -> .203 via pfSense, crictl pull OK, pods resolve forgejo -> ClusterIP / others -> public, mail record works, .lan zone unaffected. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 08:32:34 +00:00
Viktor Barzin	1ee1bf0817	forgejo pulls: route *.viktorbarzin.me to Technitium, drop /etc/hosts pins [ci skip] Supersedes this morning's per-node /etc/hosts pin (no hardcoded service IPs on nodes, per Viktor). Technitium's split-horizon zone already resolves forgejo.viktorbarzin.me -> CNAME apex -> live Traefik LB IP (ingress-dns-sync auto-CNAMEs every ingress host; apex drift probe alerts) -- the nodes just never queried it. Rolled the devvm's systemd-resolved routing-domain pattern (~viktorbarzin.me -> 10.0.20.201) to all 7 nodes, removed the pins, verified getent + crictl pull via pure DNS. Also demoted node5/6's cloud-init global-dns.conf (DNS=8.8.8.8 1.1.1.1) to FallbackDNS-only: public servers in the global set race the routing domain. Its justification ("Technitium NXDOMAINs forgejo") was obsolete -- exactly the stale comment that pointed new nodes at the hairpin. hosts.toml mirror kept but documented as vestigial (Traefik 404s bare-IP requests; registry auth realm is an absolute URL). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 07:56:31 +00:00
Viktor Barzin	b6976ce014	forgejo pulls: pin registry name to internal Traefik in node /etc/hosts [ci skip] tuya-bridge was down 7.5h (ImagePullBackOff on k8s-node3): fresh kubelet pulls of forgejo.viktorbarzin.me images depended on the intermittently broken public-IP hairpin. The containerd hosts.toml mirror cannot keep pulls internal on its own — Traefik 404s its bare-IP requests (no Host/SNI match) and the registry Bearer realm is an absolute public URL fetched outside the mirror. Third incident of this class (buildkit 06-04, tripit/devvm 06-09). Fix: /etc/hosts pin 10.0.20.203 forgejo.viktorbarzin.me on every node — covers resolve + token + blob legs with correct SNI and valid cert. Applied live to all 7 nodes; persisted in the cloud-init bootstrap and the existing-node rollout script. Docs updated (registry bullet, dns.md hairpin scope + stale .200 literals, runbook) + post-mortem. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 07:15:24 +00:00
Viktor Barzin	7330cb6a0b	backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap The hand-managed Linux VMs (not in Terraform) were never imaged: the PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has no remote). vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd. Pushgateway job vzdump-backup. Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image layer + protection matrix), infra CLAUDE.md, AGENTS.md. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:41:54 +00:00
Viktor Barzin	83f418159a	backup: image-level vzdump of hand-managed VMs (devvm) — close no-VM-backup DR gap The hand-managed Linux VMs (not in Terraform) were never imaged: the PVC/NFS/pfSense/PVE-config scripts cover cluster data but no VM disk. A lost devvm disk = unrecoverable home dirs + local-only git repos (monorepo root has no remote). vzdump-vms.{sh,service,timer}: daily 01:00 live `vzdump --mode snapshot` of VZDUMP_VMIDS (default 102=devvm) -> /mnt/backup/vzdump (Copy 2), keep 3; the monthly offsite-sync full pass mirrors it to Synology (Copy 3). Guest agent enabled -> fs-consistent. Nice/idle-ionice so it never starves etcd. Pushgateway job vzdump-backup. Deployed live to PVE + timer enabled. Docs updated: backup-dr.md (new VM-image layer + protection matrix), infra CLAUDE.md, AGENTS.md. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:30:19 +00:00
Viktor Barzin	fd0f4a0365	fix: restore tree dropped by `6d224861`; land stem95su gdrive-sync (10m) [ci skip] `6d224861` came from a --no-checkout worktree whose empty index made the commit drop every file except two. This restores 05b50d2b's full tree and correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the live infra was never applied from the broken commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:45:33 +00:00
Viktor Barzin	6d224861c4	stem95su: scheduled Drive->site sync CronJob (every 10m) CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard + --max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault secret/stem95su. Requires the GCP OAuth app published to Production or the refresh token expires ~weekly. Lands the gdrive-sync stack on master (it had landed on a feature branch by accident on the shared devvm checkout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 08:42:26 +00:00
Viktor Barzin	37626cb89b	workstation: docs — mark RBAC + Authentik gate applied [ci skip] multi-tenancy.md + service-catalog.md status: per-user OIDC kubeconfig, oidc-power-user-readonly ClusterRole, emo k8s_users entry, and the Authentik T3 Users edge gate are now applied + verified. Remaining: emo cutover (Phase 5, held), offboarding apply-side (Phase 7), per-user MCP injection, roster-reconciled group membership. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 17:51:44 +00:00
Viktor Barzin	c611ecf84d	workstation: docs — multi-tenancy Workstation section + offboard runbook + service-catalog fix [ci skip] multi-tenancy.md: new DevVM Workstation section (roster SSoT, tiers, config inheritance, locked clone, built-vs-gated status). service-catalog.md t3code row: corrected the stale 'source of truth = /etc/ttyd-user-map' (now roster.yaml; the map/dispatch are GENERATED). offboard-user.md: written (was a referenced-but-missing dead link) — staged reversible-cut-then-gated-destructive for both cluster + workstation surfaces. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 14:27:17 +00:00
Viktor Barzin	3d6c5b8bc7	matrix/authentik: remove orphaned Matrix OAuth2 app + provider (post-tuwunel) The migration left a UI-managed (not TF) Authentik OIDC app orphaned — tuwunel uses native password auth, so nothing consumed it. Deleted application `matrix` + OAuth2 provider pk=6 via the Authentik API (user-confirmed). Drop the stale Matrix rows from the SSO reference tables and update the plan's residual list. Doc-only [ci skip]. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 12:32:49 +00:00
Viktor Barzin	23602f393e	matrix: migrate Synapse -> tuwunel (Rust homeserver, fresh start, federated) Replace the cramped Synapse deployment with tuwunel v1.7.1: embedded RocksDB drops the CNPG dependency (both init-containers, the db ESO, the Reloader annotation all gone), env-var config, fsGroup-owned encrypted PVC, federation on, tuwunel-served well-known delegation to :443. server_name unchanged (matrix.viktorbarzin.me); fresh start (no Synapse->RocksDB migration path). Registered @viktor admin then disabled registration (403). Cleanup: removed the orphaned pg-matrix Vault static role and dropped the matrix Postgres DB/role; updated service-catalog, upgrade-config, CLAUDE.md PG-rotation list, and the Matrix OIDC->orphaned auth notes. Design+plan in docs/plans/2026-06-08-matrix-synapse-to-tuwunel-*. Already applied via scripts/tg (matrix tier-1 + targeted vault tier-0), so [ci skip] to avoid CI reconciling an unrelated pre-existing vault OIDC tune-TTL drift. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 11:58:17 +00:00
Viktor Barzin	d4ec5768b2	vault-token-renew: version the devvm renewer + user units in the repo The devvm periodic Vault admin token (token-devvm-wizard, period=768h, policies default+sops-admin+vault-admin) is kept alive by a systemd user timer, but the renewer script + units lived only under ~/.local/bin and ~/.config/systemd/user — lost on a devvm rebuild. Move them into the repo as the source of truth so a rebuild can restore them. (version-only scope: behavior unchanged; no canonical-file/self-heal added.) - scripts/vault-token-renew.{sh,service,timer}: renewer + user units, refactored into pure drift-guard functions + a guarded main (behavior identical; deployed live and verified still renewing with full write access). - scripts/test-vault-token-renew.sh: unit-tests the drift guard + lookup-JSON parsing, incl. the 2026-06-05 woodpecker-clobber case (17 assertions). - docs/runbooks/vault-token-renew-devvm.md: deploy, mint/re-mint, health-check, drift recovery. - docs/architecture/secrets.md: correct the stale '~/.vault-token = OIDC token' description for devvm. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 22:10:06 +00:00
Viktor Barzin	9529eedfe0	docs(security): bot-block-proxy is a no-op while poison-fountain is at 0 [ci skip] Reflect commit b6dd23b1: bot-block-proxy short-circuits /auth to return 200 instead of proxying to the scaled-to-0 poison-fountain. - security.md Layer 1 + tarpit description + troubleshooting (fix stale stacks/platform path -> traefik stack; drop misleading restart-poison-fountain step). - .claude/CLAUDE.md: add matrix to PG rotation list; document that startup-read secret consumers need a Reloader annotation (matrix root cause, found via Loki 2026-06-05). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-06 16:51:26 +00:00
Viktor Barzin	d808694af4	docs(storage): record harden-half shipped (orphan cleanup + ghost-reconcile) 2a orphan cleanup (67 Released PVs + 475 LVs removed, VG pve 997->~410) + 2b csi-ghost-reconcile CronJob done — ghost-disk doom loop closed by construction, beads code-dfjn retireable. Cap kept at 28 (lowering would reverse the 2026-05-25 eviction-cascade post-mortem fix). Phase-1: insta2spotify migrated (noted its 3.26GB image re-pull blip on node reschedule). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 21:39:36 +00:00
Viktor Barzin	63182730f9	docs(storage): record Wave-2 NFS migration + harden-proxmox-csi decision (option 1) Document the 2026-06-05 decision to keep proxmox-csi and harden it (keep PVC mobility, no hardware) over TopoLVM (pins to node) / Longhorn (2x writes on single shared HDD). Wave-2 moved 5 non-DB workloads off block to NFS (tandoor, speedtest, hackmd, changedetection, send), freeing 5 LUN slots. - storage.md: live PVC counts, Retain-policy/orphan-LV note, Wave-2 history, updated cap-relief levers - topolvm-evaluation.md: stamped NOT ADOPTED with rationale + pointer to the decision doc Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:15:21 +00:00
Viktor Barzin	c24b4a21d8	docs(architecture): fix stale 5-node claim -> 7 nodes (k8s-node1..6) [ci skip] Cluster grew to 7 nodes (k8s-master + node1..6; node5/6 added ~10d ago) but several docs still said "5 nodes". Corrected with live specs: - overview.md: 7-node enumeration; node1 is 16c/48GB (doc wrongly said 32GB), node2-6 are 8c/32GB general workers - compute.md: "5-node" -> "7-node" cluster description - dns.md: NodeLocal DNSCache DaemonSet "5 nodes" -> "7 nodes" - mailserver.md: HAProxy backend diagram "node1..4" -> "node1..6" Illustrative "0/5 nodes available" scheduler-error examples left as-is. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 20:03:58 +00:00
Viktor Barzin	52f5de905d	docs(context): freshen infra glossary (modules, tiers, new concepts) [ci skip] Refresh CONTEXT.md against current repo + cluster reality (grill-with-docs): - Module taxonomy rewrite: drop fictional k8s_app/helm_app/postgres_app factory modules (never existed); name the real four (ingress_factory, nfs_volume, anubis_instance, setup_tls_secret) + the shared / Stack-local / flat distinction; flag vestigial modules/kubernetes/<app> dirs. - Rename "Ingress auth tier" -> "Ingress auth" (discrete modes, not tiers); reserve "tier" for State tier + Namespace tier only. - Add local-path entry (cluster default SC; node-local footgun warning). - Add concepts: Keel, Diun, CNPG/pg-cluster, MetalLB LB-IP split, Calico. - Add "policy" ambiguity flag (Kyverno vs Calico NetworkPolicy vs Vault/RBAC). - Fix node count 5 -> 7 (k8s-master + k8s-node1..6). Doc-sync (same commit per repo rules): - overview.md: replace fictional factory modules with the real shared modules + the flat/stack-local pattern. - .claude/CLAUDE.md: drop dead nfs-proxmox column from the storage decision table + stale cross-reference (vault migrated off it 2026-04-25). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 19:34:49 +00:00
Viktor Barzin	dbe115910f	monitoring: add local-only prometheus-query.lan ingress for ha-sofia SNMP sensors ha-sofia's 7 R730 REST sensors (CPU/exhaust/inlet temp, power, 2x PSU voltage, fan) read the iDRAC via the slow on-demand Redfish exporter (scan_interval 120, ~16-22s/fetch, intermittent `unavailable` blips). Migrated them to a FAST Prometheus query of the SNMP values (instant, ~1m-fresh from the snmp-idrac scrape), scan_interval 30. This adds the enabling ingress: `prometheus-query.viktorbarzin.lan` → `prometheus-server:80`, auth=none, allow_local_access_only, path-scoped to `/api/v1/query` (read-only instant-query only — not the UI/admin/federation). ha-sofia can't use `prometheus.viktorbarzin.me` (Authentik-gated, no OIDC from a REST sensor), so this mirrors the existing local-only `.lan` exporter ingresses HA already queries. The ha-sofia REST file (`/config/rest_resources/idrac_redfish_exporter.yaml`) was edited in place (auto-version-controlled by the HA version-control add-on; pre-migration copy at `/config/idrac_redfish_exporter.bak-pre-snmp`). The Technitium CNAME `prometheus-query.viktorbarzin.lan -> ingress.viktorbarzin.lan` was added manually via the API — like the other `.lan` exporter hosts it is NOT auto-synced (the technitium-ingress-dns-sync CronJob only creates `.me` records). Follow-up (already noted for the Loki sensor): extend that sync to manage `.lan` CNAMEs too. The Redfish remnant's `sensors` collector is now vestigial (HA no longer reads it). Verified: all 7 HA sensors report correct fresh values from Prometheus (fan 10800 rpm, CPU 62.0C, power 280W, PSU 230/240V). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 17:25:06 +00:00
Viktor Barzin	b7cb74f1b5	docs(monitoring): cluster log aggregation (Alloy fix) + Cluster Logs dashboard + HA sensors [ci skip] Document the 2026-06-05 cluster-wide log observability work: the Alloy local.file_match fix (loki.source.file doesn't expand globs) + stage.cri, the new "Cluster Logs" Grafana dashboard, the ha-sofia cluster-log-health REST sensors, and the loki.viktorbarzin.lan Technitium-record follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 17:15:57 +00:00
Viktor Barzin	6b1d23abbd	monitoring: migrate R730 iDRAC scraping to SNMP (fast primary) + thin Redfish remnant The Redfish exporter (mrlhansen, metrics:all:true) walked every BMC subtree on each scrape — ~18.5s avg / 28s peak against the slow iDRAC — forcing a 3m interval. Moved the fast path to SNMP via the (previously unmounted) dell_idrac module: ~3.7s/scrape at 1m. - snmp_exporter: merge dell_idrac into ups_snmp_values.yaml; hand-add fan-RPM (coolingDeviceReading + location lookup) and an amperageProbeLocationName lookup so the "System Board Pwr Consumption" watts probe is label-selectable. - snmp-idrac job: params module=dell_idrac, auth=public_v2, 1m/30s — now the primary source for health/thermal/power/fan/voltage (relabeled r730_idrac_*). - Re-point 9 iDRAC alerts to SNMP metrics + DellStatus enums (OK=3, on=4) and fix the misnamed iDRACSNMPMetricsMissing/iDRACRedfishMetricsMissing probes. - Re-point Grafana panels (idrac.json, cluster_health.json) to SNMP names; temps ÷10 (tenths-degC); DellStatus value-mappings updated. - Demote the Redfish exporter to a slow remnant: trim collectors to system/sensors/power/storage/network/memory, scrape 3m->10m. Kept only for metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS, per-drive table) AND to keep HA Sofia's sensor.r730_fan_speed working — it reads idrac_sensors_fan_speed from the exporter directly, so no ha-sofia change. SSD-wear alerts + SEL panel left as-is (already inert/empty today). Verified live: snmp-idrac up, scrape 3.7s, all 9 re-pointed alerts resolve without firing, HA fan metric (idrac_sensors_fan_speed=6) intact. Design/plan + as-built docs: docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md, docs/architecture/monitoring.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 16:33:20 +00:00
Viktor Barzin	722a1c9b42	docs(monitoring): document rpi-sofia off-box monitoring + log shipping [ci skip] Add an "External host: rpi-sofia" section to docs/architecture/monitoring.md covering the 2026-06-05 setup: node_exporter + vcgencmd textfile metrics; the full-journal promtail->Loki shipping (job=rpi-sofia-journal — kernel/dmesg via the (none) unit + all systemd units, labeled by unit/level); the RPi Sofia alert group; the dashboard; and the systemd watchdog. Notes the SD-card root cause and that the Pi-side config is hand-managed + backed up off-box. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 14:25:20 +00:00
Viktor Barzin	3796a84e04	docs: f1-stream is Woodpecker-native (Forgejo viktor/f1-stream), not GHA/repo-10 f1-stream was extracted to its own Forgejo repo + deployed from the Forgejo registry (2026-06-05). Correct the stale "Migrated to GHA / repo id 10" claims: - CLAUDE.md + ci-cd.md: move f1-stream from the GHA list to the Woodpecker-native owned-app group; note old github source archived + GHA Woodpecker repo 10 deactivated; f1-stream is now Woodpecker repo 166. - service-catalog: note the source repo + deploy model.	2026-06-05 09:19:12 +00:00
Viktor Barzin	147a8cff40	Restore f1-stream stack — undo accidental bundling into 63fe7d2b Commit 63fe7d2b (fan-control) was made with a bare `git commit` in the shared infra working tree and inadvertently swept in a parallel session's staged f1-stream-extraction work (main.tf repoint, ~48 files/ removals, ci-cd.md + .claude docs, two extraction plan docs). This returns every f1-stream-related path to its pre-63fe7d2b state (3493c347) so that extraction can be committed cleanly by its own session. The fan-control files added in 63fe7d2b are untouched. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:12 +00:00
Viktor Barzin	90ad6b9125	fan-control: presence-aware IPMI fan curve for the R730 PVE host The iDRAC stock curve runs the CPU at ~72°C on the 7080 RPM floor even under load (optimises for quiet, not cool). Add a bash daemon + systemd unit that drives the chassis fans from CPU temp on two curves, picked by garage occupancy (the server is in the garage): COOL when empty (measured ~58-65°C under load), QUIET near the silent floor when the ha-sofia garage door shows someone is there (open, or <15min since last activity). Manual fan mode is backstopped: bash EXIT trap + systemd ExecStopPost hand fans back to Dell auto on stop/crash; CPU>=83°C or repeated IPMI failures do the same. Pushgateway metrics (job=fan_control). 36 unit tests cover the pure curve/hysteresis/presence/parse logic; DRY_RUN + RUN_ONCE for integration checks. Deployed and verified on 192.168.1.127 (CPU 70->58°C in cool mode, hysteresis stepping confirmed). Design: docs/plans/2026-06-04-pve-fan-control-design.md Runbook: docs/runbooks/fan-control.md [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	8f13fdeaf7	docs: dashboard SA cluster-read tightened to namespace-list + nodes only [ci skip] Reflect the dashboard-nav-readonly ClusterRole: namespace-owners can list namespaces/nodes (for dashboard nav) but not read other tenants' resources. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:11 +00:00
Viktor Barzin	c4bd64f88a	docs: dashboard now auto-injects per-user SA token (no token-paste) Update authentication.md, multi-tenancy.md, service-catalog, add-user skill to reflect the token-injector (X-authentik-username -> SA token -> Bearer). Note the extra k8s-dashboard apply needed when onboarding a namespace-owner (injector map regen). [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	8e44ccaa65	docs: dashboard access is forward-auth + token-paste (OIDC SSO blocked) Correct the docs I'd written for the (reverted) oauth2-proxy SSO. Reality: apiserver OIDC rejects all Authentik tokens (design §12), so the dashboard uses forward-auth (admits kubernetes-* groups) + per-namespace SA token-paste. Updates authentication.md, multi-tenancy.md, service-catalog, authentik-state, and add-user skill (onboarding now documents the dashboard token). oauth2-proxy + k8s-dashboard OIDC app noted as idle. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:10 +00:00
Viktor Barzin	4aa6e7a5af	chrome-service docs: clarify f1-stream is not a real caller stacks/f1-stream/files/backend/playback_verifier.py and chrome_browser.py describe an in-cluster CDP caller, but the deployed f1-stream image is built from github.com/ViktorBarzin/f1-stream which has neither file — verified by `kubectl exec ls /app/backend/` and grepping for 'CHROME' in the deployed pod. The infra/stacks/f1-stream/files/backend/ tree is a vestigial design that was never wired up to a build pipeline. Calling it out so the next reader doesn't waste time debugging why the migration "didn't take effect" — it took effect on dead code. The hourly snapshot-harvester CronJob is the only live in-cluster caller of the CDP endpoint today.	2026-06-05 09:19:10 +00:00
Viktor Barzin	deede6dd11	chrome-service: switch to CDP + persistent profile + hourly snapshot pipeline The chrome-service stack ran `playwright launch-server`, which creates ephemeral browser contexts per `connect()`. Despite the encrypted PVC mounted at /profile, no chromium user-data ever persisted — only npm cache + fontconfig. Logging in via noVNC was effectively a no-op. Refactor: - Replace launch-server with direct chromium (TCP CDP on :9223 internal), fronted by a Python HTTP+WS bridge on :9222 that rewrites the Host header to bypass Chrome's hardcoded DNS-rebinding protection (no `--remote-allow-hosts` flag exists in stock Chrome 130; verified by binary string grep). Bridge also forces Connection: close on HTTP responses so Node ws opens a fresh TCP for the WS upgrade rather than trying to reuse the dead keep-alive socket. - Add `--user-data-dir=/profile/chromium-data` so cookies/localStorage actually persist on the encrypted PVC. - New snapshot-server sidecar (stdlib python HTTP) serves GET /api/snapshot at chrome.viktorbarzin.me/api/snapshot, bearer-token-gated by the existing api_bearer_token. - New chrome-service-snapshot-harvester CronJob (hourly) connects via CDP, dumps storage_state() (cookies + localStorage), writes atomically to /profile/snapshots/storage-state.json. - NetworkPolicy: TCP/9222 (was :3000), TCP/8088 added for traefik. Caller migration: - f1-stream: `chromium.connect(ws_url)` → `chromium.connect_over_cdp(cdp_url)`, env var CHROME_WS_URL → CHROME_CDP_URL. CHROME_WS_TOKEN dropped (no longer used by code; ExternalSecret kept for symmetry with the snapshot endpoint). Dev-box side (out of scope for this commit — see ~/.config/systemd/user/): - playwright-mcp.service flips to `--isolated --storage-state=...` so per-Claude-Code-session ephemeral contexts seed from the snapshot. - playwright-snapshot-refresh.{service,timer} (hourly) pulls the snapshot via the bearer-gated HTTPS endpoint. Docs updated: - docs/architecture/chrome-service.md — new architecture diagram + wire protocol. - docs/runbooks/chrome-service-snapshot.md — day-2 ops (refresh, rotation, failure modes, restore). - stacks/chrome-service/README.md — connect_over_cdp recipe. Design spec at docs/superpowers/specs/2026-06-04-playwright-per-session-browser-design.md.	2026-06-05 09:19:10 +00:00
Viktor Barzin	ad3432d685	docs(k8s-dashboard): dashboard SSO as-built (Option B multi-issuer apiserver) Update authentication.md (structured multi-issuer AuthenticationConfiguration + dashboard SSO flow), multi-tenancy.md (web dashboard access), authentik-state (new k8s-dashboard app + gheorghe groups), service-catalog (dashboard auth), and the k8s-version-upgrade runbook (kubeadm wipes --authentication-config → re-apply rbac post-upgrade). Design/plan addenda record the issuer-constraint pivot from the original dual-aud approach. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:09 +00:00
Viktor Barzin	f201e4573e	immich: fix slow context search — prewarm clip_index + latency alert/healthcheck Context (smart) search latency was caused by the 665MB vchord clip_index decaying out of PG shared_buffers (~33% resident -> ~1.8s cold ANN reads vs ~4ms warm), NOT by yesterday's ML MODEL_TTL/clip-keepalive change (CLIP textual is warm ~15ms on GPU). The postStart prewarm runs once at pod start and pg_prewarm.autoprewarm only re-warms at startup, so the index decays under job buffer-pressure over days. - clip-index-prewarm CronJob (immich, /5): pg_prewarm('clip_index') keeps the whole index resident -> searches stay ~4ms. - immich-search-probe CronJob (immich, /5): times a random-vector ANN query + reads clip_index residency, pushes gauges to the Pushgateway. - Prometheus alerts ImmichSmartSearchSlow / ImmichClipIndexColdCache / ImmichSearchProbeStale (+ inhibition when the probe is stale). - cluster_healthcheck.sh check #46 check_immich_search (TOTAL_CHECKS 45->46). - Docs: infra CLAUDE.md immich note, monitoring.md, cluster-health skill. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 09:19:07 +00:00
Viktor Barzin	98f29edf34	technitium: CoreDNS rewrite forgejo.viktorbarzin.me -> Traefik ClusterIP In-cluster pods resolved forgejo.viktorbarzin.me to the public IP (176.12.22.76) and hairpinned out through the WAN gateway, intermittently timing out buildkit pushes from Woodpecker build pods (which, unlike kubelet, don't use the per-node containerd Forgejo mirror). This silently failed CI build-and-push for Forgejo-hosted repos (recruiter-responder pipelines #15-#18 at the push step). Add a CoreDNS `rewrite name exact forgejo.viktorbarzin.me traefik.traefik.svc.cluster.local` so pods resolve to the Traefik ClusterIP (reachable in-cluster, unlike the ETP=Local LB .203; the Service-name target auto-tracks the ClusterIP so it can't rot on a Traefik renumber). Traefik's *.viktorbarzin.me wildcard keeps SNI/TLS valid. Makes the per-pod woodpecker-server hostAlias belt-and-suspenders. Applied via targeted apply (coredns ConfigMap only, to avoid reconciling 7 unrelated pre-existing drifts in the stack) + verified: - pod resolves forgejo.viktorbarzin.me -> 10.111.111.95 (Traefik ClusterIP) - recruiter-responder pipeline #20 build-and-push succeeds via ClusterIP Docs: networking.md (K8s cluster DNS path) + .claude/CLAUDE.md (forgejo registry quick-ref). Advances beads code-yh33. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 07:34:30 +00:00
Viktor Barzin	7d7a0ad474	infra: fix stale Traefik LB-IP refs + accurate LB-IP registry Part of the L4 LB-IP review (docs/plans/2026-06-03-lb-ip-hygiene-design.md). The 2026-05-30 Traefik .200->.203 move left consumers pointing at the dead .200; this fixes the two in-Terraform ones and replaces the stale networking doc with an accurate registry + a renumber checklist. - woodpecker: forgejo.viktorbarzin.me hostAlias hardcoded 10.0.20.200 (.200:443 refuses TLS now; the next woodpecker apply would re-pin it and break pipeline creation). Now reads the Traefik ClusterIP dynamically via a kubernetes_service data source -- cannot rot on a future renumber and avoids the ETP=Local hairpin trap. - monitoring: ViktorBarzinApexDrift alert summary said "expected 10.0.20.200" -> 10.0.20.203 (cosmetic; alert logic already correct). - docs/architecture/networking.md: rewrote the MetalLB section (it wrongly had KMS on .200, mailserver on a LB IP, "two dedicated") into an accurate 4-IP registry + LB-IP renumber checklist (in-band + out-of-band consumers). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 10:24:25 +00:00
Viktor Barzin	c7cf21a986	Revert mail LAN-redirect approach; pending VIP-based redesign The pfSense NAT rdr rules added in f7cf9f07 hardcoded 10.0.20.203 (Traefik LB IP) as the redirect source. That couples mail's LAN path to Traefik's IP choice — if Traefik moves again (it just moved .200 → .203 on 2026-05-30), the mail path silently breaks. Removing the script and the matching doc paragraph; keeping the networking.md .200 → .203 staleness fix (separate correction). Follow-up: give the mail HAProxy listener a dedicated pfSense Virtual IP (IP Alias on opt1), update Technitium internal zone + WAN port-forwards to target the VIP, so mail's LAN-side path is decoupled from any other service's LB IP.	2026-06-03 10:24:25 +00:00
Viktor Barzin	922d95af9c	Reapply "tripit: Gmail ingest (12-month) + vbarzin owner + plans@ forward-to-parse" This reverts commit a82ba46ad83e85a231d839564c2f009c700dc4d1.	2026-06-03 10:24:25 +00:00
Viktor Barzin	f0843e398b	Revert "tripit: Gmail ingest (12-month) + vbarzin owner + plans@ forward-to-parse" This reverts commit 4cc9229e716b6683418a148a0f896442d5ab07ad.	2026-06-03 10:24:25 +00:00

156 commits