infra

Author	SHA1	Message	Date
Viktor Barzin	318ce9b909	Merge remote-tracking branch 'forgejo/master' into wizard/breakglass-redesign	2026-06-11 18:23:40 +00:00
Viktor Barzin	df332b59e6	break-glass SSH: drop port-knock for exposed key-only :52222; version host config Viktor got locked out of the break-glass path (forgot the port-knock setup) and deleted the edge-router forwards, then asked to review and redesign it from scratch. Root cause of the lockout: the knock added no real security (key-only SSH is already brute-force-proof) and its only benefit — hiding the port — came at the cost of a circular dependency. The knock sequence lived only in in-cluster Vault, which is unreachable in the exact away/cold scenario break-glass exists for. So the unlock secret was unavailable precisely when needed. New model (self-contained, nothing to remember): plain key-only SSH on the Proxmox host's :52222, openly reachable. The edge router forwards WAN tcp/52222 -> 192.168.1.127:52222 (external port MUST equal internal on the TP-Link AX6000 - it rejects remaps; port 22 itself is reserved). The exposed port trusts only a dedicated break-glass key via `Match LocalPort` (a leak of any other root key does not grant internet access), rate-limited (iptables hashlimit) + fail2ban. - Removed knockd (package + config) and the legacy Synology SSH forward (ext 3333 -> .13:22, a needless WAN exposure the original plan wanted gone). - Fixed the fail2ban jail for Debian 13 (auth logs under sshd-session, not sshd - the stock journalmatch silently never banned). - Versioned the host config in scripts/ (it was applied ad-hoc, never committed) and recorded the deliberate Wave-1 "no public-IP" exception in security.md + .claude/CLAUDE.md. Superseded the 2026-05-30 port-knock design docs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:23:39 +00:00
Viktor Barzin	7a1cc64898	kyverno: re-trigger apply of tts GPU-priority exclusion (`87702bdc` was [ci skip]'d) Viktor's tour-guide redo (tripit#26): the Chatterbox TTS go-live commit `87702bdc` carried [ci skip], so CI never applied the kyverno change that keeps the tts namespace out of low-GPU-priority injection. This comment-only commit makes CI apply the already-committed change — step 1 of the kyverno -> tts -> tripit apply order. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:23:29 +00:00
Viktor Barzin	50eff3ca39	tripit: enable real tour-guide content providers (wikipedia discovery, web sources, chat writer) Viktor's tour-guide redo (tripit#24, slice tripit#25): the feature shipped dark on 2026-06-08 because these three env vars were never set, so prod ran the fake test-fixture providers — the only sight users ever saw was the placeholder 'Sight 1' narrated by browser TTS. Flips discovery to Wikipedia GeoSearch, story material to the five real web sources, and script-writing to claude-agent-service (token already present in tripit-secrets). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:22:10 +00:00
Viktor Barzin	e2788d1b2d	workstation: lean managed-settings claudeMd — org red-lines + pointers [ci skip] Viktor's agent-rules cleanup: the org claudeMd now carries only governance red-lines (RBAC tiers, per-user secrets, Terraform-only, git audit-trail rules, code-layout detection) and points to ~/.claude/rules/execution.md for the worktree lifecycle, which was previously duplicated here in full. Settings precedence and the model key are unchanged. Also refreshes a .gitignore comment that cited the old execution.md section numbering. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:02:43 +00:00
Viktor Barzin	c3a63fcd38	apply-mbps-caps: compare normalized option sets (true idempotency) + devvm I/O-stall post-mortem [ci skip] The raw string compare never matched qm config's canonical key order, so the hourly timer re-issued 'qm set' against every running capped VM, live-rewriting QEMU throttle state via QMP 24x/day. Implicated in today's devvm freeze (15:21-16:48 UTC): the guest's disk I/O stalled inside QEMU (blockstats frozen at 0 while QMP stayed responsive) on the legacy lsi controller path with no iothread. Viktor asked to root-cause the freeze before choosing fixes, then approved mitigating via VM settings: this commit fixes the hourly trigger and documents the incident; the controller swap (virtio-scsi-single + iothread=1 + aio=threads) is staged on VM 102 separately, pending his cold stop/start. Adds docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md (evidence chain, ruled-out causes, capture-before-kill autopsy steps) and syncs compute.md + proxmox-inventory.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:00:08 +00:00
Viktor Barzin	2e0cebff87	docs: sync compute/storage/proxmox-inventory with live state (memory audit) [ci skip] Viktor asked to go through the agent's stored infra facts and straighten out anything wrong about what-is-where. Cross-checking docs against the live cluster surfaced doc drift alongside the stale memories: - compute.md: add k8s-node5/6 (joined 2026-05-26) to diagram + node table; totals 48 vCPU / ~176GB -> 64 vCPU / ~240GB; cluster version v1.34.2 -> v1.34.8 (live-verified) - storage.md: the nfs-proxmox StorageClass no longer exists (removed 2026-04-25, commit `484b4c71`) — nfs-truenas is the only NFS SC; fixed three spots that told readers to use nfs-proxmox - proxmox-inventory.md: k8s VM RAM rows live-verified via kubectl (master 32G, node1 48G, node2-4 32G — the old 16/32/24G figures predated the 2026-04-02 resize), added node5/6 rows, devvm swap 8G -> 14G (grown 2026-06-10), recomputed total (~288GB nominal of 272GB physical, overcommitted) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 17:50:43 +00:00
Viktor Barzin	9b19caff47	t3: connection logging across the path for drop attribution Viktor asked to add connection logs (Traefik/Cloudflare) to catch the real-path t3 WS drops: a direct-to-t3-serve browser ran 40 min clean while real tunnel sessions cycle every 15-35s, so the drop originates above t3-serve and we need to see which layer cuts the socket. Traefik (/ws duration) and cloudflared (WS close events) already ship to Loki; the gap was the devvm side. This adds: - t3-dispatch logs every /ws open/close with dur_ms + cause: downstream_closed (client/CF/Traefik hung up = last-mile/network), upstream_closed (t3-serve closed/reset), or graceful. Graceful closes previously left no trace (default ReverseProxy only logs on error), so a watchdog-driven reconnect was invisible. Helpers unit-tested. - devvm-promtail.{yaml,service}: ships devvm journald (t3-dispatch + t3-serve@<user>) to cluster Loki as job=devvm-journal, mirroring the pve/rpi-sofia shippers. devvm was never in Loki (standalone VM). Joined in Loki the three layers attribute any future drop to a segment with no repro needed. Runbook + service-catalog updated.	2026-06-11 13:48:10 +00:00
Viktor Barzin	933e4649fb	Merge remote-tracking branch 'forgejo/master' into wizard/authentik-signin-speed	2026-06-11 00:35:56 +00:00
Viktor Barzin	b3ef0dba76	authentik: ignore Keel-managed image_pull_policy on pgbouncer Keel flip-flops the pgbouncer container's imagePullPolicy, so the declared Always kept re-diffing on every plan. Ignore it like the image tag (KEEL_IGNORE pattern) — plan-to-zero restored. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 00:34:44 +00:00
Viktor Barzin	4e88298976	authentik: incident hardening after the signin-speedup rollout storm The first apply of the signin-speedup change triggered a ~50min authentik outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2) silently DOWNGRADED the Keel-managed live image (2026.2.4) against an already-migrated DB, default liveness probes kill-looped pods queuing on authentik's migration advisory lock, and kills mid-migration left ghost idle-in-transaction sessions holding that lock. Full analysis in docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md. Hardening (all root causes): - values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4) so helm applies can never downgrade under Keel again - values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s) - values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits) - pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders; pgbouncer.tf gets a config-checksum annotation so ini changes roll pods - authentik_provider.tf: drop the completed import stanza (adoption rule) - traefik: suppress pre-existing keel.sh annotation/tier-label drift on auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1 pattern) so applies stop stripping live Keel state Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 00:26:52 +00:00
Viktor Barzin	bd60c3d5e0	pve-host/dns: register loki.viktorbarzin.lan CNAME, drop the /etc/hosts pin Follow-up to the pve-host Loki shipper (`aac807fb`). The host reached Loki via an /etc/hosts pin of the Traefik LB IP — Viktor flagged that as the wrong solution (no hardcoding; the DNS infra should handle it). Registered loki.viktorbarzin.lan in Technitium as a CNAME -> ingress.viktorbarzin.lan (the anchor whose A record auto-tracks the live Traefik LB IP, so it's renumber-proof), via the Technitium API + zone-sync to all 3 instances. Removed the /etc/hosts pin from the PVE host; promtail now resolves the name purely via DNS (verified still shipping to Loki). insecure_skip_verify stays — the internal .lan cert isn't publicly trusted. Docs (monitoring.md) + the pve-promtail.yaml header updated to drop the pin references. The DNS record is API-managed (the viktorbarzin.lan zone convention), not in this repo; auto-managing .lan CNAMEs in technitium-ingress-dns-sync remains a noted follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 22:55:20 +00:00
Viktor Barzin	97ccdbecb8	authentik: speed up first-time signin (single-screen login, live env tuning, asset caching, outpost+nginx hot path) Viktor asked to review Authentik and the web tier and make first-time signin to apps faster. Review found the slowness is screens and round trips, not server time. Changes: - values.yaml: the authentik.* Helm values (gunicorn workers, cache timeouts, conn_max_age) were silently INERT because existingSecret skips chart env rendering — pods ran defaults (2 workers, 300s caches, no persistent DB conns). Moved all tuning into server.env/worker.env, which actually reaches the pods. - authentik_provider.tf: adopt the identification stage and pin password_stage so username+password render on ONE screen (the separate order-20 password binding is deleted via API — authentik requires that when embedding). Outpost log_level trace->info and 1->2 replicas (it is on the hot path of every forward-auth request; PG-backed sessions make 2 replicas safe). - authentik module: /static ingress carve-out with immutable Cache-Control (assets are version-fingerprinted but served with no max-age — internal split-horizon users got zero caching). - traefik auth-proxy nginx: upstream keepalive 32 + HTTP/1.1 (was opening a fresh TCP connection to the outpost per subrequest) + config-checksum annotation so config changes roll the pods. - docs: authentication.md + authentik-state.md updated; fixed stale 'postgresql.dbaas has no endpoints' claim in CLAUDE.md/CONTEXT.md (it is a live CNPG primary-selector compatibility service). Done via API in the same change (UI-managed objects): 6 OIDC providers (Vault, Forgejo, Immich, Headscale, linkwarden, Cloudflare Access) switched from explicit to implicit consent — all first-party, the 4-weekly consent screen only slowed first-time signin. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 21:58:10 +00:00
Viktor Barzin	93ba67c84a	devvm: install prometheus-node-exporter (was never installed) The monitoring stack now scrapes devvm (job 'devvm') for the t3 drop attribution work, but the box had no node_exporter at all — installed via apt and persisted here so reprovisioning keeps it.	2026-06-10 21:29:17 +00:00
Viktor Barzin	046a4a32f3	Merge remote-tracking branch 'forgejo/master' into wizard/t3-disconnect-fixes	2026-06-10 21:26:10 +00:00
Viktor Barzin	70442ccdc6	t3-probe: fix aiohttp 3.9 compat (ClientWSTimeout is 3.10+) Bound connection establishment via session ClientTimeout(total=None, connect=15) instead — works on 3.9 through current; total must stay None or the session timeout would kill the long-lived probe WS. Verified by a local 14s smoke run: cloudflare + internal legs both connect.	2026-06-10 21:26:09 +00:00
Viktor Barzin	4af5eff043	docs(multi-tenancy): note the on-demand web restore button The tmux-persist paragraph only described the boot-time restore. Document the new manual path — the web terminal's "Restore sessions" button (tmux-api POST /restore -> tmux-restore-user wrapper -> `tmux-persist restore <user>`) — and why it exists: an OOM that kills a user's tmux server WITHOUT a reboot never triggers the boot-only restore service, which is the common case under multi-user memory pressure. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 21:22:41 +00:00
Viktor Barzin	a734155fb5	Merge remote-tracking branch 'forgejo/master' into wizard/t3-disconnect-fixes	2026-06-10 21:11:30 +00:00
Viktor Barzin	9b55d53be0	t3: differential drop-attribution probe + devvm metrics Closes the loop on Viktor's ask to find the t3 disconnect root cause and definitively rule infra in or out. Server logs alone cannot separate 'client network broke' from 'Cloudflare/tunnel broke' from 't3-serve stalled' — every cause collapses into the same 20s-watchdog reconnect. The t3-probe (stacks/t3code) holds three permanent legs that differ only in path segment: 'cloudflare' (WS via DoH-resolved public DNS -> WAN -> CF edge -> tunnel -> Traefik -> dispatch), 'internal' (same WS pinned to the Traefik LB, no Cloudflare), 't3serve' (HTTP straight to the serve process). Whichever leg drops convicts its segment; all legs clean while a user drops exonerates infra with data. Dispatch gains an unauthenticated /probe/ws echo + /probe/healthz (gorilla/websocket, test-first) behind an auth=none path carve-out, guarded by the authentik-walloff probe. Also starts scraping devvm's node_exporter (job 'devvm') — it ran unscraped, so the box whose memory/IO stalls cause the drops had zero pressure history. Alerts T3ProbeLegDown + T3ProbeDropBurst; runbook docs/runbooks/t3-drop-attribution.md.	2026-06-10 21:11:29 +00:00
Viktor Barzin	ecef09ab87	tmux-persist: add single-user restore mode (`restore [user]`) The web-terminal will get a "Restore sessions" button (common ask after an OOM kills a user's tmux server without a reboot, which the boot-only restore service doesn't catch). The button needs to restore ONE user's saved sessions on demand, so teach `restore` an optional <user> argument: with no arg it restores every terminal user (unchanged — the boot service path), with a <user> arg it validates the name against /etc/ttyd-user-map and restores only that user. Reuses the existing restore loop (single source of restore truth). The terminal-lobby tmux-api will invoke this as root via a validated tmux-restore-user sudo wrapper. Verified: bad user exits 2 (won't fall back to restoring everyone), no-arg path unchanged, shellcheck clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 21:08:57 +00:00
Viktor Barzin	b5c6639272	t3-serve@: contain agent memory storms; survive child OOM kills Same t3-disconnect root-cause work: a runaway claude agent child grew to 10.8G anon RSS inside t3-serve@wizard's cgroup, swap-thrashed devvm off its spinning disk (system-wide multi-10s freezes = every t3 client's 20s watchdog firing = the 'frequent disconnects that self-recover'), then the global OOM at 2026-06-10 19:56 took the whole unit down for 8.5min because the default OOMPolicy=stop fails the unit when ANY cgroup child is OOM-killed. Cap the cgroup (MemoryHigh=12G, MemoryMax=16G), forbid swap so stalls can't smear into minute-long freezes, and OOMPolicy=continue so a runaway agent dies alone while the WS server keeps serving.	2026-06-10 21:00:06 +00:00
Viktor Barzin	d5fdc7ffe9	cloudflared: disable in-place autoupdate (--no-autoupdate) Viktor asked to root-cause the frequent t3 code disconnects and rule infra in or out. The tunnel pods ran bare 'cloudflared tunnel run': every Cloudflare release made the binary self-update and exit (code 11), restarting all 3 pods and severing every WebSocket riding the tunnel — one of the confirmed infra-side drop causes (pods cycled 2026-06-09 20:55/21:00 and 2026-06-10 02:31). Updates belong to pod image rollouts, not in-place binary swaps.	2026-06-10 21:00:05 +00:00
Viktor Barzin	ac6f19dd3b	tmux-persist: never let an empty snapshot clobber a saved manifest emo's 5 web-terminal tmux sessions were OOM-killed (the server died, no reboot), and the 5-minute save tick then overwrote his session manifest with 0 bytes — wiping the record that restore needs. Root cause: the save guard only checked that the tmux socket file existed, but an OOM-killed server leaves a stale /tmp/tmux-<uid>/default behind; list-panes then returns nothing and that empty capture was installed over the good manifest. Because the restore service only runs at boot, an OOM (not a reboot) skips restore entirely, so the clobbered manifest was the only record left — and it was already gone. Fix: only overwrite <user>.tsv when the snapshot captured >=1 live session; otherwise keep the last good manifest (now covers no-server AND stale-socket/dead-server). Verified by reproducing the 0-byte clobber on the old script and confirming the new one preserves the manifest, plus a live save that still captures every active session. emo's 5 sessions were recovered from their transcripts and are back; this keeps the next OOM from destroying the manifest again. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 20:38:59 +00:00
Viktor Barzin	9fff77cbea	Merge branch 'wizard/budget-rate-limit'	2026-06-10 19:42:19 +00:00
Viktor Barzin	acb847b858	actualbudget: dedicated traefik rate-limit (50/300) for budget ingresses The Actual web app boots with ~70 near-parallel requests (55 /data/migrations/.sql + statics, all served cache-control max-age=0 so every page load re-validates them). The shared rate-limit middleware (average 10, burst 50) 429s the tail of that storm, so every cold boot shows 'Server returned an error while checking its status' and every load stalls in retry backoff — measured up to 5min stalls when two loads from one IP overlap. Viktor asked to relax the limit after the anca slow-load investigation (beads code-7zv). Same pattern as immich: dedicated actualbudget-rate-limit middleware in the traefik stack, budget- ingresses opt out of the default via skip_default_rate_limit + extra_middlewares. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 19:36:42 +00:00
Viktor Barzin	8304ef0f70	Merge origin/master (pfsense SNI-routed internal 443) into forgejo/master Reconciles the two live infra remotes after the pve-host logging change landed on forgejo (which was a commit behind origin). Non-destructive merge — keeps both `eae35c51` (pfsense webmail SNI routing) and `aac807fb` (pve-host Loki shipping).	2026-06-10 19:35:55 +00:00
Viktor Barzin	aac807fb3a	pve-host: ship journal to Loki (snoopy command audit + sshd-pve) for emo's root SSH Emo's Claude agent was given root SSH to the Proxmox host (`ssh pve`, dedicated shared-root key emo-pve-agent@devvm) so he can manage the host — e.g. the R730 fan daemon — through his agent. To keep an audit trail of what that agent does, and to feed the long-pending Wave-1 S1 security rule, the PVE host now ships its systemd journal to cluster Loki: - snoopy logs every execve() to journald (identifier=snoopy), enabled via /etc/ld.so.preload; config scripts/pve-snoopy.ini. - promtail v3.5.1 (amd64) ships /var/log/journal to Loki as {job="pve-journal"} (full host journal; filter identifier="snoopy" for the command audit), and relabels sshd auth to {job="sshd-pve"} — which ACTIVATES S1 (it was PENDING only for lack of this shipper). Config/unit: scripts/pve-promtail.{yaml,service}. S1 won't false-fire on legitimate access: the devvm SNATs through pfSense to 192.168.1.2, which is already in the S1 source-IP allowlist. Loki is reached via an /etc/hosts pin (10.0.20.203 loki.viktorbarzin.lan); follow-up noted to register a Technitium CNAME so it auto-tracks LB renumbers. Host pieces are hand-managed (not Terraform), like fan-control and the rpi-sofia promtail — these files are the source of truth. Docs updated: security.md (S1 LIVE) and monitoring.md ("External host: pve"). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 19:31:45 +00:00
Viktor Barzin	eae35c511a	pfsense: SNI-routed internal 443 — mail.viktorbarzin.me serves webmail everywhere Completes the internal port table of the mail front door (10.0.20.1): 443 was squatted by the pfSense webGUI (self-signed cert expired 2022), so internal webmail and the kuma [External] mail probe hit the firewall login instead of Roundcube — the last leg of the mail split-brain name. Design (Viktor): route by what the client asked for. New HAProxy frontend internal_https_443 (binds 10.0.20.1+10.0.10.1 :443, mode tcp): SNI present -> Traefik .203 with send-proxy-v2 (trusted, IPv6-bridge pattern, no health check per the PROXY-probe gotcha); SNI of pfsense.viktorbarzin.{lan,me} or NO SNI (bare-IP admin access) -> webGUI, which moved to :8443 (invisible to habits — https://10.0.20.1 still lands on the login page; :8443 doubles as direct fallback). The reverse-proxy pfsense ingress now targets :8443 directly. Declared idempotently in pfsense-haproxy-bootstrap.php; config.xml backed up on-box (config.xml.bak-2026-06-10-pre-sni443). Verified: bare IP -> GUI login; pfsense.viktorbarzin.lan -> GUI; pfsense.viktorbarzin.me -> 302 via ingress; mail.viktorbarzin.me -> Roundcube with STRICT cert validation; :993 IMAPS untouched. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 18:41:07 +00:00
Viktor Barzin	176a65d3d2	plotting-book: TF baseline image follows what CI actually builds Viktor asked to verify the book-plotting push->build->deploy chain. The chain itself is healthy, but the Terraform baseline image said ancamilea/book-plotter:latest while CI (GHA on PassionProjectsAnca/Plotting-Your-Dream-Book) builds and deploys viktorbarzin/book-plotter:<sha8> + :latest — a from-scratch apply would have resurrected a stale March image. Baseline now viktorbarzin/book-plotter:latest. No live change: the running tag is CI-owned via ignore_changes, plan confirms the image attr is ignored. [ci skip] deliberately: plan shows UNRELATED pre-existing drift on this stack (live ns labels managed-by=vault-user-onboarding + resource-governance/custom-quota=true would be stripped; deployment keel.sh/policy=patch annotations removed) — auto-applying that needs its own reviewed pass.	2026-06-10 18:37:14 +00:00
Viktor Barzin	5f7c2964ac	workstation: session-launch freshen follows the checked-out branch (not just master) Viktor asked to log Anca into her GitHub account so she can develop on the devvm and deploy her apps through the existing CI/CD. Her GitHub repos (Plotting-Your-Dream-Book, travel, My-Wardrobe — now cloned into her ~/code workspace) default to main, and the launcher freshen only fast-forwarded master, silently skipping them. ff the current branch's upstream instead — same safety gates (on a branch, clean tree, upstream configured, ff-only). Single-layout infra clones behave identically. [ci skip]	2026-06-10 18:20:59 +00:00
Viktor Barzin	de1d8b7bf3	technitium: add Brevo DKIM selector CNAMEs to internal zone [ci skip] The roundtrip probe kept failing after the SPF/MX fix: rspamd's actual junk-score driver was R_DKIM_PERMFAIL(+4.5) on selector brevo2 — Brevo signs with brevo1/brevo2._domainkey, which are CNAMEs to b{1,2}.viktorbarzin-me.dkim.brevo.com in public DNS and were absent from the internal zone (the earlier existence check used ANY queries, which Cloudflare refuses per RFC 8482 — false negative). The DKIM permfail also cascaded into DMARC_POLICY_SOFTFAIL(+1.5), totalling the 6.09/6.0 junk threshold; sieve filed probes into \Junk where the INBOX poll never finds them. ingress-dns-sync now maintains both selector CNAMEs. Ops notes: rspamd caches DNS (restart to flush after zone fixes); CoreDNS denial cache holds NXDOMAINs up to 300s. Verified: roundtrip SUCCESS in 20.5s. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 18:07:38 +00:00
Viktor Barzin	2825cb1703	workstation: per-user code_layout — workspace puts project repos under ~/code (ancamilea + tripit) Viktor asked to restructure Anca's setup: her ~/code WAS the infra clone itself; he wants ~/code to be the directory where all her project repos (tripit etc.) live side by side, with infra moved to a subdirectory. - roster.yaml gains per-user 'code_layout: single\|workspace' + 'repos', validated + derived by roster_engine.py (12 new tests, 40 total). - t3-provision-users reconcile: auto-migrates a single-layout ~/code to ~/code/infra (running processes follow the moved inode), hoists nested project clones to the workspace root, clones roster repos from Forgejo AS the user (their PAT makes private repos work), and wires the documented forgejo remote + forgejo/master upstream into clones that predate that contract. - Fixed a latent TSV bug: empty jq @tsv fields collapse under tab-IFS read, shifting later fields left (groups was only safe by being the last field) — emit '-' sentinels instead. - start-claude.sh session freshen is layout-aware (freshens each repo under ~/code for workspace users). - managed claudeMd + AGENTS.md non-admin recipe + multi-tenancy.md updated in the same change. Applied live: ancamilea = workspace (infra at ~/code/infra, her existing tripit clone hoisted to ~/code/tripit, master upstream switched to forgejo/master); emo stays single layout, untouched. [ci skip]	2026-06-10 18:05:31 +00:00
Viktor Barzin	3b6a5c6737	workstation: worktree-first feature work for all agents [ci skip] Viktor asked that every feature task be developed in its own git worktree and merged into master when done, enabling multiple agents to work the same project concurrently. Encode the org rule in the managed claudeMd (self-deploys to /etc via the hourly reconcile), add the worktree-first paragraph to the AGENTS.md non-admin landing recipe, and gitignore .worktrees/ so per-feature worktrees can live at the repo root. Full lifecycle: ~/.claude/rules/execution.md §3. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:49:43 +00:00
Viktor Barzin	daddafd279	docs: superset rule for the internal viktorbarzin.me zone (mail-auth records) [ci skip] Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:47:31 +00:00
Viktor Barzin	00bc1e052d	technitium: mirror mail-auth records into internal zone; fix redfish check [ci skip] Two fixes from the post-DNS-internalization health sweep: 1. The internal viktorbarzin.me zone served only ingress A/CNAME records. Since the mailserver pods now resolve the domain through it (CoreDNS viktorbarzin.me:53 -> Technitium, `59a531b8`), rspamd's SPF checks on inbound @viktorbarzin.me mail saw SPF=none and quarantined it — the Brevo email-roundtrip probe failed from the 16:20 run onward (EmailRoundtripFailing/Stale). The ingress-dns-sync CronJob now also maintains the static mail-auth records (SPF, brevo-code TXT, MX; DMARC + DKIM were already present), idempotently. Principle: the internal zone must be a SUPERSET of the public zone for every record type internal clients consume. Verified in-pod: all four types resolve; roundtrip re-probe green. 2. cluster_healthcheck #30 queried instant `up`, which goes stale for ~5 of every 10 minutes on the deliberate 10m redfish-idrac remnant job -> intermittent false "redfish-idrac=missing". Now uses last_over_time(up[15m]) — same answers for fast jobs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:46:37 +00:00
Viktor Barzin	e7fbf986fb	workstation: rename tmux persistence out of the t3 namespace [ci skip] Viktor's correction: this feature is about the tmux web-terminal sessions, not t3 — t3 auto-saves its own threads (~/.t3 state + daily t3-backup-state). Renamed t3-tmux-sessions -> tmux-persist (units tmux-persist-save.timer / tmux-persist-restore.service, state /var/lib/tmux-persist), header rescoped to say exactly that. Same mechanism, correct taxonomy. Old units removed, state migrated, re-verified live (5 emo + 3 wizard sessions snapshotted). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:42:52 +00:00
Viktor Barzin	2e4f48f3fc	workstation: tmux sessions survive devvm reboots (save timer + boot restore) Viktor: emo's open web-terminal sessions must persist across reboots. Claude conversations were already durable on disk; the volatile part was the tmux wiring (which named session runs which conversation). t3-tmux-sessions save (5-min timer) snapshots every roster user's sessions to /var/lib/t3-tmux-state/<user>.tsv — conversation uuid taken from argv --resume (self-sustaining once restored) or the newest transcript in the cwd-slug project dir created after process start (fresh launcher sessions; claude does NOT hold its transcript fd open, so fd-sniffing was a dead end). t3-tmux-sessions restore (boot oneshot, also safe after partial loss) recreates missing sessions with claude --resume <uuid>. Reconciler self-heals both units' enablement. Verified live: emo's 5 sessions snapshotted with correct uuids; killed R730-cooling -> restore brought it back resuming the same conversation (context meter identical); other sessions untouched. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 16:03:24 +00:00
Viktor Barzin	59a531b8e0	coredns: pods get internal split-horizon answers for viktorbarzin.me [ci skip] Forward the viktorbarzin.me:53 pod block to the Technitium ClusterIP (10.96.0.53, same as the .lan block) instead of 8.8.8.8/1.1.1.1. Pods become ordinary internal clients (CNAME -> apex -> live Traefik LB; mail -> 10.0.20.1), fixing the 27 non-proxied [External] uptime-kuma monitors that rode the TP-Link NAT loopback (hard-down since 06-09; loopback refuses flows whose source equals the reflection target, which all pfSense-SNAT'd cluster traffic does). Enabled by re-testing a stale premise: on k8s 1.34 pods DO reach the ETP=Local Traefik LB IP (kube-proxy short-circuits in-cluster traffic to LB IPs; verified from pods on three non-Traefik nodes) — re-verify after major k8s upgrades; canary = [External] fleet going red. The NAT-layer alternatives (pfSense rdr, SNAT-drop) were rejected: both fight return-path asymmetry and deepen TP-Link dependency. Verified in-pod: immich -> .203 + HTTPS 200, mail -> 10.0.20.1, forgejo -> Traefik ClusterIP (pin kept for Technitium-outage resilience). Proxied [External] monitors now test the internal path — true edge fidelity moves to the external vantage (ha-london, next fix). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 16:21:34 +00:00
Viktor Barzin	35c89fa90c	workstation: managed Claude config self-deploys from the repo [ci skip] Viktor's claudeMd edits must keep reaching every user now that emo is out of the shared tree. Two reconciler additions: - sync_managed_config: installs scripts/workstation/managed-settings.json to /etc/claude-code whenever the repo copy changes — editing the org claudeMd is now edit + commit, no manual install step - refresh_codex_mirror: regenerates each user's ~/.codex/AGENTS.md (static mirror of the claudeMd; header-guarded so user-customized files are never clobbered) Verified live: corrupted emo's mirror -> reconcile restored it; wizard's stale mirror refreshed; in-sync managed config no-ops. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 16:03:24 +00:00
Viktor Barzin	8cfd0e5e5c	Merge forgejo/master: reconcile diverged lineages [ci skip] Local checkout carried the 2026-06-10 DNS/registry architecture series (pfSense forward-zone, CoreDNS viktorbarzin.me:53 carve-out, nodes stock) + vzdump/nfs-mirror/workstation-rebuild commits that never reached the canonical remote, while forgejo master received the emo-access series via isolated worktrees. Viktor asked to merge. Conflict resolutions (newest iteration wins in each file): - stacks/forgejo/cleanup.tf: LOCAL — dry_run=true (2026-06-10 revert after live retention orphaned OCI indexes; remote had 06-09 enable) - .claude/CLAUDE.md, docs/architecture/backup-dr.md: LOCAL — final registry/DNS architecture + implemented vzdump alerts - scripts/workstation/setup-devvm.sh: LOCAL — pinned-version, reproducible-rebuild refactor (kubelogin pin, restructured staging) - scripts/workstation/managed-settings.json: FORGEJO — the allow-then-audit claudeMd (matches /etc deployment byte-for-byte) - scripts/t3-provision-users.sh: FORGEJO comment; refresh_locked_clone intact [ci skip]: all stack changes in the local lineage were applied live this morning — CI would re-walk 100+ stacks via the modules/ fallback for zero state change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:21:50 +00:00
Viktor Barzin	a34f9ff3b8	docs: infra Woodpecker repo-82 ops — in-cluster webhook, secret parity, empty-commit gotcha [ci skip] Emo's first direct pushes surfaced three latent CI issues, all fixed out-of-band today and recorded here: webhook deliveries to ci.viktorbarzin.me timing out on the public-IP hairpin (hook now targets the in-cluster woodpecker-server service), repo 82 registered without the repo-scoped secret set (cloned from repo 1 in the DB), and empty commits compiling every workflow so missing secrets hard-error. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:09:17 +00:00
Emil Barzin	63161ef3a5	test: final audit-pipeline verification Repo-82 Woodpecker secrets were missing (repo-1 set cloned over) and the webhook now targets the in-cluster service. This push should run the full pipeline: Slack audit ping + no-op apply.	2026-06-10 15:07:15 +00:00
Emil Barzin	619b7608fa	test: verify audit pipeline fires on emo push Second verification: the Forgejo->Woodpecker webhook was timing out on the public-IP hairpin (first test push fired no pipeline), so it now targets the in-cluster Woodpecker service. This push should produce a pipeline with the notify-nonadmin-push Slack step.	2026-06-10 15:03:48 +00:00
Emil Barzin	0f45585b53	test: verify emo direct master push (allow-then-audit) Viktor granted emo direct push to master on 2026-06-10 — any change allowed, tracked via commit messages + the Slack audit feed. This empty commit verifies the whitelist and exercises the new notify-nonadmin-push CI step end-to-end.	2026-06-10 14:54:04 +00:00
Viktor Barzin	a49d1eadf6	workstation: emo direct master push — allow-then-audit [ci skip] Viktor: emo may make any change; what matters is tracking what changed and why. ebarzin added to master push+merge whitelists (force-push stays disabled — append-only history). Tracking enforced three ways: - agent instructions (managed claudeMd + AGENTS.md): commit body MUST carry the user's plain-language intent; commits land on master directly; [ci skip] forbidden for non-admins - new notify-nonadmin-push step in .woodpecker/default.yml: Slack message for every non-admin master push (admin pushes silent) - PR flow remains the fallback for non-whitelisted users Accepted consequence (informed): emo's pushes auto-apply changed stacks via CI. Offboard runbook gains whitelist-removal step. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 14:53:43 +00:00
Viktor Barzin	6d8773573c	workstation: agent-driven contribute flow for non-technical users [ci skip] emo can't use git — his agent must do all VCS mechanics invisibly. Managed claudeMd (every session, top precedence) now instructs agents: commit -> push <os-user>/<topic> branch -> open PR via Forgejo API (user's PAT from ~/.git-credentials) -> back to clean master -> tell the user in plain words it's submitted for review. AGENTS.md carries the full recipe with the curl call. Verified live as emo: PR #1 opened (HTTP 201, write:repository scope suffices) and closed via his PAT. Deployed to /etc/claude-code/managed-settings.json; codex AGENTS.md mirrors for emo + ancamilea regenerated from the new claudeMd. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 10:12:26 +00:00
Viktor Barzin	2e5af5dc0e	workstation: keep non-admin infra clones fresh (hourly + at launch) [ci skip] Non-admins (emo) need current master without manual pulls. Two layers: - t3-provision-users reconcile gains refresh_locked_clone: fetch all remotes + ff-only master, guarded (on master, clean tree, upstream set); dirty/diverged clones are left alone with a WARN. - start-claude.sh freshens ~/code at session launch, 15s-capped so an offline remote never delays the session. Verified live on emo's clone: stale clone ff'd to tip by the reconciler; launcher snippet ff's when clean and refuses while a dirty file exists. Deployed to /usr/local/bin/t3-provision-users, /etc/skel/start-claude.sh, and emo's launcher. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 09:41:38 +00:00
Viktor Barzin	5d9417fbaa	workstation: emo contribute access + Phase-5 cutover done; gate master (push=apply) [ci skip] ADR-0004's premise was wrong: pushing master fires the Woodpecker apply pipeline (require_approval=forks only), so master pushes ARE deploys. Added Forgejo branch protection on master (push/merge whitelist=viktor, deploy keys allowed); non-admins contribute via branches + PRs. emo (ebarzin): write collaborator on viktor/infra, PAT in ~/.git-credentials, forgejo remote + upstream in his locked clone. Phase-5 finished: code-shared removed; ~/.claude symlinks kept (they ARE the skel shared-base mechanism — plan step 4c obsolete). Offboard runbook: revoke PAT + collaborator + group steps added. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 09:30:41 +00:00
Viktor Barzin	a1b7b0ca53	forgejo retention: revert to DRY_RUN — first live run orphaned OCI indexes [ci skip] The keep-set (newest 10 versions + latest + cache tags) treats multi-arch/attestation index CHILDREN — separate untagged sha256 versions — as deletable: for images not rebuilt recently they sort outside the newest-10 window and were pruned while their kept parent index survived. kms-website :latest and :dfc83fb children 404'd (RegistryManifestIntegrityFailure, caught by forgejo-integrity-probe within hours; deployed tag a794d1a unaffected). Healed: :latest re-pointed at the intact a794d1a index (also the newest commit), corrupt :dfc83fb version deleted, probe re-run clean (0 failures / 22 repos / 63 tags / 59 indexes). DRY_RUN=true applied live. Re-enable only with a container-aware keep-set — options in the post-mortem. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 09:22:47 +00:00
Viktor Barzin	e49c91e60c	monitoring: VzdumpBackup{Stale,NeverRun,Failing} alerts for the new VM-image backup vzdump-vms pushes vzdump_last_{run,success}_timestamp + vzdump_last_status to Pushgateway job vzdump-backup, but nothing alerted on them — a stopped/failing VM backup would be silent (exactly how the nfs-mirror reaping went unnoticed until I re-verified). Add the trio to the 3-2-1 group in prometheus_chart_values.tpl, mirroring the LVM/pfSense/nfs-mirror alerts. Stale = >~50h since last success. NOT [ci]-applied: this is a Terraform stack change — arms on the next `scripts/tg apply` of the monitoring stack (metrics already flow, so it arms immediately once applied). Admin-gated apply per org policy. [ci skip] Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:10:46 +00:00
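
The alert trio described in e49c91e60c can be sketched outside Prometheus as a small decision function. This is only an illustration under stated assumptions: the function name `vzdump_alerts`, the convention that a status of 0 means success, and the exact 50-hour cutoff are hypothetical (the commit only says "Stale = >~50h since last success"); the real rules live in prometheus_chart_values.tpl and may differ.

```python
from typing import List, Optional

# "~50h" comes from the commit message; treating it as exactly 50h is an assumption.
STALE_AFTER_SECONDS = 50 * 3600


def vzdump_alerts(
    now: float,
    last_run: Optional[float],
    last_success: Optional[float],
    last_status: Optional[int],
) -> List[str]:
    """Return the names of the alerts that would fire for one snapshot of metrics.

    Arguments mirror the pushed metrics vzdump_last_run_timestamp,
    vzdump_last_success_timestamp and vzdump_last_status; None stands for
    "metric absent from Pushgateway".
    """
    if last_run is None:
        # Assumed meaning of NeverRun: the job has never pushed a run timestamp.
        return ["VzdumpBackupNeverRun"]

    alerts: List[str] = []
    if last_success is None or now - last_success > STALE_AFTER_SECONDS:
        alerts.append("VzdumpBackupStale")
    # Assumed encoding: 0 = success, anything else = failure.
    if last_status is not None and last_status != 0:
        alerts.append("VzdumpBackupFailing")
    return alerts
```

For instance, a run whose last success is 51 hours old yields only the Stale alert, while a recent success followed by a failed run yields only Failing.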

4170 commits