infra

Author	SHA1	Message	Date
Viktor Barzin	3a59f4a8bf	workstation: per-user memory caps + systemd-oomd backstop on devvm The shared devvm keeps overloading and had to be hard-killed again today (2026-06-22): a runaway in one user's ssh/tmux session (a 10G ugrep, plus stacked max-effort agents) grew unbounded, spilled into the disk swap, and swap-thrashed the throttled virtual disk into an IO storm until the box wedged. Root cause: ssh/tmux work runs under user-<uid>.slice, left memory-uncontained by the explicit 2026-06-10 "swap-only" decision, while only the t3-serve tree was capped. So one user could starve everyone. This bounds every user on BOTH trees (MemoryHigh=12G, MemoryMax=16G, MemorySwapMax=0 so work OOMs locally at its ceiling instead of thrashing swap), adds a systemd-oomd PSI backstop that sheds the single worst work cgroup under box-wide pressure while leaving system.slice (sshd/services/your way in) protected, gives system.slice a fair-share CPU/IO priority edge, and routes docker containers into a capped, oomd-policed docker.slice so they can't dodge the caps or mis-target oomd. All durable in setup-devvm.sh so a VM rebuild reproduces them; systemd-oomd added to packages.txt. Applied live and verified: oomctl shows the backstop armed (not dry-run) on the work slices with system.slice protected; a capped-balloon stress test OOM-killed locally at the ceiling with swap flat (no thrash). Post-mortem: docs/post-mortems/2026-06-22-devvm-mem-io-overload-containment.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 10:25:09 +00:00
Viktor Barzin	2169e0de5f	workstation: harden memory hooks — prune dead plugin refs + homelab-CLI-only store wire-memory-hooks.py now PRUNES any settings.json hook still pointing at the retired claude-memory plugin (plugins/claude-memory/hooks/) before the additive pass. install_memory() rm -rf's that dir, so those entries are dangling — and a missing UserPromptSubmit hook exits 2, a BLOCKING error that erases the prompt and froze emo's sessions (2026-06-22). The plugin shares basenames with the homelab hooks, so the old additive-only logic saw the dead plugin path as "already present" and skipped installing the real ~/.claude/hooks/ copy; pruning first fixes that. Verified against emo's exact original config: yields the correct 4-hook set, drops the dead PermissionRequest entry, idempotent on rerun. auto-learn.py now stores via the `homelab memory` CLI only — dropped the direct HTTP path and the local-SQLite fallback (memory is homelab-CLI-only per Viktor; never local files). No-ops silently when no API key is in env (e.g. ancamilea). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 09:24:42 +00:00
Viktor Barzin	a0897de7c3	workstation: document homelab-memory hooks + provisioner self-deploy [ci skip] multi-tenancy.md never mentioned the homelab-memory hooks rollout and still listed claude_memory credential injection as purely "future". Document what is actually true now: install_memory provisions the recall/auto-learn/compaction hooks per user, the provisioner binary self-deploys from the repo (step 0), the set -e abort fix, and that the hooks no-op without a MEMORY_API_KEY in env (CLI defaults the URL) — emo has a key, ancamilea is keyless until one is minted. Also clarify setup-devvm.sh's binary install is now bootstrap-only (ongoing edits self-deploy). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 08:04:38 +00:00
Viktor Barzin	92f35550f2	workstation: self-deploy t3-provision-users from the repo each reconcile [ci skip] Root cause of emo's lost memory: nothing redeployed /usr/local/bin/t3-provision-users except the manual setup-devvm.sh, so the homelab-memory rollout (44562535/9aa2438e, Jun 21) sat committed-but-undeployed for a day — the hourly reconcile kept running the pre-memory binary and never wired the new memory hooks for emo/anca. Close the gap the same way the script already treats managed-settings.json and start-claude.sh (sync_managed_config / deploy_user_launcher): the repo is the authoring surface. At the top of the run, if the repo copy differs from the deployed binary, install it and re-exec the fresh one. Guards: a re-exec env flag (no loop), bash -n (never deploy a broken script), DRY_RUN (no mutation), cmp (no churn when unchanged). Verified across all four paths in isolation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 08:02:31 +00:00
Viktor Barzin	0b11a28d66	workstation: stop install_memory aborting the reconcile under set -e install_memory (added in `44562535`) ended with `[[ -d <plugin-dir> ]] && rm && log` and guarded a chmod with a bare `[[ -f settings ]] && chmod`. When the plugin dir or settings file is absent — the normal case for users who never had the claude-memory plugin — those return non-zero, and under `set -euo pipefail` the function returns non-zero and kills the whole hourly reconcile after the FIRST user, before the rest are processed. It never fired before because the rollout was committed but the deployed /usr/local/bin/t3-provision-users was never updated, so install_memory had never run. On first real run it aborted right after ancamilea, so emo (and wizard) never got their memory hooks wired — the reason emo's sessions lost memory. Wrap the cleanup in an if-block, guard the chmod, and end the function with return 0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 07:59:47 +00:00
Viktor Barzin	9aa2438e75	workstation: run wire-memory-hooks as root, not runuser (fix non-admin wiring) install_memory ran the JSON-merge helper via 'runuser -u $user', but the helper lives under the admin's mode-700 home ($WORKSTATION_DIR) which non-admin users can't traverse -> wiring silently failed for emo/anca (hooks copied but never wired into settings.json). Run the helper as root (it reads both the repo helper and the user's home) and chown the result back to the user. Verified by the live all-users rollout: emo + anca now wired correctly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 17:45:36 +00:00
Viktor Barzin	44562535a2	workstation: provision homelab-memory hooks for all users (retire claude-memory MCP) Roll the wizard MCP->homelab-CLI memory migration out to every devvm user. Adds install_memory() to t3-provision-users.sh (mirrors install_playwright: per-user, idempotent, if-absent, as-the-user): installs the 4 memory hook scripts into ~/.claude/hooks, wires them into settings.json additively (wire-memory-hooks.py never touches env / the per-user MEMORY_API_KEY), and removes ONLY the claude_memory MCP + plugin if present. Reuses each user's existing key (no minting; per-user isolation stays deferred per the 2026-06-07 design). The homelab CLI hits the same remote HTTP API the MCP used; recall runs via the homelab-memory-recall.py UserPromptSubmit hook. Shared instructions (rules/skills symlinked from base; root+infra CLAUDE.md) already cover all users. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 17:42:42 +00:00
Viktor Barzin	334d8fee5d	setup-devvm: install + enable t3-migrate-idle (lib, script, units, timer) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 12:36:13 +00:00
Viktor Barzin	3cf09a0fe3	t3-migrate-idle: systemd oneshot + overnight timer (01:00-05:40, /20) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 12:35:19 +00:00
Viktor Barzin	af9f7be297	t3-migrate-idle: drain deferral markers when safe For each /var/lib/t3-autoupdate/deferred/<user> marker: skip+clear if the unit is gone or was already restarted after the deferral; otherwise, when the idle gate is satisfied, take a pre-restart backup and restart via the shared safe_restart_unit, clearing the marker on verified success. DRY_RUN logs decisions without acting. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 12:34:44 +00:00
Viktor Barzin	06e400522f	t3-migrate-idle: idle gate (no in-flight turn + quiet buffer), TDD The gate reads t3's state.sqlite: safe to restart only when zero threads have an active_turn_id AND the most-recent thread activity is older than the quiet buffer (default 15m). Fail-closed on any parse/query error. Pure-bash unit tests cover the boundaries against fixture DBs (no root/bats/Docker). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 12:34:11 +00:00
Viktor Barzin	de97696ff0	t3-autoupdate: source the shared safe-restart lib + record deferrals Behavior-preserving refactor: the per-unit restart/recover body and small helpers now come from t3-safe-restart.sh (one audited copy). Additionally, when a unit is deferred for an active agent, write a marker under /var/lib/t3-autoupdate/deferred/ so the new idle migrator can drain it later; clear the marker on a successful restart. Install/health-gate/canary logic is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 12:32:57 +00:00
Viktor Barzin	2ab5b94748	t3-safe-restart: extract shared safe-restart library from t3-autoupdate Pull the per-unit backup->restart->verify->recover routine (and the small helpers it needs) out of t3-autoupdate.sh into a sourced library, so a second job (the upcoming idle migrator) can reuse the exact same audited recovery path instead of forking safety-critical code. safe_restart_unit returns non-zero on failure (after recovery+freeze) rather than exiting, so callers control flow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 12:28:53 +00:00
Viktor Barzin	7bd4612edf	ci: scripts/tg waits out a contended state lock (-lock-timeout) The infra CI pipeline was failing often — ~38% of the last 50 runs didn't succeed. The single biggest cause (8 of 19 non-successes) was Tier-1 stack applies dying instantly with "Error acquiring the state lock". Tier-0 stacks already degrade gracefully (Vault advisory lock → the pipeline skips a locked stack). Tier-1 stacks have no such fallback: they rely on terraform's pg-backend pg_advisory_lock, and scripts/tg ran terragrunt with no -lock-timeout, so any concurrent lock holder was fatal — a Woodpecker-killed run whose PG lock wasn't reaped yet (PL266 killed → PL267 failed the same second), a human/agent applying locally, or the daily drift `plan`. Fix: scripts/tg now passes -lock-timeout (default 5m, override TG_LOCK_TIMEOUT) on every state-locking verb (plan/apply/destroy/refresh), so a contended lock WAITS for the holder to finish instead of failing. -auto-approve behaviour for non-interactive applies is unchanged. Central wrapper change → covers CI, plus local human/agent applies; no CI image rebuild (tg is read from the repo). Adds a hermetic pytest (stub terragrunt + preset PG_CONN_STR) pinning the arg-injection. Docs updated in AGENTS.md + .claude/CLAUDE.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 00:15:39 +00:00
Viktor Barzin	600f1f933c	Create Claude auth state directories The first live renewal run showed systemd could not create state beneath a read-only home sandbox. Provision each user's writable state directory before enabling the timer so automatic renewal can run.	2026-06-20 20:25:55 +00:00
Viktor Barzin	ff67e9d422	Fix workstation package manifest parsing The approved Claude token renewal deployment could not run because setup-devvm passed inline package comments to apt as package names. Strip inline comments so the persisted all-user setup remains reproducible.	2026-06-20 20:22:05 +00:00
Viktor Barzin	5549fc3672	Add per-user Claude auth renewal Each workstation user needs a continuously valid Claude token under their own Enterprise identity. Store only that user's OAuth state in an isolated Vault path, renew and verify it automatically, recover from Vault when possible, and alert when interactive SSO is required.	2026-06-20 20:10:40 +00:00
Viktor Barzin	66caa0bf7f	homelab: v0.1 docs, distribution wiring, and version Completes v0.1: documentation, build/install path, and version stamping. - cli/VERSION (v0.1.0) stamped into the binary via ldflags. - cli/README.md rewritten as the homelab overview (verbs + tiers, manifest, build, the preserved legacy webhook use-cases). - docs/adr/0004-0006: why homelab exists (grown in place from infra/cli, not a separate repo), v0.1 scope + everything-allowed/tiers-recorded, and the work/tf behaviour (native worktree entry, verification-gated auto-land, presence-coupled apply). - setup-devvm.sh builds cli/ -> /usr/local/bin/homelab each provisioning run (t3-dispatch pattern), so every devvm user gets the current binary. - AGENTS.md: discovery pointer under Common Operations. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 19:25:51 +00:00
Viktor Barzin	c04efa3d3a	k8s-version-upgrade: move detection to nightly 23:00 UTC (overnight upgrades) Disruptive node drains should run when the cluster is idle. Move the k8s-version-check detection CronJob from 12:00 UTC (noon) to 23:00 UTC (00:00 London) — overnight, low usage, and clear of the kured OS-reboot window (01:00-05:00 UTC) so the two drain pipelines never overlap. (Viktor, 2026-06-17.) - stacks/k8s-version-upgrade/main.tf: var.schedule default 0 12 → 0 23 * * *. - scripts/upgrade_state.sh: next_scheduled_run_utc now computes the 23:00 slot (was next_daily_noon_utc). - docs (runbook, architecture) + upgrade-state SKILL: schedule references updated to 23:00 UTC nightly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 18:16:32 +00:00
Viktor Barzin	bfb86e653f	k8s-version-upgrade: ignore CoreDNS preflight on `kubeadm upgrade plan` too The prior commit added the CoreDNS ignore/skip flags only to `kubeadm upgrade apply`, but `kubeadm upgrade plan` runs the SAME CoreDNS preflight. Once master's kubeadm binary is on the target version (the first attempt's apt step already bumps it), both plan calls fail on the Keel-drifted CoreDNS 1.12.4 under set -euo pipefail and abort: - preflight Job step 4 (upgrade-step.sh) — `plan` output is grepped for the target version; the failing pipeline killed the whole preflight. - update_k8s.sh master path line 85 — bare `plan` before the apply. Both now pass --ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins. Verified read-only on master: plan exits 0 and still emits "kubeadm upgrade apply v1.34.9". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:49:06 +00:00
Viktor Barzin	037a609f27	k8s-version-upgrade: unblock 1.34.9 — skip kubeadm CoreDNS addon + busybox-date fix The 1.34.9 master upgrade hard-failed `kubeadm upgrade apply` preflight: CoreDNS is at v1.12.4 (Keel auto-bumped it 1.12.1 -> 1.12.4 on 2026-05-26 via a stale kube-system out-of-band annotation), and 1.12.4 is ahead of kubeadm 1.34.9's bundled corefile-migration table ("start version not supported"). - scripts/update_k8s.sh: master `kubeadm upgrade apply` now runs with `--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins --skip-phases=addon/coredns`. A dry-run proved --ignore ALONE would overwrite our custom split-horizon Corefile with kubeadm's default AND downgrade the image; --skip-phases leaves CoreDNS 100% untouched while the control plane upgrades. CoreDNS is pinned off Keel (keel.sh/policy=never) to stop the drift. - stacks/k8s-version-upgrade/scripts/upgrade-step.sh: fix the preflight quiet-baseline (settle-window) check, which silently no-op'd on the ghcr claude-agent-service image's busybox `date` (can't parse ISO8601). Now tries GNU then busybox `-D`, and warns+skips on parse failure (no silent fail-open). - docs: runbook + architecture document the CoreDNS handling. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:45:05 +00:00
Viktor Barzin	dfa1a12a86	k8s-version-upgrade: retry failed phases + surface wedged chain (fix 5-day silent stall) The 1.34.9 patch auto-upgrade sat stuck for 5 days without anyone knowing. On 2026-06-12 a transient critical alert (the ttyd web-terminal probe on the devvm) was firing when the daily detection ran; the preflight's "halt on any critical alert" gate aborted it, so the preflight Job Failed (backoffLimit=1). Two design gaps then turned that blip into a multi-day wedge: * the detection guard and spawn_next only checked whether the phase Job EXISTED, not whether it succeeded — and the Failed Job lingers 7 days via ttlSecondsAfterFinished, so every daily run skipped re-spawning it; * the abort happens before the in-flight metric is pushed, so neither K8sUpgradeStalled nor upgrade_state.sh could see it — the pipeline reported "never ran" while actually being stuck. Fixes: D1 retry-on-failure: detection CronJob (main.tf) and spawn_next (upgrade-step.sh) now delete + re-spawn a terminally-Failed phase Job instead of skipping it, so a transient gate self-corrects next cycle rather than wedging the pipeline for a week. D2 WebterminalTtydUnreachable critical -> warning: a devvm developer web-terminal is not cluster infrastructure and must not block upgrades. D3 observability: new K8sUpgradeChainJobFailed alert (kube_job_status_failed in k8s-upgrade ns) and upgrade_state.sh now flags a Failed chain Job as "chain failed" — closing the pre-in-flight blind spot so a wedge is visible immediately. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:07:36 +00:00
Viktor Barzin	8a2a3d9eca	Merge remote-tracking branch 'origin/master' into wizard/reconcile-mirror # Conflicts: # scripts/t3-provision-users.sh	2026-06-16 22:32:43 +00:00
Viktor Barzin	0a6ed4b2fe	workstation: per-user playwright browser MCP for all users, reproducible from git Viktor asked that the playwright browser MCP be available for every devvm user in every directory, with each user running their own server and multiple concurrent sessions per user. Before this, playwright was hand-set-up per user (~/.config/systemd/user/ playwright-mcp.service on 8931/8932/8933) and only wizard was actually wired — emo's and anca's servers ran but their ~/.claude.json had no playwright entry, so their Claude never connected. None of it was reproducible from git (units, refresh script, and the Vault snapshot token lived only in user homes), so a devvm rebuild would silently lose it. This makes it reproducible and fixes the unwired users: - roster_engine.py: sticky per-user PLAYWRIGHT_PORT (PLAYWRIGHT_BASE_PORT=8931, allocated for every roster user incl. the admin), emitted in the derive JSON. - scripts/workstation/playwright/: system-level TEMPLATE units (playwright-mcp@.service + playwright-snapshot-refresh@.{service,timer}, User=%i — system manager, so no systemd --user / linger) + the refresh script. @playwright/mcp pinned to 0.0.76 (avoids the @latest silent-fleet-roll footgun, same rationale as T3_PIN). - setup-devvm.sh: install the templates + script (9e); stage the chrome-service snapshot bearer token from Vault to a root file (8c) — the hourly root reconcile has no Vault token, mirrors the Claude OAuth staging in 8a. - t3-provision-users.sh: install_playwright() (ALL tiers incl. admin) writes PLAYWRIGHT_PORT, seeds the token if-absent, wires the user-scope ~/.claude.json by running `claude mcp add` AS the user (clobber-proof + if-absent, so it fixes existing/new/admin without rewriting a populated config), and enable --now's the instances (idempotent, never restarts a running server). Also hardened the section-1 .env scan to skip the new playwright-.env files (no T3_PORT -> grep no-match would abort under set -e -o pipefail). 
- Docs: chrome-service-snapshot runbook (new Provisioning section + system-unit commands), multi-tenancy.md, and the 2026-06-07 plan Task 2.3. Supersedes the hand-made per-user --user units (one-time idle-gated migration to follow on the live host). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:33:47 +00:00
Viktor Barzin	044444d328	cluster-health: helm check #18 catches pending/failed releases (helm list -a) check_helm_releases used `helm list` without -a, which HIDES pending-upgrade and failed releases — so on 2026-06-16 check #18 reported "All deployed" while the prometheus release sat in pending-upgrade for ~4 days, silently blocking every monitoring terragrunt apply (frozen alert/rule config). Add -a to surface them and flag pending-* (FAIL, blocks applies) + failed (WARN); deployed/uninstalled/ superseded stay green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 15:39:06 +00:00
Viktor Barzin	e74f4208f5	t3-backup-state: retention 14 -> 6 (bound devvm root fs) wizard's state.sqlite grew to ~1.1GB and the new gated nightly tracker adds a pre-bump snapshot per bump on top of this daily one; 14 x ~1.1GB would fill the devvm root fs (was trending to ~16GB of wizard backups on a disk with ~9GB free). 6 is ample — rollback only ever needs the most recent pre-bump backup. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 14:26:03 +00:00
Viktor Barzin	cdd9ecd199	t3: docs for the gated nightly tracker (runbook, post-mortem, service-catalog) Phase 4 docs for the enforcer -> gated-tracker change: - runbook t3-version-bump.md: rewritten around the tracker — how each bump is gated, plus freeze/revert/pin/dry-run/manual-rollback ops. - post-mortem 2026-06-09: append the deliberate 2026-06-16 reversal and how the gates close each named root-cause/lesson (historical sections left intact). - service-catalog t3 row: "PINNED 0.0.24 enforcer" -> gated nightly tracker; replace the stale "auto-pair 401-broken on 0.0.26" note (re-verified healthy 2026-06-16, cookieless -> 302 + t3_session). - t3-provision-users.sh step 5b comment: enforcer -> tracker; note Persistent dropped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 11:33:49 +00:00
Viktor Barzin	36521839fc	t3: gated nightly tracker (replaces pinned enforcer) + drop timer Persistent Phase 2 of "track t3 nightly, accept the risk, but make sure session auth works and revert if it breaks". Rewrites the daily t3-autoupdate from a pinned-version enforcer into a NIGHTLY TRACKER that gates every bump so a bad build self-heals instead of repeating 2026-06-09: - follows the t3@nightly npm dist-tag (T3_TRACK; T3_PIN still works as a hard freeze; /etc/t3-autoupdate.freeze is the manual revert switch); - downgrade-guard (the nightly tag is mutable — never move backward) + channel sanity (target must be a -nightly. build); - pre-bump per-user state.sqlite backup (online VACUUM INTO) BEFORE install, so rollback is a restore not sqlite surgery; - health-check now SEEDS a throwaway instance with a COPY of a real POPULATED state.sqlite, exercising the forward MIGRATION (the actual 2026-06-09 failure class) + the real mint->exchange->t3_session pairing handshake before trusting a build. Scratch dir is on /var/tmp (disk), not the 2G tmpfs /tmp; - canary rollout: restart idle instances ONE AT A TIME, verify pairing through the real dispatch after each, and on the first failure roll back (binary + that user's DB from the pre-bump backup) AND self-freeze so it can't re-flap onto bad builds. Active-agent instances are deferred, never killed. Rollback target is the recorded LAST-GOOD, not "whatever was installed"; - DRY_RUN mode (T3_DRY_RUN=1) previews the gate against a temp-prefix install — validated: 0.0.28-nightly.20260616.571 PASSES the populated-DB migration gate. timer: drop Persistent=true (a missed 04:00 must not fire a real bump on boot mid-day with users active — a 2026-06-09 contributing factor). setup-devvm.sh: install t3@nightly on fresh boxes (no state to break), in sync. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 10:08:12 +00:00
Viktor Barzin	994d305d04	t3: session-auth detection for the gated nightly tracker (dispatch fallback logging + Loki alerts) Before auto-tracking t3 nightly builds (Viktor's call, risk accepted), stand up the detection that was missing on 2026-06-09 — when an auto-pulled nightly broke pairing for ALL users and nothing alerted. Viktor's explicit requirement: make sure session auth keeps working and revert if the pairing fallback/failure rate climbs. This is phase 0 (detection) of that work. - t3-dispatch: exchangeCredential now reports WHICH pairing endpoint answered, and autoPair logs every outcome (paired user=.. endpoint=.. fallback=..) — so the real-user browser-session->bootstrap fallback rate is observable. A non-zero rate flags that a build moved the pairing API (the 2026-06-09 class). - Loki ruler alerts (devvm journal -> Alertmanager -> Slack): T3PairingBroken (real users failing to pair), T3PairFallbackHigh (build moved the pairing API), T3AutoUpdateRolledBack / RollbackFailed / Frozen (enforcer outcomes). Closes the post-mortem's open "nothing monitors end-to-end pairing" detection gap. The existing t3-probe only checks GET /api/auth/session==200, which stays 200 even when pairing is dead, so it never caught the outage class. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:56:55 +00:00
Viktor Barzin	b0e8e3599f	nfs-mirror: exclude SQLite WAL/SHM sidecars + treat rsync exit 24 as success NfsMirrorFailing fired ~13% of nights (3/23 runs, all rsync exit 24). Root cause: calibre-web-automated keeps a WAL-mode SQLite queue.db on /srv/nfs, whose -wal/-shm sidecars are created/checkpointed/deleted constantly and vanish between rsync's file-list scan and the transfer ("file has vanished" -> exit 24). The mirror actually completes every run; only transient files disappear. Two fixes: (1) exclude -wal/-shm/*-journal -- these must never be in a raw mirror anyway (a WAL without an atomic .db snapshot is useless to restore; daily-backup makes the consistent SQLite copies). (2) Treat rsync exit 24 as success-with-warning so the run still appends to the offsite manifest (a code-24 night previously skipped that, delaying those changes to the monthly full sync) and the alert stops false-firing. Deployed to the PVE host via scp to /usr/local/bin/nfs-mirror (host script, not TF). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:34:22 +00:00
Emil Barzin	1ba453c65d	fan-control docs: sync runbook/env/service/design to the HA-actuator + anti-flap model The committed docs still described the 2026-06-04 presence-aware daemon. Bring them in line with what is actually deployed: HA computes the setpoint, the host is a thin actuator (COMMAND_ENTITY/STALE_SECS/HA_GRACE_SECS), additive bias, anti-flap hold-last, and the new HA readout sensors (command/equilibrium/ cpu_load/fan_speed_avg/fan_power_avg). Earlier doc edits were made in a clone lost in the workstation reshuffle; re-created here. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 08:11:48 +00:00
Emil Barzin	5bc3d27d1b	Merge remote-tracking branch 'forgejo/master' into emo/fan-control-ha-actuator	2026-06-16 08:08:27 +00:00
Emil Barzin	2cfe338419	fan-control: hold last command through transient HA losses (stop fan flapping) The actuator dumped the fans to Dell auto on every brief loss of the HA command (~14% of the time, every few minutes) — crashing them to the ~7100 rpm floor and bouncing back: the "fans surge then crash then surge" the owner reported. Causes: the command sensor's last_updated going >120s old whenever CPU temp sat flat (mis-read as stale), plus occasional unavailable blips. Fix: on a missing/stale command, HOLD the last applied % for up to HA_GRACE_SECS (300s) instead of falling back, and loosen STALE_SECS 120->1800 (staleness only happens at flat temp, where the held value is still valid). The 83C CPU CEILING on our own IPMI read stays the real overheat safety. Verified live: fallback 14% -> 0% over 8h, command std 16 -> 3, no more rpm floor crashes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 08:07:52 +00:00
Viktor Barzin	ef555c7e02	workstation: put ~/.local/bin on PATH so the launcher finds native claude Viktor hit "~/.local/bin is not part of the PATH". Root cause: the native claude binary lives in ~/.local/bin, but the terminal launcher (start-claude.sh) runs in tmux's NON-login bash env, which doesn't source the user's shell rc where the native installer put ~/.local/bin on PATH. So `command -v claude` failed there → the launcher's bootstrap re-ran the native installer → the installer printed the PATH warning. (Interactive zsh already had ~/.local/bin via the per-user installer rc edit, and t3-serve sets PATH in its unit — so only the terminal launcher was affected.) - skel/start-claude.sh: prepend ~/.local/bin to PATH near the top (guarded/idempotent), before the launch logic — so `claude` is found, no reinstall, no warning. - setup-devvm.sh: install /etc/profile.d/10-local-bin.sh — adds ~/.local/bin to PATH for all LOGIN shells machine-wide (SSH etc.), independent of the per-user installer rc edit (fresh-user-safe). zsh login picks it up via /etc/zsh/zprofile -> /etc/profile. - docs/architecture/multi-tenancy.md: documented the three PATH-injection points. Verified: guard adds-when-missing / no-dup-when-present; all scripts pass bash -n. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:20:03 +00:00
Viktor Barzin	eecd78233b	workstation: standardize on the native claude install (drop npm-global + npx) Question from Viktor: should claude run via the binary or npx? Answer: the native install is the recommended runtime (self-contained, self-updating ~/.local/bin/claude; installMethod=native) — and every existing user had already auto-migrated to it, leaving the npm-global copy empty and the npx fallback dead. "Leave only the recommended setup": - setup-devvm.sh: node is now installed ONLY for the t3 CLI; dropped the machine-wide `npm install -g @anthropic-ai/claude-code` (npm/npx is not the recommended runtime and just shadowed the per-user native installs). - t3-provision-users.sh: new per-user `install_user_claude_native` (runs the official https://claude.ai/install.sh AS the user, idempotent/skip-if-present) — provisions native claude for BOTH the terminal launcher and each t3-serve instance, replacing the npm bootstrap. - skel/start-claude.sh: launcher runs the native `claude` only; if missing it bootstraps via the native installer (was an `npx @anthropic-ai/claude-code` fallback). - docs/architecture/multi-tenancy.md: documented the native-only runtime model. node stays (the pinned t3 CLI is npm-global). Verified: native installer reachable + produces ~/.local/bin/claude 2.1.177; all three scripts pass bash -n + shellcheck. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:12:05 +00:00
Viktor Barzin	bb3f5f2329	workstation: stop the Claude Code onboarding wizard reappearing for terminal users All checks were successful ci/woodpecker/push/default Pipeline was successful Details emo reported being "logged out" on terminal.viktorbarzin.me: every new shell dropped him at the first-run "Choose the text style" wizard, even though he'd used many sessions and is in fact fully authenticated. Root cause is NOT a logout — ~/.claude.json is a single file that all of a user's concurrent claude processes (the ttyd terminal + their t3-serve instance + agent sessions) read-modify-write, and a stale writer periodically drops top-level keys, including hasCompletedOnboarding. That bounces the next interactive session back to onboarding; credentials are safe in the separate ~/.claude/.credentials.json (which is why T3 kept working). wizard's own ~/.claude.json showed the same key loss, so this hits any heavy multi-session user. Fix: - skel/start-claude.sh: ensure_onboarding() idempotently re-asserts hasCompletedOnboarding (+ lastOnboardingVersion) in ~/.claude.json right before launching claude. Merge-only (never clobbers other keys), runs as the user, and no-ops if jq is missing or the file is empty/corrupt. So even if the race drops the flag, the next launch restores it before claude reads it. - t3-provision-users.sh: deploy_user_launcher() re-copies skel/start-claude.sh into every non-admin home (copy-if-changed) on the hourly reconcile. /etc/skel only seeds the launcher at account creation, so without this the fix (and any future launcher edit) would never reach existing users. .tmux.conf is deliberately not re-copied — terminal-lobby appends a managed section to it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:37:59 +00:00
Emil Barzin	082bdfcc77	fan-control: thin actuator — HA computes the setpoint, host only applies it The R730 fan-control logic now lives entirely in Home Assistant: the curve thresholds, duty %, bias and asymmetric deadband, plus manual/lock, are set on the dashboard and published as sensor.r730_fan_command_pct. The host daemon is reduced to a thin actuator — it reads that one number each loop, validates it (numeric + not older than STALE_SECS) and applies it over IPMI. Removed the presence-aware two-curve logic and the garage-door coupling. Safety stays independent on the host: CPU>=CEILING, repeated IPMI failures, or HA unreachable/stale all hand the fans back to Dell auto. RPM telemetry now averages all 6 chassis fans. Deployed and verified live on pve (applies the HA command; fans follow). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-13 12:59:57 +00:00
Viktor Barzin	bdea34b992	offinfra-onboard: --dockerfile flag for non-root Dockerfiles All checks were successful ci/woodpecker/push/default Pipeline was successful Details ci/woodpecker/push/build-cli Pipeline was successful Details claude-memory-mcp's Dockerfile is at docker/Dockerfile, not repo root (infra#20 build failed: 'open Dockerfile: no such file or directory'). build.yml template gains file: {{DOCKERFILE}} (default ./Dockerfile). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 02:37:25 +00:00
Viktor Barzin	2f3c58dff1	claude-agent-service image -> ghcr across all five consumer stacks (infra#19) All checks were successful ci/woodpecker/push/build-cli Pipeline was successful Details ci/woodpecker/push/default Pipeline was successful Details GHA now builds+pushes ghcr.io/viktorbarzin/claude-agent-service (public package, anonymous pulls). Repointed: claude-agent-service (deployment + git-init/seed-beads-agent inits), claude-breakglass, ci-pipeline-health, beads-server CronJobs, k8s-version-upgrade (tag var 2fd7670d -> latest — the Forgejo registry lost that sha; node caches were the only thing keeping those CronJobs alive). publish-gate: vendor-contact emails (licensing@/legal@/security@/sales@) ruled license-boilerplate, not PII. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 01:47:54 +00:00
Viktor Barzin	8aba3a0179	offinfra-onboard --no-deploy; wealthfolio-sync image -> ghcr (ADR-0002 infra#25) All checks were successful ci/woodpecker/push/build-cli Pipeline was successful Details ci/woodpecker/push/default Pipeline was successful Details broker-sync is a CronJob-only consumer (no deployment): new --no-deploy mode skips Woodpecker registration and renders build.yml without the deploy job — :latest+Always CronJobs pick up builds on the next run. wealthfolio stack: ghcr-credentials pull secret + image base repoint. The wealthfolio-sync image regains a reproducible rebuild path. Closes: code-62tm Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 01:39:35 +00:00
Viktor Barzin	e696957ebf	ci: ancestor guard on DIFF_BASE; gate allowlists the owner's work email [ci skip] Restarted infra pipelines after master moved diffed in REVERSE and re-applied stale trees (pipeline 148 reverted payslip-ingest's fresh ghcr config — repaired by the wave-2 agent). Only trust CI_PREV_COMMIT_SHA when it is an ancestor of HEAD. publish-gate: viktorbarzin@meta.com is the owner's own work email (same class as the allowlisted personal domain), not blockable PII — unblocks infra#18. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 00:31:33 +00:00
Viktor Barzin	72b5843e4b	publish-gate: exclude package-lock + beads tracker from email heuristic; beadboard image base -> ghcr All checks were successful ci/woodpecker/push/default Pipeline was successful Details ci/woodpecker/push/build-cli Pipeline was successful Details infra#17: the gate flagged npm deprecation boilerplate (package-lock.json escapes the *.lock filter) and the upstream fork author's email in tracked .beads data — both already-public upstream content, ruled false positives. Lock files excluded properly; .beads moved to the eyeball inventory. beads-server stack: beadboard image base repointed (deployment image is KEEL-ignored; no CronJobs use it). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 23:52:07 +00:00
Viktor Barzin	6b0d42c7bc	publish-gate + tuya-bridge ghcr cutover prep (ADR-0002 infra#15) Some checks failed ci/woodpecker/push/build-cli Pipeline was successful Details ci/woodpecker/push/default Pipeline failed Details publish-gate: gitleaks + trufflehog (full history) + PII heuristics; CLEAN verdict gates any public flip, DIRTY = stays private. tuya-bridge: ghcr-credentials pull secret + image base -> ghcr; namespace added to the ghcr-credentials allowlist as a safety net (new ghcr packages default PRIVATE even from public repos — prune after visibility flip). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 23:12:02 +00:00
Viktor Barzin	51682ee939	offinfra-onboard: require clean clone + ff to forgejo master first [ci skip] Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 23:00:55 +00:00
Viktor Barzin	09bb0b50a1	offinfra-onboard: forgejo token fallback to ~/.git-credentials [ci skip] job-hunter's clone uses the credential-store helper (no token embedded in the remote URL, unlike f1-stream). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 22:59:32 +00:00
Viktor Barzin	6f41de71fa	offinfra-onboard: normalize Woodpecker repo to untrusted [ci skip] Trusted repos get netrc injected into every step container; the non-root bitnami/kubectl deploy step dies with '//.netrc: Permission denied' (hit live on f1-stream's reactivated old-era repo 10, which carried trusted=true; tripit 167 is untrusted and works). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 22:32:08 +00:00
Viktor Barzin	beac1b57a3	offinfra-onboard: re-activate inactive Woodpecker registrations [ci skip] Hit live on f1-stream: the old GHA-era ViktorBarzin/f1-stream registration (repo 10) existed but was deactivated; the lookup matched it and skipped registration, leaving the deploy POST pointed at an inactive repo. Now checks .active and re-activates in place via forge_remote_id. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 22:28:03 +00:00
Viktor Barzin	baff3d7477	offinfra-onboard: per-repo GHA->ghcr migration tool + f1-stream ghcr pull secret All checks were successful ci/woodpecker/push/default Pipeline was successful Details ci/woodpecker/push/build-cli Pipeline was successful Details ADR-0002 tracer bullet (infra#13), per Viktor's go-ahead. Idempotent script: GitHub mirror repo (create/unarchive/visibility), GHA secrets via gh, Forgejo push-mirror (sync_on_commit) + initial sync, Woodpecker mirror registration, renders build.yml/deploy.yml from templates (single-manifest provenance:false, svu semver to Forgejo, ghcr keep-10 retention, Slack notify-failure, manual-event deploy), removes the old in-cluster build pipeline, commits on the Canonical side. f1-stream stack gains the ghcr-credentials imagePullSecret (first consumer). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 22:21:22 +00:00
Viktor Barzin	624747cc46	workstation: default Claude model fable-5 → opus-4-8 for all devvm users All checks were successful ci/woodpecker/push/default Pipeline was successful Details ci/woodpecker/push/build-cli Pipeline was successful Details Viktor asked to make Opus the default for new Claude sessions — his own, Emo's, and Anca's — because Fable 5 is overkill for most daily tasks. The org-wide default lives in the managed-settings `model` key, which overrides each user's personal ~/.claude/settings.json model (and no per-user launcher passes --model anymore). So flipping this one value makes every user's NEXT session default to Opus 4.8; current sessions keep their model, and a per-session /model still overrides as before. The hourly t3-provision-users reconcile deploys it to /etc/claude-code/managed-settings.json within the cycle. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 20:59:03 +00:00
Viktor Barzin	df332b59e6	break-glass SSH: drop port-knock for exposed key-only :52222; version host config Viktor got locked out of the break-glass path (forgot the port-knock setup) and deleted the edge-router forwards, then asked to review and redesign it from scratch. Root cause of the lockout: the knock added no real security (key-only SSH is already brute-force-proof) and its only benefit — hiding the port — came at the cost of a circular dependency. The knock sequence lived only in in-cluster Vault, which is unreachable in the exact away/cold scenario break-glass exists for. So the unlock secret was unavailable precisely when needed. New model (self-contained, nothing to remember): plain key-only SSH on the Proxmox host's :52222, openly reachable. The edge router forwards WAN tcp/52222 -> 192.168.1.127:52222 (external port MUST equal internal on the TP-Link AX6000 - it rejects remaps; port 22 itself is reserved). The exposed port trusts only a dedicated break-glass key via `Match LocalPort` (a leak of any other root key does not grant internet access), rate-limited (iptables hashlimit) + fail2ban. - Removed knockd (package + config) and the legacy Synology SSH forward (ext 3333 -> .13:22, a needless WAN exposure the original plan wanted gone). - Fixed the fail2ban jail for Debian 13 (auth logs under sshd-session, not sshd - the stock journalmatch silently never banned). - Versioned the host config in scripts/ (it was applied ad-hoc, never committed) and recorded the deliberate Wave-1 "no public-IP" exception in security.md + .claude/CLAUDE.md. Superseded the 2026-05-30 port-knock design docs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:23:39 +00:00
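Note on commit 082bdfcc77 (fan-control: thin actuator): the message says the host daemon reads one number per loop, validates it (numeric and not older than STALE_SECS), and otherwise hands the fans back to Dell auto. A minimal sketch of that validation step follows, under stated assumptions: the daemon's real language and code are not shown in this log, and the function name `validate_command`, the 120-second default for `STALE_SECS`, and the use of `None` to mean "fall back to Dell auto" are all hypothetical choices made for illustration.

```python
import math
from typing import Optional

# Assumed value; the log names STALE_SECS but does not state its setting.
STALE_SECS = 120.0


def validate_command(raw: object, updated_at: float, now: float,
                     stale_secs: float = STALE_SECS) -> Optional[float]:
    """Return the commanded fan duty as a float, or None if it must not be applied.

    None stands in for "hand the fans back to Dell auto" (hypothetical convention).
    """
    # Check 1: the value must be numeric. Non-numeric states such as
    # "unavailable" raise ValueError; a missing value (None) raises TypeError.
    try:
        pct = float(raw)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        return None
    # NaN and infinities parse as floats but are not usable commands.
    if not math.isfinite(pct):
        return None
    # Check 2: the value must not be older than the staleness limit.
    if now - updated_at > stale_secs:
        return None
    return pct
```

In this sketch the caller would apply the returned value over IPMI and treat `None` the same way as the other safety conditions listed in the commit message. A range check on the percentage is not mentioned in the log, so none is included here.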

231 commits