infra

Author	SHA1	Message	Date
Viktor Barzin	8a2a3d9eca	Merge remote-tracking branch 'origin/master' into wizard/reconcile-mirror # Conflicts: # scripts/t3-provision-users.sh	2026-06-16 22:32:43 +00:00
Viktor Barzin	63e714782c	immich: remove one-shot anca-elements-import Job + its PVC All of Anca's photos are imported. The Job was declared as kubernetes_job_v1.anca_elements_import — meaning every `terragrunt apply` of the immich stack re-created it, despite the 2026-05-25 in-code comment saying "After successful completion: REMOVE this resource block + apply again." Nobody noticed for 22 days; the re-trigger today (2026-06-16) was the 6th IO-pressure incident — it scanned all 21,643 assets in pure read-scan mode for 51 min, saturated sdc, starved etcd, crash-looped kube-apiserver. Recovery actions taken before this commit: - Throttled nfsd 64→8 on PVE host to give apiserver headroom - `kubectl delete job -n immich anca-elements-import` + force-delete pod - Restored nfsd to 64; cluster healthy Code change here: - Remove `kubernetes_job_v1.anca_elements_import` block - Remove `module.nfs_anca_elements_host` (PVC `immich-anca-elements-host` — no live consumer; videos batch deferred per user, source dump remains on PVE at /srv/nfs/anca-elements, browseable via Nextcloud admin) - Update 2026-05-25 post-mortem: 6th-incident section + new lesson that one-shot Jobs do NOT belong in kubernetes_job_v1 (use a suspended CronJob or a runbook-captured `kubectl create job` ad-hoc invocation instead).	2026-06-16 22:11:27 +00:00
Viktor Barzin	88717c61fd	immich-frame: whole library (last 2y), Ken Burns, weather, 30s interval Per Viktor: show the whole Immich library from the last 2 years instead of the single 'china' album, enable Ken Burns pan/zoom, slow the interval to 30s, and add the weather overlay (London, metric). OpenWeatherMap key is read from Vault (secret/immich -> frame_weather_api_key), not hardcoded. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 21:07:39 +00:00
Viktor Barzin	cffa32fae3	Merge remote-tracking branch 'forgejo/master' into wizard/tripit-ingest-model	2026-06-16 20:39:30 +00:00
Viktor Barzin	14476bfbd7	tripit: mail-ingest extracts with the qwen3-8b text model, not the vision model Forwarded schedule-change emails were being parsed by qwen3vl-4b (a 4B vision model) for text extraction, which reliably dropped the flight number — so the matcher had no key to link on and a forwarded flight update created a duplicate instead of amending the existing segment. Point the ingest-plans CronJob's text extraction at qwen3-8b (verified live: it emits flight_number + a clean PNR, 3/3 on the failing email) and keep qwen3vl-4b for boarding-pass image attachments (LLM_VISION_MODEL). llama-swap loads each on demand; the GPU swap cost is accepted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:39:29 +00:00
Viktor Barzin	0a6ed4b2fe	workstation: per-user playwright browser MCP for all users, reproducible from git Viktor asked that the playwright browser MCP be available for every devvm user in every directory, with each user running their own server and multiple concurrent sessions per user. Before this, playwright was hand-set-up per user (~/.config/systemd/user/playwright-mcp.service on 8931/8932/8933) and only wizard was actually wired — emo's and anca's servers ran but their ~/.claude.json had no playwright entry, so their Claude never connected. None of it was reproducible from git (units, refresh script, and the Vault snapshot token lived only in user homes), so a devvm rebuild would silently lose it. This makes it reproducible and fixes the unwired users: - roster_engine.py: sticky per-user PLAYWRIGHT_PORT (PLAYWRIGHT_BASE_PORT=8931, allocated for every roster user incl. the admin), emitted in the derive JSON. - scripts/workstation/playwright/: system-level TEMPLATE units (playwright-mcp@.service + playwright-snapshot-refresh@.{service,timer}, User=%i — system manager, so no systemd --user / linger) + the refresh script. @playwright/mcp pinned to 0.0.76 (avoids the @latest silent-fleet-roll footgun, same rationale as T3_PIN). - setup-devvm.sh: install the templates + script (9e); stage the chrome-service snapshot bearer token from Vault to a root file (8c) — the hourly root reconcile has no Vault token, mirrors the Claude OAuth staging in 8a. - t3-provision-users.sh: install_playwright() (ALL tiers incl. admin) writes PLAYWRIGHT_PORT, seeds the token if-absent, wires the user-scope ~/.claude.json by running `claude mcp add` AS the user (clobber-proof + if-absent, so it fixes existing/new/admin without rewriting a populated config), and enable --now's the instances (idempotent, never restarts a running server). Also hardened the section-1 .env scan to skip the new playwright-.env files (no T3_PORT -> grep no-match would abort under set -e -o pipefail). - Docs: chrome-service-snapshot runbook (new Provisioning section + system-unit commands), multi-tenancy.md, and the 2026-06-07 plan Task 2.3. Supersedes the hand-made per-user --user units (one-time idle-gated migration to follow on the live host). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:33:47 +00:00
Viktor Barzin	c6a5cbe227	feat(tripit): serve the SPA publicly, keep /api + /metrics forward-auth-gated (ADR-0020 landing) The website 302'd unauthenticated visitors straight to Authentik. Split the tripit.viktorbarzin.me ingress: the SPA shell (everything else) becomes auth=none so the app shows its own Log in / Sign up landing page, while a new tripit-app-api ingress keeps /api + /metrics behind forward-auth — the security boundary, since /api trusts the outpost-injected X-authentik-email. The public SPA gets strip-auth-headers (no spoofed headers can reach the backend) and anti_ai_scraping=false (it's an installable PWA). The existing auth=none carve-outs (calendar, emails/confirm, planner/slack) are longer prefixes and keep winning. Pairs with the tripit landing-page deploy (commit 3fe4da1). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 19:30:58 +00:00
github-actions[bot]	eb47eb1d10	priority-pass: bump image_tag to 63e118c3 [ci skip] Auto-committed by ViktorBarzin/priority-pass GHA on push to main. Source: `63e118c334`	2026-06-16 17:45:33 +00:00
github-actions[bot]	d1f2e50736	priority-pass: bump image_tag to 4ce9e8e8 [ci skip] Auto-committed by ViktorBarzin/priority-pass GHA on push to main. Source: `4ce9e8e894`	2026-06-16 17:44:40 +00:00
github-actions[bot]	46b5f04f67	priority-pass: bump image_tag to 63e118c3 [ci skip] Auto-committed by ViktorBarzin/priority-pass GHA on push to main. Source: `63e118c334`	2026-06-16 17:20:08 +00:00
github-actions[bot]	29ad200026	priority-pass: bump image_tag to 4ce9e8e8 [ci skip] Auto-committed by ViktorBarzin/priority-pass GHA on push to main. Source: `4ce9e8e894`	2026-06-16 17:19:55 +00:00
Viktor Barzin	044444d328	cluster-health: helm check #18 catches pending/failed releases (helm list -a) check_helm_releases used `helm list` without -a, which HIDES pending-upgrade and failed releases — so on 2026-06-16 check #18 reported "All deployed" while the prometheus release sat in pending-upgrade for ~4 days, silently blocking every monitoring terragrunt apply (frozen alert/rule config). Add -a to surface them and flag pending-* (FAIL, blocks applies) + failed (WARN); deployed/uninstalled/superseded stay green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 15:39:06 +00:00
Viktor Barzin	e74f4208f5	t3-backup-state: retention 14 -> 6 (bound devvm root fs) wizard's state.sqlite grew to ~1.1GB and the new gated nightly tracker adds a pre-bump snapshot per bump on top of this daily one; 14 x ~1.1GB would fill the devvm root fs (was trending to ~16GB of wizard backups on a disk with ~9GB free). 6 is ample — rollback only ever needs the most recent pre-bump backup. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 14:26:03 +00:00
Viktor Barzin	cdd9ecd199	t3: docs for the gated nightly tracker (runbook, post-mortem, service-catalog) Phase 4 docs for the enforcer -> gated-tracker change: - runbook t3-version-bump.md: rewritten around the tracker — how each bump is gated, plus freeze/revert/pin/dry-run/manual-rollback ops. - post-mortem 2026-06-09: append the deliberate 2026-06-16 reversal and how the gates close each named root-cause/lesson (historical sections left intact). - service-catalog t3 row: "PINNED 0.0.24 enforcer" -> gated nightly tracker; replace the stale "auto-pair 401-broken on 0.0.26" note (re-verified healthy 2026-06-16, cookieless -> 302 + t3_session). - t3-provision-users.sh step 5b comment: enforcer -> tracker; note Persistent dropped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 11:33:49 +00:00
Viktor Barzin	f4f7705127	monitoring: adopt orphaned alert-digest resources into TF state (unblocks apply) The monitoring stack apply was create-failing on every push with `configmaps "alert-digest-script" already exists` + `secrets "alert-digest" already exists` (modules/monitoring/alert_digest.tf) — both resources exist in-cluster but fell out of Terraform state, so apply tried to CREATE them and errored. Pre-existing (failed on pipelines 203 AND 204, NOT caused by the t3 alert-rules change). Add import {} blocks (TF 1.5+ adoption per AGENTS.md) so apply imports + reconciles instead of failing. Idempotent once imported; safe to remove after a green apply. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 11:31:17 +00:00
Viktor Barzin	36521839fc	t3: gated nightly tracker (replaces pinned enforcer) + drop timer Persistent Phase 2 of "track t3 nightly, accept the risk, but make sure session auth works and revert if it breaks". Rewrites the daily t3-autoupdate from a pinned-version enforcer into a NIGHTLY TRACKER that gates every bump so a bad build self-heals instead of repeating 2026-06-09: - follows the t3@nightly npm dist-tag (T3_TRACK; T3_PIN still works as a hard freeze; /etc/t3-autoupdate.freeze is the manual revert switch); - downgrade-guard (the nightly tag is mutable — never move backward) + channel sanity (target must be a -nightly. build); - pre-bump per-user state.sqlite backup (online VACUUM INTO) BEFORE install, so rollback is a restore not sqlite surgery; - health-check now SEEDS a throwaway instance with a COPY of a real POPULATED state.sqlite, exercising the forward MIGRATION (the actual 2026-06-09 failure class) + the real mint->exchange->t3_session pairing handshake before trusting a build. Scratch dir is on /var/tmp (disk), not the 2G tmpfs /tmp; - canary rollout: restart idle instances ONE AT A TIME, verify pairing through the real dispatch after each, and on the first failure roll back (binary + that user's DB from the pre-bump backup) AND self-freeze so it can't re-flap onto bad builds. Active-agent instances are deferred, never killed. Rollback target is the recorded LAST-GOOD, not "whatever was installed"; - DRY_RUN mode (T3_DRY_RUN=1) previews the gate against a temp-prefix install — validated: 0.0.28-nightly.20260616.571 PASSES the populated-DB migration gate. timer: drop Persistent=true (a missed 04:00 must not fire a real bump on boot mid-day with users active — a 2026-06-09 contributing factor). setup-devvm.sh: install t3@nightly on fresh boxes (no state to break), in sync. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 10:08:12 +00:00
Viktor Barzin	994d305d04	t3: session-auth detection for the gated nightly tracker (dispatch fallback logging + Loki alerts) Before auto-tracking t3 nightly builds (Viktor's call, risk accepted), stand up the detection that was missing on 2026-06-09 — when an auto-pulled nightly broke pairing for ALL users and nothing alerted. Viktor's explicit requirement: make sure session auth keeps working and revert if the pairing fallback/failure rate climbs. This is phase 0 (detection) of that work. - t3-dispatch: exchangeCredential now reports WHICH pairing endpoint answered, and autoPair logs every outcome (paired user=.. endpoint=.. fallback=..) — so the real-user browser-session->bootstrap fallback rate is observable. A non-zero rate flags that a build moved the pairing API (the 2026-06-09 class). - Loki ruler alerts (devvm journal -> Alertmanager -> Slack): T3PairingBroken (real users failing to pair), T3PairFallbackHigh (build moved the pairing API), T3AutoUpdateRolledBack / RollbackFailed / Frozen (enforcer outcomes). Closes the post-mortem's open "nothing monitors end-to-end pairing" detection gap. The existing t3-probe only checks GET /api/auth/session==200, which stays 200 even when pairing is dead, so it never caught the outage class. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:56:55 +00:00
Viktor Barzin	e783cae2cb	chrome-service + mam-farming: doc clarifications (+ re-trigger CI apply missed earlier) Two small doc additions that also re-include these stacks in Woodpecker's changed-stack detection. The earlier 2-commit push left chrome-service out of the HEAD~1..HEAD diff so its ignore_changes fix never applied; the monitoring apply was separately blocked by a stuck prometheus pending-upgrade (now cleared). - chrome-service: note the live pod's container order had drifted from this file's order, so a TF apply reorders them (containers[0] differs live-vs-TF until the apply lands) -- documents the confusion this caused during diagnosis. - mam-farming: cross-ref the grabber script that emits mam_grabber_last_run_timestamp. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:34:23 +00:00
Viktor Barzin	b0e8e3599f	nfs-mirror: exclude SQLite WAL/SHM sidecars + treat rsync exit 24 as success NfsMirrorFailing fired ~13% of nights (3/23 runs, all rsync exit 24). Root cause: calibre-web-automated keeps a WAL-mode SQLite queue.db on /srv/nfs, whose -wal/-shm sidecars are created/checkpointed/deleted constantly and vanish between rsync's file-list scan and the transfer ("file has vanished" -> exit 24). The mirror actually completes every run; only transient files disappear. Two fixes: (1) exclude -wal/-shm/*-journal -- these must never be in a raw mirror anyway (a WAL without an atomic .db snapshot is useless to restore; daily-backup makes the consistent SQLite copies). (2) Treat rsync exit 24 as success-with-warning so the run still appends to the offsite manifest (a code-24 night previously skipped that, delaying those changes to the monthly full sync) and the alert stops false-firing. Deployed to the PVE host via scp to /usr/local/bin/nfs-mirror (host script, not TF). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:34:22 +00:00
Viktor Barzin	2479560fa2	mam-farming: make MAMFarmingStuck a grabber heartbeat, not a grab-count check MAMFarmingStuck fired whenever the freeleech grabber added 0 torrents in 4h, but grabbing 0 is normal: the grabber searches a random catalogue offset each run and legitimately finds nothing when freeleech is dry (account ratio was a healthy 37.5; the alert even misreported it as "0.00" because $value was the grabbed count, not the ratio). The alert's real intent was to catch the grabber not running at all (CronJob Forbid-blocked / wedged), but increase(grabbed[4h])==0 cannot distinguish "didn't run" from "ran, nothing to grab" since Pushgateway serves the last pushed value forever. The grabber now heartbeats mam_grabber_last_run_timestamp on every completed run (main success, ratio/mouse skip, and qBittorrent-unreachable paths). The alert fires only when that heartbeat is >4h stale — the true stuck condition. Cookie expiry and qBittorrent-down keep their own dedicated alerts. Surfaced by /cluster-health as a false-firing alert. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 08:18:33 +00:00
Viktor Barzin	a0725ede57	chrome-service: stop ignoring container[0].image so TF re-asserts the pinned browser image The chrome-service container (container[0]) runs the pinned Microsoft Playwright image, which ships chromium under /ms-playwright. Its image was still listed in the deployment's lifecycle ignore_changes — a leftover KEEL_IGNORE from before ADR-0002 #29 moved the novnc container to TF management. With that field ignored, a stray clobber of container[0] to ghcr chrome-service-novnc:latest (which has no chromium there) stuck permanently: the container crash-looped ~12h on "chromium binary not found under /ms-playwright" (273 restarts) and TF could not revert it. Remove container[0].image from ignore_changes so Terraform pins it to local.image and re-asserts it on every apply. Both containers are TF-managed now (novnc since ADR-0002 #29); Keel is inert (policy=never), so nothing should fight TF here. Surfaced by /cluster-health. Live state was already restored transiently via kubectl set image; this commit makes the fix durable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 08:18:32 +00:00
Emil Barzin	1ba453c65d	fan-control docs: sync runbook/env/service/design to the HA-actuator + anti-flap model The committed docs still described the 2026-06-04 presence-aware daemon. Bring them in line with what is actually deployed: HA computes the setpoint, the host is a thin actuator (COMMAND_ENTITY/STALE_SECS/HA_GRACE_SECS), additive bias, anti-flap hold-last, and the new HA readout sensors (command/equilibrium/cpu_load/fan_speed_avg/fan_power_avg). Earlier doc edits were made in a clone lost in the workstation reshuffle; re-created here. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 08:11:48 +00:00
Emil Barzin	5bc3d27d1b	Merge remote-tracking branch 'forgejo/master' into emo/fan-control-ha-actuator	2026-06-16 08:08:27 +00:00
Emil Barzin	2cfe338419	fan-control: hold last command through transient HA losses (stop fan flapping) The actuator dumped the fans to Dell auto on every brief loss of the HA command (~14% of the time, every few minutes) — crashing them to the ~7100 rpm floor and bouncing back: the "fans surge then crash then surge" the owner reported. Causes: the command sensors last_updated going >120s old whenever CPU temp sat flat (mis-read as stale), plus occasional unavailable blips. Fix: on a missing/stale command, HOLD the last applied % for up to HA_GRACE_SECS (300s) instead of falling back, and loosen STALE_SECS 120->1800 (staleness only happens at flat temp, where the held value is still valid). The 83C CPU CEILING on our own IPMI read stays the real overheat safety. Verified live: fallback 14% -> 0% over 8h, command std 16 -> 3, no more rpm floor crashes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 08:07:52 +00:00
Viktor Barzin	57d45d8d8f	fix(authentik): pin Vault binding UUIDs as literals (provider has no authentik_application data source) CI pipeline 198 failed: the pinned goauthentik/authentik provider has no data "authentik_application" source, so terraform failed the whole authentik plan and applied NOTHING (state unchanged). Replace the data-source lookups with the live pbm_uuid (Vault app) and group_uuid (Allow Login Users) as literals; authentik_policy_binding is supported (used in guest.tf). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 22:01:29 +00:00
Viktor Barzin	aa461b95bc	feat(authentik): bind Vault OIDC app to Allow Login Users (close ADR-0020 OIDC gap) Audit found the Vault Authentik application had no authorization binding, so any authenticated identity (incl. a future self-enrolled TripIt External user) could complete Vault OIDC login and get a built-in default-policy token. Bind it to 'Allow Login Users' — existing homelab users inherit that group via its children (verified User.all_groups() includes the parent), parentless TripIt External users are excluded. Closes the only OIDC app the forward-auth fence does not cover. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 21:48:04 +00:00
Viktor Barzin	cbca281aaa	feat(authentik): TripIt external self-signup group + forward-auth fence (ADR-0020) Viktor wants people outside the homelab to self-register to TripIt with email + a passkey (no password), kept separate from the rest of the homelab. Adds the empty, parentless 'TripIt External' Authentik group and a first-position branch in the catch-all policy that admits those users to tripit.viktorbarzin.me only and denies every other forward-auth host. Inert on apply (group empty => matches no existing user => no lockout). An adversarial review found the fence is forward-auth-only, so the runbook records the OIDC-app containment audit (every sensitive app already requires a trusted group External users won't hold), the Vault->Allow Login Users binding that closes the one open OIDC app, the SMTP prerequisite for email verification, and the before/after access-matrix verification. Flows/SMTP/Vault binding are UI steps per the runbook; the push that applies the catch-all edit must be human-watched (CI auto-applies the authentik stack). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 21:48:04 +00:00
Viktor Barzin	cf51cb45de	docs(adr-0003): keep Forgejo canonical, complete the GitHub mirror (reject swap) Grilled the 'swap Forgejo for GitHub' idea. Root cause of the divergence pain is an incomplete push-mirror rollout (14 repos dual-pushed, push_mirrors=0), not Forgejo itself — and CONTEXT.md already documents Forgejo-canonical + one-way GitHub mirror. Decision: don't swap; finish the mirror, name the GitHub-first exceptions, reconcile infra, enforce one-remote-per-clone. Adds ADR-0003 + the GitHub-first repo glossary term + dual-push/force-overwrite warnings on Canonical repo / GitHub mirror. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-15 21:32:28 +00:00
Viktor Barzin	5d3a166b94	t3-afk: fix agent Bash — stop mounting into ~/.claude Root cause of "the agent never commits": the issue-implementer CLAUDE.md was subPath-mounted at /home/node/.claude/CLAUDE.md, which made /home/node/.claude root-owned. The agent (uid 1000) then couldn't create its Bash session-env there, so EVERY Bash/git call failed (Write/Edit worked, so it silently edited but never committed). Found by reading the agent transcripts from state.sqlite -> projection_thread_messages. Fix: don't mount anything into ~/.claude (it's not honored by T3's SDK anyway). Behaviour is injected via the dispatch message preamble by the control plane; files/issue-implementer-CLAUDE.md kept as the canonical source text. Verified post-fix: a preamble-dispatched task edited README and COMMITTED (073ab28) unattended. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 20:49:34 +00:00
Viktor Barzin	34c30ac2bf	t3-afk: auto-pair dispatcher sidecar — no manual pairing The bare `t3 serve` behind Authentik showed the manual /pair#token screen, which didn't connect. Mirror the devvm t3-dispatch: a small stdlib-Node sidecar fronts t3 serve, and on a cookieless (already Authentik-gated) document load it mints a pairing credential (`t3 auth pairing create`) and exchanges it at /api/auth/browser-session for the t3_session cookie, then 302s back. Everything else — including WebSocket upgrades for the live cockpit — reverse-proxies to :3773. The Service now targets the sidecar (:8080). Verified: cookieless GET -> 302 + Set-Cookie t3_session; cookied GET -> 200 SPA. Matches the t3.viktorbarzin.me experience (Authentik login -> straight into the cockpit). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 20:19:39 +00:00
Viktor Barzin	92c5b24975	docs: ghcr_pull_token is now a scoped read:packages PAT, not the admin alias Minted a dedicated classic GitHub PAT scoped to read:packages and stored it in Vault secret/viktor/ghcr_pull_token (2026-06-15), replacing the previous alias of the broad admin github_pat. Propagated via targeted apply of module.kyverno.kubernetes_secret.ghcr_credentials (Kyverno re-syncs the allowlisted namespaces). Document the new cred + the manual rotation recipe. Closes: code-h2il Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-15 20:19:17 +00:00
Viktor Barzin	ef555c7e02	workstation: put ~/.local/bin on PATH so the launcher finds native claude Viktor hit "~/.local/bin is not part of the PATH". Root cause: the native claude binary lives in ~/.local/bin, but the terminal launcher (start-claude.sh) runs in tmux's NON-login bash env, which doesn't source the user's shell rc where the native installer put ~/.local/bin on PATH. So `command -v claude` failed there → the launcher's bootstrap re-ran the native installer → the installer printed the PATH warning. (Interactive zsh already had ~/.local/bin via the per-user installer rc edit, and t3-serve sets PATH in its unit — so only the terminal launcher was affected.) - skel/start-claude.sh: prepend ~/.local/bin to PATH near the top (guarded/idempotent), before the launch logic — so `claude` is found, no reinstall, no warning. - setup-devvm.sh: install /etc/profile.d/10-local-bin.sh — adds ~/.local/bin to PATH for all LOGIN shells machine-wide (SSH etc.), independent of the per-user installer rc edit (fresh-user-safe). zsh login picks it up via /etc/zsh/zprofile -> /etc/profile. - docs/architecture/multi-tenancy.md: documented the three PATH-injection points. Verified: guard adds-when-missing / no-dup-when-present; all scripts pass bash -n. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:20:03 +00:00
Viktor Barzin	eecd78233b	workstation: standardize on the native claude install (drop npm-global + npx) Question from Viktor: should claude run via the binary or npx? Answer: the native install is the recommended runtime (self-contained, self-updating ~/.local/bin/claude; installMethod=native) — and every existing user had already auto-migrated to it, leaving the npm-global copy empty and the npx fallback dead. "Leave only the recommended setup": - setup-devvm.sh: node is now installed ONLY for the t3 CLI; dropped the machine-wide `npm install -g @anthropic-ai/claude-code` (npm/npx is not the recommended runtime and just shadowed the per-user native installs). - t3-provision-users.sh: new per-user `install_user_claude_native` (runs the official https://claude.ai/install.sh AS the user, idempotent/skip-if-present) — provisions native claude for BOTH the terminal launcher and each t3-serve instance, replacing the npm bootstrap. - skel/start-claude.sh: launcher runs the native `claude` only; if missing it bootstraps via the native installer (was an `npx @anthropic-ai/claude-code` fallback). - docs/architecture/multi-tenancy.md: documented the native-only runtime model. node stays (the pinned t3 CLI is npm-global). Verified: native installer reachable + produces ~/.local/bin/claude 2.1.177; all three scripts pass bash -n + shellcheck. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:12:05 +00:00
Viktor Barzin	4a48f065e9	mcp: drop project-scoped paperless from .mcp.json (paperless is now wizard-only) Paperless is a personal tool for wizard, not shared. It was project-scoped in the infra repo's .mcp.json (the in-cluster paperless-mcp proxy), so every user whose ~/code IS an infra clone (emo, ancamilea) auto-loaded it. Per request, paperless should be wizard-only: wizard now runs his own direct, token-based paperless MCP in his user-scope config (a local barryw/paperlessmcp container -> paperless-ngx). Removing the shared entry so emo and other infra-clone users no longer get it; the `ha` MCP stays project-scoped. emo's clone drops it on next freshen. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:03:37 +00:00
Viktor Barzin	bb3f5f2329	workstation: stop the Claude Code onboarding wizard reappearing for terminal users emo reported being "logged out" on terminal.viktorbarzin.me: every new shell dropped him at the first-run "Choose the text style" wizard, even though he'd used many sessions and is in fact fully authenticated. Root cause is NOT a logout — ~/.claude.json is a single file that all of a user's concurrent claude processes (the ttyd terminal + their t3-serve instance + agent sessions) read-modify-write, and a stale writer periodically drops top-level keys, including hasCompletedOnboarding. That bounces the next interactive session back to onboarding; credentials are safe in the separate ~/.claude/.credentials.json (which is why T3 kept working). wizard's own ~/.claude.json showed the same key loss, so this hits any heavy multi-session user. Fix: - skel/start-claude.sh: ensure_onboarding() idempotently re-asserts hasCompletedOnboarding (+ lastOnboardingVersion) in ~/.claude.json right before launching claude. Merge-only (never clobbers other keys), runs as the user, and no-ops if jq is missing or the file is empty/corrupt. So even if the race drops the flag, the next launch restores it before claude reads it. - t3-provision-users.sh: deploy_user_launcher() re-copies skel/start-claude.sh into every non-admin home (copy-if-changed) on the hourly reconcile. /etc/skel only seeds the launcher at account creation, so without this the fix (and any future launcher edit) would never reach existing users. .tmux.conf is deliberately not re-copied — terminal-lobby appends a managed section to it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:37:59 +00:00
Viktor Barzin	82a0c5aedf	t3-afk: fix crashloop — exclude from Keel at the deployment level Keel "patch"-downgraded the image docker.io/library/node:24 -> library/node:24.0.2, which is below t3@0.0.27's required node >=24.10, so `t3 serve` exited silently and the pod crash-looped (~160 restarts / 13h). Root cause: keel.sh/policy=never was on the POD-TEMPLATE labels, but Keel reads the policy at the DEPLOYMENT level. The cluster's Kyverno inject-keel-annotations is opt-out, so it stamped policy=patch and Keel acted on it. Fix: set keel.sh/policy=never as a deployment-level annotation; ignore_changes the Kyverno-injected keel.sh/pollSchedule + keel.sh/trigger annotations; the image stays TF-owned (apply reverted Keel's downgrade). Pod now 1/1, t3 serve 200. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 10:32:38 +00:00
Viktor Barzin	214638216b	fix(anisette): wait_for_rollout=false so a slow first start can't strand the deploy out of state. The docker.io fix created the deployment, but wait_for_rollout (default true) then hung on the OOMing pod and the apply failed — leaving the deployment in the cluster but NOT in terraform state, so every later apply hit 'deployments.apps "anisette" already exists'. Deleted that orphan and set wait_for_rollout=false (mirrors tts/llama-cpp slow-start services); readiness probe still gates Service traffic. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 20:56:30 +00:00
Viktor Barzin	d8c60d7ab8	t3-afk: dedicated in-cluster T3 Code instance (AFK executor + cockpit). Slice #2 of claude-agent-service PRD #1 (AFK implementation pipeline). Dedicated in-cluster T3 Code instance the control plane dispatches issues into; runs the issue-implementer agent in a git worktree with a live cockpit. Applied + live 2026-06-14 (9 resources). Pilot-fast: stock docker.io/library/node:24 + install pinned t3@0.0.27 + Claude CLI at startup onto an SSD-NFS PVC. Authentik-gated ingress. issue-implementer behaviour ships as a user-level ~/.claude/CLAUDE.md (T3 hardcodes the system prompt; settingSources loads it) and forbids plan-mode/clarifying-questions so unattended threads don't stall. Keel-excluded (ADR 0003). wait_for_rollout=false (slow first start). Image fully-qualified for the Kyverno trusted-registries allowlist; container mem limit 4Gi (tier-aux LimitRange cap). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 20:06:33 +00:00
Viktor Barzin	bc7b28244f	fix(anisette): raise memory limit to 512Mi — 128Mi OOMKilled at startup. The pod CrashLooped with OOMKilled (exit 137): anisette downloads and initializes Apple's CoreADI provisioning library on startup, spiking past the 128Mi limit before it can bind :6969 (empty logs, liveness 'connection refused'). Bump request 256Mi / limit 512Mi; steady state is much lower. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 19:54:13 +00:00
Viktor Barzin	96addf65b4	fix(anisette): docker.io/ image prefix to pass Kyverno require-trusted-registries. First apply was denied at admission — a bare dadoum/anisette-v3-server@sha256 ref isn't in the trusted-registries allowlist (only enumerated DockerHub user-repo prefixes are). docker.io/* IS allowlisted, so use the explicit registry prefix; still pulls via the 10.0.20.10 pull-through cache. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 19:47:05 +00:00
Viktor Barzin	0bfa6f0774	feat(anisette): self-hosted Apple anisette server for SideStore (infra #40). Deploy a small stateless anisette-data server so the TripIt iOS Shell can be sideloaded with SideStore using a free Apple ID, without brokering the Apple-ID auth dance through a public third-party anisette server (which would see every login). SideStore points at a stable internal endpoint we control. - Image: Dadoum/anisette-v3-server, the de-facto standard anisette-v3 server for SideStore/AltStore. Upstream ships only a mutable :latest (no GitHub releases / semver / sha tags), so pinned by manifest digest instead of a tag per the "never :latest" rule. Pulled from DockerHub via the registry-VM pull-through cache like echo/cyberchef. Diun watches :latest (notify-only) so a new upstream build prompts a digest re-pin. - Stateless: emptyDir backs the provisioning-library cache dir (regenerable download; upstream issue #23 means it doesn't preserve client auth across restarts anyway) — no PVC, no Vault secret. - Internal-only endpoint http://anisette.viktorbarzin.lan (auth=none, allow_local_access_only, ssl_redirect off) — SideStore is a native client that can't do the Authentik cookie dance, same reasoning as android-emulator's adb. The .lan CNAME is auto-created by technitium-ingress-dns-sync; never publicly exposed. Mirrors the echo/networking-toolbox/android-emulator stack pattern. Service catalog updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 19:35:57 +00:00
Viktor Barzin	fe1f8d62e7	tripit: re-apply tripit stack to land CITY_IMAGE_PROVIDER=wikipedia. The commit that enabled real city cover photos (`a69847a0`, CITY_IMAGE_PROVIDER=wikipedia, #47) was committed to master but its CI run skipped the tripit stack apply (changed-stack diff race — same class as the prior "re-apply after pipeline race" fixes). The env never landed in-cluster, so the provider stayed on its fake 1x1-PNG default and every trip/stay cover rendered blank/placeholder in prod. This comment touch forces CI to re-apply the tripit stack; terraform then reconciles the drift (desired HCL already has the env) so the deployment picks up CITY_IMAGE_PROVIDER=wikipedia. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 17:45:07 +00:00
Viktor Barzin	2df6ebf305	health: fix middleware ref namespace prefix (restore site from 404). My previous commit referenced the new limiter as `health-rate-limit@kubernetescrd`, omitting the namespace prefix. Traefik CRD middleware refs are `<namespace>-<name>@kubernetescrd`, and the Middleware lives in the `traefik` ns, so the router couldn't resolve it — Traefik failed the whole health.viktorbarzin.me router and returned 404 on every path (the app + pod were healthy throughout; verified via port-forward). Correct it to `traefik-health-rate-limit@kubernetescrd`, matching the working traefik-tripit-rate-limit / traefik-actualbudget-rate-limit references. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 17:43:08 +00:00
Viktor Barzin	086ff85911	health: dedicated 100/1000 rate limit for the redesigned SPA. Viktor hit 429s browsing the redesigned health app. The default shared limiter is 10 req/s / burst 50, but each page load is the shell (JS chunks + two self-hosted Geist woff2) plus a 5-8 call API burst, so fast tab-to-tab navigation from one client IP overruns burst 50 — Traefik 429s the tail and the affected cards/pages render empty. Give health its own limiter (average 100, burst 1000) and skip the default, exactly as tripit/immich/actualbudget/ha-sofia already do for the same parallel-burst pattern. Attached via the ingress_factory escape hatch (skip_default_rate_limit + extra_middlewares). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 13:03:51 +00:00
Viktor Barzin	6dc77f4612	uptime-kuma: add CONTEXT.md + ADR-0001 (intentionally lean; sizing/placement review). Documents the 2026-06-13 right-sizing review: Kuma is already lean (~1 check/s, 227 monitors mostly at 300s, 77MB on shared MySQL, 30d retention); the 'scraping too much' concern traced to a fixed socket.io login-timeout incident, not load. Records the deliberate decisions (keep per-service [External] monitors over canaries; keep datastore on shared mysql.dbaas) with rejected alternatives + rationale, plus the known internal-sync no-prune gap (stale Goldilocks monitor cleaned up by hand). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 09:11:22 +00:00
Viktor Barzin	05bec26d09	health: internal test-access ingress + DEV_AUTH_EMAIL (ADR-0008). Add health-test.viktorbarzin.lan (auth=none, allow_local_access_only, anti-AI off) pointing at the same health deployment, plus a DEV_AUTH_EMAIL=vbarzin@gmail.com env on the container. Lets automated E2E / Playwright / manual screenshots reach the live app without the Authentik SSO redirect, for testing — while the public health.viktorbarzin.me ingress stays auth=required (forward-auth fails closed, so the public path always carries the real X-authentik-email header and never hits the DEV_AUTH_EMAIL fallback). LAN-only, no public exposure. Decision recorded in health repo ADR-0008. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 04:02:34 +00:00
Viktor Barzin	e6699ed20b	uptime-kuma: retry Kuma login in monitor-sync jobs (intermittent socket.io timeout). The internal + external monitor-sync CronJobs intermittently failed with socketio.exceptions.TimeoutError on api.login(), firing JobFailed -> Slack noise (and leaving monitor sync stale). Kuma 2.3.2 itself is healthy (1/1, 30m CPU); its single Node event loop just briefly stalls under ~300 monitors so the socket.io login handshake occasionally exceeds the client timeout. Wrap connect+login in a 5-attempt / 15s-backoff retry (disconnecting the half-open client between tries) so a transient stall no longer fails the whole job. Applied to both sync scripts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-13 20:54:14 +00:00
Viktor Barzin	a6381b8cf8	forgejo: custom 8Gi ResourceQuota (was pegged at the 4Gi tier cap). Yesterday's Forgejo 3Gi->4Gi OOM fix pushed its tier-3-edge namespace quota (requests.memory=4Gi) to 100%, firing KubeQuotaAlmostFull + the healthcheck resourcequota check. Forgejo is the git + OCI-registry backbone and legitimately needs ~4Gi, so the edge tier's 4Gi ceiling is too tight. Opt the namespace out of the auto tier quota (resource-governance/custom-quota=true) and define a forgejo-specific ResourceQuota at requests.memory=8Gi, so the 4Gi pod sits at ~50% with headroom. Same opt-out pattern dbaas uses. Re-tiering was rejected: tier 1-cluster is also 4Gi, and 0-core (8Gi) would over-classify Forgejo's priority/eviction. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-13 17:16:47 +00:00
Viktor Barzin	72982683bc	docs(CLAUDE.md): k8s-portal now GHA->ghcr, not a Woodpecker build. k8s-portal was the last in-cluster image builder. Its .woodpecker/k8s-portal.yml was deleted; it now builds on GHA (build-k8s-portal.yml) -> PRIVATE ghcr, pulled via the Kyverno ghcr-credentials allowlist and deployed by Keel. Fix the CI/CD section: drop k8s-portal from the Woodpecker-pipelines list (stale), move it from 'already on GHA' to the infra-owned private-ghcr images, and add it to the PRIVATE ghcr allowlist roster. Completes the no-local-builds migration. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 16:10:56 +00:00
Viktor Barzin	25a39fd54e	k8s-portal: wire private-ghcr pull (allowlist + imagePullSecrets). k8s-portal was the last in-cluster image build; it now builds on GHA and pushes ghcr.io/viktorbarzin/k8s-portal:latest, which is PRIVATE (infra repo default). To pull it: add k8s-portal to the sync-ghcr-credentials Kyverno allowlist (clones the ghcr-credentials Secret into the namespace) and reference that secret via imagePullSecrets on the deployment — same wiring as tripit/recruiter-responder. Completes the no-local-builds migration so nothing builds container images on the cluster anymore (ADR-0002). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 15:38:42 +00:00


4339 commits