infra

Author	SHA1	Message	Date
Viktor Barzin	464e0bfb97	Revert "feat(monitoring): Tempo + OTel Collector for tripit tracing (ADR-0032 Phase 2)" All checks were successful ci/woodpecker/push/default Pipeline was successful Details This reverts commit `7513468a2d`.	2026-06-22 06:46:56 +00:00
Viktor Barzin	7513468a2d	feat(monitoring): Tempo + OTel Collector for tripit tracing (ADR-0032 Phase 2) Some checks failed ci/woodpecker/push/default Pipeline failed Details Stand up the cluster's first trace store + OTLP ingress so tripit's OpenTelemetry spans (Phase 1, already live in prod) export and correlate with logs: - Grafana Tempo (single-binary, filesystem on proxmox-lvm 20Gi, 30d) - OTel Collector (contrib; otlp -> redaction deny-list backstop -> batch -> tempo) - Grafana: a Tempo datasource + an ADDITIVE trace_id->Tempo derivedField on the Loki datasource (no uid change, so existing dashboards are unaffected) - tripit deployment: LOG_FORMAT=json + OTEL_EXPORTER_OTLP_ENDPOINT -> the Collector Additive (new helm releases; Loki/Prometheus/Grafana untouched). Offline 'terraform validate' clean; full plan+apply runs in CI (locked git-crypt blocks a local plan as non-admin). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 06:31:11 +00:00
Viktor Barzin	1a32c07ffe	docs(eso): Phase 1 done (0.16.2) + confirmed Phase 2 GC findings All checks were successful ci/woodpecker/push/default Pipeline was successful Details Execution log added to the ESO migration plan. Phase 1 complete: ESO at 0.16.2 (both v1beta1+v1 served). Phase 2 findings confirmed live: apiVersion bump forces a kubernetes_manifest REPLACE, and ESO ESs use creationPolicy=Owner (target Secret ownerRef → cascade-GC risk on the replace's delete). Phase 2 must snapshot Secrets + empirically validate GC-survival on the first live ES + per-stack two-phase -target apply (fallback: state rm + import). Corrected the doc's k8s assumption (cluster is on 1.34; whole climb stays on 1.34, no interleave). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 20:44:50 +00:00
Viktor Barzin	5e3fe2e8e2	docs(plans): ESO 0.12->2.6 (v1beta1->v1) migration design — the last k8s-1.35 blocker Design doc for migrating External Secrets Operator off v0.12 (k8s <=1.31), now the ONLY remaining compat-gate blocker for autonomous k8s 1.35 (kyverno cleared to 1.18.1 today). Decisive findings: NO v1beta1->v1 conversion webhook, so all 104 ExternalSecrets (across 73 stacks) + 2 ClusterSecretStores must be rewritten to external-secrets.io/v1 (byte-identical apiVersion bump) while on 0.16.2, BEFORE crossing 0.17 (which removes v1beta1 — the point of no return). Step one minor at a time (no skipping); chart==app version; downstream Secrets survive. 5-phase ordered plan + per-phase rollback + the plan-time data.kubernetes_secret -target gotcha (15 stacks) + Tier-0/SOPS handling. Plan only — nothing applied. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 17:27:37 +00:00
Viktor Barzin	ead876ec65	k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases All checks were successful ci/woodpecker/push/default Pipeline was successful Details Adds a daily visibility layer so every night's autonomous-upgrade outcome is reviewable at a glance during the upgrade-cleanup window (Viktor: "track every night's upgrade for the next 7 days; clean up all bugs and blockers"). Last night (2026-06-20) confirmed BOTH prior fixes work in production: the detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35. What's here: - CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning: running version, detector freshness, detected target, outcome (no-op / blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs. Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap. Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack). - K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.` to `k8s-upgrade-(preflight\|master\|worker\|postflight)-.` so the new report job (or any future helper) can't false-trip the chain-wedged alarm. Manual state repair (no git artifact): imported the orphaned `alert-digest` CronJob into the monitoring stack state (`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`). Root cause: when alert_digest was added (2026-06-12) the apply recorded its ConfigMap + Secret but not the CronJob, so every full monitoring apply since has failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline 298 today) — surviving only via targeted prometheus applies. Now in state, so monitoring CI applies cleanly again. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 16:57:44 +00:00
Viktor Barzin	7270e2be3b	monitoring: K8sUpgradeChainJobFailed must not double-fire on a compat-gate block Some checks failed ci/woodpecker/push/default Pipeline failed Details Last night (2026-06-20) the detector + compat-gate fixes worked: the chain resolved target 1.35.6 and the gate correctly REFUSED it (ESO 0.12 + kyverno 1.16 don't support 1.35), pushing k8s_upgrade_blocked=1 -> K8sUpgradeBlocked fired as designed. But the refusal also made the preflight Job exit 1 (block() exits 1 on purpose so the Failed Job re-spawns nightly), which tripped K8sUpgradeChainJobFailed too — a duplicate, misleading "pipeline wedged" alarm for what is the intended halt-and-alert outcome. Fix: gate the alert with `unless on() k8s_upgrade_blocked == 1`. A deliberate block sets that gauge (and it stays 1 until the next preflight resets it), so the chain-job-failed alert is suppressed for the blocked period; a genuine wedge / crash / halt-on-alert exits 1 WITHOUT setting it, so it still fires (preserving the alert's original purpose — catching the pre-in_flight preflight failure that hid the 5-day 1.34.9 wedge). Runbook + automated-upgrades docs updated to match. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 16:35:35 +00:00
Viktor Barzin	ceae4d5f06	docs: rewrite CrowdSec enforcement architecture (firewall-bouncer + CF WAF; Yaegi plugin removed) The Traefik Yaegi CrowdSec bouncer plugin was dead on Traefik 3.7.5 (handler never invoked) and has been removed. Document the replacement: in-kernel nftables drop via cs-firewall-bouncer on direct hosts, and a Cloudflare IP-List + zone WAF block rule (fed by a LAPI->CF-list sync CronJob) on proxied hosts. Both add zero per-request latency and fail open. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 13:39:26 +00:00
Viktor Barzin	68d9058f85	cleanup: fully remove orphaned council-complaints app The council-complaints app (Islington civic-reporting pilot) has been abandoned. It was already dead in the cluster (deployments scaled 0/0, image only on the decommissioned registry.viktorbarzin.me which 404s), and it was never in Terraform — only docs + a kyverno comment referenced it. Its live cluster resources (namespace, both NFS-backed PVs, ingresses) were torn down out-of-band via kubectl (nothing in TF to drift from); the DB-dump PVC was backed up to NFS first. This removes the remaining repo references to the live app: - service-catalog.md: drop the council-complaints row - ci-cd.md + .claude/CLAUDE.md: drop it from the GHA->ghcr app list - kyverno require-trusted-registries: the registry.viktorbarzin.me/* allowlist comment claimed council-complaints as the last referencer; rewrite it (no live workload pulls from that registry now; only stale completed Job records still carry the ref). The allowlist line itself is kept (registry-scoped, not app-specific). Historical point-in-time plan docs (docs/plans/2026-05-16-auto-upgrade- apps-{design,plan}.md) still mention it inside a frozen "10 GHA-migrated repos (memory id=388)" snapshot; left as-is so the dated record stays accurate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 13:32:10 +00:00
Viktor Barzin	5a136c7d53	docs: t3-migrate-idle runbook section + service-catalog + design status Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 12:40:46 +00:00
Viktor Barzin	0cebeeb0ee	t3-idle-migrate: implementation plan Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 12:26:05 +00:00
Viktor Barzin	9503bed589	t3-idle-migrate: design for graceful overnight restart of deferred t3-serve instances Viktor hit the t3 'Client and server versions differ' warning. Root cause: the daily gated autoupdate defers a user's t3-serve restart whenever that user has an active agent at the 04:00 window, so anyone busy every night (long-lived/AFK sessions) never migrates and the client/server version skew persists for days. This design adds a small idle-gated overnight job that drains those deferrals -- restarting a deferred instance onto the current binary only when no turn is in flight (state.sqlite active_turn_id) and it's been quiet for a buffer, so the migration lands in a real quiet gap instead of killing in-flight agent turns. Reuses the autoupdate's proven backup->restart->verify->recover path via a shared helper (approach C from the brainstorm). Design doc only; no behavior change yet. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 12:04:22 +00:00
Viktor Barzin	b1bbe42821	homelab ha token: dedicated openclaw/ha-tokens secret + least-priv RBAC for emo Some checks are pending Build infra CLI / build (push) Waiting to run Details ci/woodpecker/push/default Pipeline was successful Details `ha token` originally read openclaw/openclaw-secrets -> skill_secrets, which only cluster admins can read — so it hung/failed for the non-admin operator it was built for (emo = emil.barzin@gmail.com, OIDC group "Home Server Admins", whose identity is deliberately barred from secrets in the openclaw namespace). Split the HA tokens into a dedicated secret openclaw/ha-tokens (keys sofia/london) with a Role + RoleBinding granting `get` on JUST that secret to the Home Server Admins group (k8s RBAC can't scope to a JSON sub-key, hence a separate object). emo now resolves the HA token with their own identity, WITHOUT gaining the rest of skill_secrets (slack_webhook, uptime_kuma_password). openclaw's own deployment keeps reading openclaw-secrets — purely additive. - stacks/openclaw/ha_tokens.tf: new secret + least-privilege Role/RoleBinding - cli/cmd_ha.go: read openclaw/ha-tokens (raw base64 per-instance key); drop JSON parse - README + ADR-0012 updated; VERSION -> v0.7.1 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-21 10:45:32 +00:00
Viktor Barzin	6d5d3726d6	Merge remote-tracking branch 'origin/master' into wizard/ha-cli-verbs Some checks are pending Build infra CLI / build (push) Waiting to run Details ci/woodpecker/push/default Pipeline was successful Details	2026-06-20 23:46:29 +00:00
Viktor Barzin	48225f2dea	homelab CLI v0.7: add `ha token` + `ha ssh` for Home Assistant Mined another devvm user's Claude sessions for repeated, hand-rolled command patterns worth absorbing into the shared CLI. The dominant signal was Home Assistant "Sofia" work: a `kubectl \| base64 \| jq` token-extraction pipeline re-derived ~420x, and a bespoke non-interactive `ssh -o …` invocation reinvented ~30x — every session. The existing `home-assistant-sofia.py` already covers the API but goes unused from an arbitrary cwd (needs an env var set + a cwd-relative path), so agents bypassed it and hand-rolled everything. Add two verbs covering exactly the gaps the `ha` MCP can't (entity state/control stays with the MCP): - `ha token [--instance sofia\|london]` (read): resolves the long-lived API token live from k8s secret openclaw/openclaw-secrets via the ambient kubeconfig — no pre-set env var. Composes as `curl -H "Authorization: Bearer $(homelab ha token)"`. - `ha ssh [--instance sofia\|london] -- <cmd>` (write): deterministic non-interactive ssh to the HA host using the invoking user's key. Also fix the root cause: `home-assistant-sofia.py` now falls back to `homelab ha token` when its env var is unset (works from any directory), and the home-assistant skill points agents at these verbs + `homelab metrics query` instead of hand-rolled curls. README + ADR-0012 + AGENTS.md updated per the per-verb-group convention. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 23:46:09 +00:00
Viktor Barzin	46166c63b2	fix(authentik): long-lived social-login sessions + shield auth from CrowdSec lockout All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor's passkeys all vanished and he was suddenly being asked to log in multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE each device of users[0] — but the fuzzy search returned the REAL account, so it wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and the social-login stage (default-source-authentication-login) had the provider default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h — hence the constant re-logins. (Password + passkey logins were already weeks=4.) Changes: - authentik: adopt default-source-authentication-login into Terraform (import) and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long as password/passkey. Immediate relief without re-enrolling. - authentik: document the provider-schema gotcha — authentik_stage_identification exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in ignore_changes (commit `4e882989` removed them for this reason; re-adding breaks every apply). The passkey break was purely the missing device records, not drift. - edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login — carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule, make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block), and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting). - docs: record the session-duration change, the edge enforcement + auth carve-out (previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob (CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert). Passkey re-enrollment is a manual user action (devices are gone from the DB); nothing auto-re-deletes them. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 23:40:22 +00:00
Viktor Barzin	bc2fbc712c	Merge remote-tracking branch 'origin/master' into wizard/claude-auth-renew	2026-06-20 20:10:48 +00:00
Viktor Barzin	5549fc3672	Add per-user Claude auth renewal Each workstation user needs a continuously valid Claude token under their own Enterprise identity. Store only that user's OAuth state in an isolated Vault path, renew and verify it automatically, recover from Vault when possible, and alert when interactive SSO is required.	2026-06-20 20:10:40 +00:00
Viktor Barzin	3278588325	chore(authentik): tear down obsolete tripit-enrollment (ADR-0020 superseded by ADR-0028) All checks were successful ci/woodpecker/push/default Pipeline was successful Details TripIt external users are now LOCAL TripIt accounts (ADR-0028 native passkey + Authentik OIDC), so the Authentik-side self-enrollment machinery is dead. Removes the tripit-enrollment + tripit-recovery flows and all their stages/prompts/policies/bindings, the tripit-email-stages blueprint (+yaml), and the 'TripIt External' group; reverts the admin-services-restriction fence branch that contained those users (its sole member, the leftover tripit-demo@ test account, was deleted first, so the revert affects zero live principals). Real external collaborators (type=external) are untouched. tg plan: 0 add, 1 change (the policy expression), 20 destroy (all tripit_*). Closes tripit#97; moots the B2 per-app OIDC fences. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 20:04:24 +00:00
Viktor Barzin	b58fe8cb1a	docs(k8s-upgrade): record detector Packages-probe -L fix + compat-gate patch scope All checks were successful ci/woodpecker/push/default Pipeline was successful Details Two corrections to the runbook matching today's code fixes: - The next-minor patch probe (GET .../Packages) also needs `-L`; it lacked it until 2026-06-20 and silently no-op'd the 2026-06-19 nightly run. Both probes now follow the 302. - The compat gate's addon check is scoped to minor jumps — patches within the running minor are never addon-blocked (target_minor <= running_minor returns early), so a conservative ceiling like ESO 0.12 -> 1.31 no longer false-blocks a 1.34.x patch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:16:20 +00:00
Viktor Barzin	3e3fdb34f0	homelab: v0.6.0 — usage telemetry (usage top), evidence-driven verb prioritization Some checks are pending Build infra CLI / build (push) Waiting to run Details ci/woodpecker/push/default Pipeline was successful Details Answers the question that drove the whole CLI — which verbs to add next — with data instead of one maintainer's habits, and resolves the cross-user-usage ask in-bounds (no reading anyone's home). - emit on dispatch: every verb fire-and-forgets one Loki line {job,user,verb} + "exit=N ver=X". ONLY the verb path + exit code — never args, paths, flags, or secrets (the emit never sees arguments). Best-effort: 800ms timeout, errors swallowed, never affects the command; opt-out HOMELAB_TELEMETRY=0. Discovery verbs (manifest/version/help) and usage itself don't self-record. - usage top [--since 30d] [--user U] [--json]: ranks verbs via sum by (verb)(count_over_time({job="homelab-usage"}[…])) against the shared Loki. Cross-user analytics WITHOUT touching ~/.claude — the privacy-preserving answer to "what does the team use". - Loki sink (zero new infra, dogfoods v0.5 logs path); push verified HTTP 204 no auth. ADR docs/adr/0011. Live-verified: ran 4 verbs, usage top ranked them correctly (metrics query=2). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 22:29:01 +00:00
viktor	78095aa273	docs(forgejo): runbook reflects Authentik disabled + zero-click GitHub All checks were successful ci/woodpecker/push/default Pipeline was successful Details Authentik OAuth2 source is now disabled (login_source.is_active=0) and GitHub auto-registration (zero-click sign-up) is on. Document why (global auto-reg + Authentik's email-as-username 500; Forgejo/Authentik email mismatch blocks account-linking) and how to re-enable Authentik later. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:37:46 +00:00
Viktor Barzin	a5bb4db9c5	crowdsec: register the Traefik bouncer with LAPI (fix fail-open) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The Traefik bouncer plugin's API key was never registered with LAPI — the crowdsec stack reads many keys from Vault but not ingress_crowdsec_api_key, and the chart registers no bouncer. So LAPI returned 403 to the plugin, which with updateMaxFailure=-1 failed open and enforced NOTHING: no community-blocklist bans, and the (now-Turnstile-wired) captcha never fired. cscli bouncers list was empty; the registration was likely lost in the MySQL->PostgreSQL DB migration with no IaC to recreate it. Seed the bouncer at LAPI startup via BOUNCER_KEY_traefik, valued from the same Vault key the middleware presents — so they match by construction, and the bouncer re-registers automatically on every LAPI start (survives DB wipes). - stacks/crowdsec/main.tf: read ingress_crowdsec_api_key, pass to module. - module main.tf: new sensitive var + thread into the values templatefile. - values.yaml: BOUNCER_KEY_traefik on lapi.env. - docs/architecture/security.md: document registration + fail-open history and the proxied-app coverage caveat. Activates enforcement (community blocklist bans + captcha) on non-proxied apps; internal IPs stay bypassed (clientTrustedIPs), fail-open-on-LAPI-down preserved. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:08:28 +00:00
Viktor Barzin	4a66377425	forgejo: add "Sign in with GitHub" (OAuth2 source + auto-registration) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor wanted people to be able to sign up with GitHub, not just the native form or Authentik SSO. - Added a GitHub OAuth2 login source via `forgejo admin auth add-oauth --provider github` (name "github", matching the callback registered on the GitHub OAuth App). Like the existing Authentik source, it lives in Forgejo's DB rather than Terraform — there's no clean TF resource for login sources. Client id/secret mirrored to Vault secret/viktor (forgejo_github_oauth_client_id / _secret) for recovery. - This commit's TF change: ENABLE_AUTO_REGISTRATION=true in [oauth2_client], so a first GitHub sign-in creates the account directly ("sign up with GitHub") instead of a link-to-existing detour. The GitHub identity is the trust gate for this path; Turnstile + email confirmation still gate the native form. Verified: GitHub recognises the client id, Forgejo's /user/oauth2/github redirects to GitHub's authorize URL with the correct client id + callback, and the login page renders the button. Final browser click-through is the user's to do. Runbook updated: docs/runbooks/forgejo-open-signups.md (GitHub section + secret-rotation + DB-loss recreate steps). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 16:41:49 +00:00
Viktor Barzin	fd0c7493c3	traefik/crowdsec: serve Cloudflare Turnstile for captcha remediation CrowdSec LAPI already issues `captcha`-type decisions for lower-severity abuse (http-429-abuse, http-403-abuse, http-crawl-non_statics, http-sensitive-files), but the Traefik bouncer plugin had no captcha provider configured — so those decisions silently fell through to a 403 ban (traced in the plugin's bouncer.go @ v1.4.2: captchaClient.Valid==false => handleBanServeHTTP). Flagged users had no way to self-unblock, contradicting the profile's stated intent. Wire Cloudflare Turnstile as the bouncer's captcha provider so a captcha decision now renders a solvable challenge instead of a hard block: - New cloudflare_turnstile_widget.crowdsec_captcha (managed mode), scoped to viktorbarzin.me so one widget covers every subdomain the bouncer fronts. Mirrors the existing Forgejo-signup Turnstile pattern; sitekey + secret are passed into the traefik module. - middleware.tf: captchaProvider=turnstile + site/secret keys + grace 1800s + captchaHTMLFilePath=/captcha/captcha.html. - Vendor the plugin's captcha.html and mount it into the Traefik container at /captcha via the chart `volumes` value — the pulled Yaegi plugin does not expose its bundled template to Traefik. - docs/architecture/security.md: document the ban-vs-captcha remediation split. - Remove the dead crowdsec-ingress-bouncer.yaml (unused nginx bouncer with placeholder reCAPTCHA keys; referenced by zero .tf). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 16:38:38 +00:00
Viktor Barzin	963e4fcdde	forgejo: open native self-signups, gated by Turnstile + email confirmation All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor wants Forgejo open for anyone to sign up, but without bot/spam account floods. Flip the deployment from OAuth-only registration (ALLOW_ONLY_EXTERNAL_REGISTRATION=true) to allowing native local sign-up, and add two bot gates on the registration form: - Cloudflare Turnstile captcha (CAPTCHA_TYPE=cfturnstile). The widget is managed in Terraform (turnstile.tf) via the CF Global API key, so the sitekey/secret are IaC, not a dashboard artifact. - Mandatory email confirmation (REGISTER_EMAIL_CONFIRM=true). Wire the Forgejo mailer to the cluster mailserver as noreply@viktorbarzin.me (mail.viktorbarzin.me:587 STARTTLS), reusing the same Vault-sourced credential Authentik uses (email-secret.tf ESO -> secret/authentik smtp_password). Existing Authentik OAuth2 login is unchanged (additive). Deployment env appended (not inserted) so the diff stays purely additive; a reloader annotation rolls the pod on secret rotation. Verified live: signup page renders the Turnstile widget, mailer delivers a test message end-to-end, Forgejo healthy, plan-to-zero after apply. Runbook: docs/runbooks/forgejo-open-signups.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 16:05:07 +00:00
Viktor Barzin	21dbd79ae4	Merge remote-tracking branch 'origin/master' into wizard/homelab-obs Some checks are pending Build infra CLI / build (push) Waiting to run Details ci/woodpecker/push/default Pipeline was successful Details	2026-06-19 11:27:44 +00:00
Viktor Barzin	e91e1612dd	homelab: v0.5.0 — net/dns/metrics/logs probes (endpoint resolution) The remaining verbs that pass the "saves reasoning, not just typing" test the user posed mid-session: each encodes the non-obvious which-endpoint-reached-how resolution otherwise re-derived every time. (Same test deprioritized node-ssh and secret-get aliasing — thin wrappers over commands already known.) - net check <host> [path]: two-legged reachability — external (public DNS→CF) vs internal (Traefik LB) — so you see WHERE a break is, not just that one path works. (live: surfaced the LB at 6ms vs CF 77ms.) - dns lookup <name> [type]: Technitium (10.0.20.201) vs public (1.1.1.1) diff. - metrics query "<promql>" / metrics alerts: Prometheus via the LB (prometheus-query.viktorbarzin.lan); alerts uses the synthetic ALERTS series since the query frontend has no /api/v1/alerts and Alertmanager has no ingress. - logs query "<logql>" [--since 1h] [--limit N]: Loki range query via the LB. All reach auth-free internal ingresses through the LB (Go form of curl --resolve host:443:10.0.20.203) — no port-forward, no kubectl. In-cluster- only endpoints (Alertmanager v2) deliberately out of scope. Verified live before building; all five smoke-tested green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 11:27:31 +00:00
Viktor Barzin	6cb823e431	k8s-version-upgrade: complete autonomy P0 — blocked alert + deeper postflight + runbook Some checks failed ci/woodpecker/push/default Pipeline was canceled Details Builds on the compat gate (prev commit) to finish "auto-upgrade when safe, halt + alert when not": - monitoring: K8sUpgradeBlocked alert (k8s_upgrade_blocked==1, for 10m, warning) in the Upgrade Gates group — the clean "a k8s auto-upgrade was refused, see Slack for why" signal. (Until monitoring is applied, a block still surfaces via the already-live K8sUpgradeChainJobFailed.) - upgrade-step.sh phase_postflight: deeper post-upgrade smoke tests — apiserver /readyz + /livez, in-cluster DNS (resolve kubernetes.default), and core kube-system pods (apiserver/controller-manager/scheduler/etcd/coredns) Running. Any failure halts + alerts (exit 1; no rollback — kubeadm can't downgrade). Catches a "pods look Running but cluster is broken" upgrade. - runbook: documents the compat gate, the blocked alert, how to clear a block, matrix maintenance, and the detector minor-probe fix. After deploy, the nightly chain detects 1.35 (minor detection now works) and correctly BLOCKS on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), alerting via K8sUpgradeBlocked — the autonomy working as designed until the catch-up clears those addons.	2026-06-19 11:27:17 +00:00
Viktor Barzin	9189560ac3	homelab: v0.4.0 — ci/deploy verbs (watch what you trigger) Some checks are pending Build infra CLI / build (push) Waiting to run Details ci/woodpecker/push/default Pipeline was successful Details Adds the verb-group that kills the single biggest reasoning sink in agent sessions — watching a build/deploy to completion (proven the session that built it: hours hand-rolling Woodpecker polling + DB-schema spelunking for one CI incident). - ci status/watch: Woodpecker REST API (version-stable, not its DB schema), reached via the internal Traefik LB (dial 10.0.20.203, SNI=ci.viktorbarzin.me so the cert verifies — the Go form of the house `curl --resolve` pattern), token from WOODPECKER_TOKEN/Vault, repo id resolved from the cwd remote, with retries that ride Woodpecker's intermittent empty responses. watch matches the HEAD/given commit (avoids the post-push race) and exits non-zero on failure. - deploy wait: image-sha match THEN rollout status (rollout status alone returns success on the old ReplicaSet); kubectl-based. - work land now auto-watches CI to green on the landed commit (--no-ci-watch to skip), closing the v0.1 gap. - ci logs deferred to v0.4.1 (Woodpecker detail/log endpoints were the least reliable; status/watch use the working list endpoint). Live-verified ci status/watch against the live API. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 10:59:14 +00:00
Viktor Barzin	fd77c0dc4f	monitoring: RpiSofiaUndervoltage alerts on new brown-out, not until reboot Some checks failed ci/woodpecker/push/default Pipeline failed Details The rpi-sofia under-voltage alert keyed off the sticky firmware bit (rpi_under_voltage_occurred == 1), which latches on the first brown-out and stays 1 until the Pi reboots. With alert-on-change routing it re-paged on every boot cycle and sat firing for ~211h of the last 14d — Viktor reported "getting a few of these lately" — and it disagreed with the HA-sofia dashboard, which shows the live state and reads OK once voltage recovers. Can't just switch to the live bit: rpi_under_voltage_now never registered once in 14d (brown-outs are sub-second and fall between the 1-min textfile-collector samples), so the sticky bit is the only reliable detector. Fix: edge-trigger on a NEW latch via increase(rpi_under_voltage_occurred[1h]) > 0. Fires once per brown-out and auto-resolves ~1h later (~2h active over the same 14d instead of ~211h); counter-reset handling makes a clean reboot a no-op. Both real brown-out events in the window are still caught. Docs updated in the same commit (monitoring.md). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 08:45:39 +00:00
Viktor Barzin	077ac97df5	k8s-version-upgrade: auto-restore apiserver OIDC after control-plane bumps Some checks failed ci/woodpecker/push/default Pipeline failed Details kubeadm upgrade apply regenerates the apiserver static-pod manifest and drops the --authentication-config flag, silently breaking SSO (kubectl/kubelogin + the k8s dashboard) until someone manually re-applied the rbac stack. That manual step ran after every control-plane upgrade — the one thing keeping autonomous patch upgrades from being truly hands-off (it bit us this cycle: an earlier master bump left SSO broken until we noticed). Automate it: the rbac stack now publishes its existing OIDC restore script (the same one its null_resource runs) to a kube-system/apiserver-oidc-restore ConfigMap, and the upgrade chain's phase_master re-runs it on master right after the kubeadm upgrade — while tigera-operator is still quiesced so the flag-add apiserver restart can't crashloop it. The script is idempotent and health-gates /livez with auto-rollback; the step is non-fatal (a failure only lags SSO until the next rbac apply, it won't abort the upgrade). phase_master already self-skips when master is at target, so this only fires when master was actually upgraded. The chain SA gets a name-scoped get on that one ConfigMap. Runbook updated: the manual restore is now a documented fallback (command corrected — it needs -replace, since the null_resource trigger hash never changes). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 06:04:30 +00:00
Viktor Barzin	48b63ffa6f	homelab: add memory verb-group (v0.3.0) — direct claude-memory HTTP client Some checks failed Build infra CLI / build (push) Waiting to run Details ci/woodpecker/push/default Pipeline failed Details Lets agents search/navigate memory via the CLI, as the first step toward deprecating the memory MCP. claude-memory is a FastAPI service (the MCP is just one frontend); homelab memory is a thin Bearer-auth HTTP client over the same API, using the env the hooks already set (CLAUDE_MEMORY_API_URL/KEY). It works even when the MCP frontend is down — the recurring disconnect that took the MCP offline for this whole session. Verbs: recall (server-side semantic search), list, categories, tags, stats, secret (read); store, update, delete (write). Validated against the live API including a store→recall→delete round-trip — full data-plane parity with the MCP. The deprecation itself (rewiring the per-prompt auto-recall + auto-learn hooks to the CLI, then uninstalling the MCP) is a deliberate follow-up, sequenced after the CLI is proven in the hooks — see docs/adr/0008. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 05:56:25 +00:00
Viktor Barzin	3594485f77	homelab: v0.2.0 — docs + version for the k8s verb-group Some checks are pending Build infra CLI / build (push) Waiting to run Details ci/woodpecker/push/default Pipeline was successful Details Bump cli/VERSION to v0.2.0; document the k8s verbs (README table + resolver note), add docs/adr/0007 (resolver, read/write split, config-mutation stays raw, db dbaas pattern), and extend the AGENTS.md discovery pointer with the Kubernetes surface. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 22:30:41 +00:00
Viktor Barzin	66caa0bf7f	homelab: v0.1 docs, distribution wiring, and version Some checks are pending Build infra CLI / build (push) Waiting to run Details ci/woodpecker/push/default Pipeline was successful Details Completes v0.1: documentation, build/install path, and version stamping. - cli/VERSION (v0.1.0) stamped into the binary via ldflags. - cli/README.md rewritten as the homelab overview (verbs + tiers, manifest, build, the preserved legacy webhook use-cases). - docs/adr/0004-0006: why homelab exists (grown in place from infra/cli, not a separate repo), v0.1 scope + everything-allowed/tiers-recorded, and the work/tf behaviour (native worktree entry, verification-gated auto-land, presence-coupled apply). - setup-devvm.sh builds cli/ -> /usr/local/bin/homelab each provisioning run (t3-dispatch pattern), so every devvm user gets the current binary. - AGENTS.md: discovery pointer under Common Operations. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 19:25:51 +00:00
Viktor Barzin	70e217db24	k8s-version-upgrade: preflight skips kubeadm-plan gate when master already at target All checks were successful ci/woodpecker/push/default Pipeline was successful Details The autonomous 1.34.9 version-upgrade chain has been failing its preflight every night. A prior run left k8s-master + k8s-node1 on 1.34.9 while node2-6 stayed on 1.34.8, and preflight's gate-4 runs `kubeadm upgrade plan` on master. On an already-at-target master, kubeadm prints no "kubeadm upgrade apply vX.Y.Z" line, so the parsed target came back empty and the `!= requested` check aborted the whole chain before any worker was touched. Deterministic — it self-cleaned and re-failed identically each night, so it would have failed again tonight, leaving node2-6 stuck on the old patch. Fix: skip the kubeadm-plan-target gate when master is already on TARGET_VERSION — the same at-target self-skip that phase_master and phase_worker already do. The remaining workers are still validated by their own per-node phases, and the detector already confirmed the target is installable via apt-cache. This lets tonight's unattended chain resume and finish node2-6 -> 1.34.9. Runbook updated: node count 5 -> 7, the gate skip note, and a Past Incidents writeup (incl. the collateral apiserver OIDC wipe, restored via the rbac stack). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 09:17:46 +00:00
Viktor Barzin	c04efa3d3a	k8s-version-upgrade: move detection to nightly 23:00 UTC (overnight upgrades) Some checks failed ci/woodpecker/push/default Pipeline failed Details Disruptive node drains should run when the cluster is idle. Move the k8s-version-check detection CronJob from 12:00 UTC (noon) to 23:00 UTC (00:00 London) — overnight, low usage, and clear of the kured OS-reboot window (01:00-05:00 UTC) so the two drain pipelines never overlap. (Viktor, 2026-06-17.) - stacks/k8s-version-upgrade/main.tf: var.schedule default 0 12 → 0 23 * * *. - scripts/upgrade_state.sh: next_scheduled_run_utc now computes the 23:00 slot (was next_daily_noon_utc). - docs (runbook, architecture) + upgrade-state SKILL: schedule references updated to 23:00 UTC nightly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 18:16:32 +00:00
Viktor Barzin	ed53b34bf4	k8s-version-upgrade: dynamic worker enumeration + IP-based SSH (auto-cover all/new nodes) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The chain hardcoded master→node4→node3→node2→node1→postflight and SSHed by FQDN. It silently SKIPPED node5/node6 (added 2026-05-26) — postflight would have failed even if reachable — and node5/node6 had no .viktorbarzin.lan DNS records, so the chain couldn't SSH to them at all. Refactor (upgrade-step.sh): - Worker set + order derived live from `kubectl get nodes` (worker_nodes / next_pending_worker), so EVERY worker still off-target is upgraded and a newly-joined node is covered with zero script change. - SSH targets are node InternalIPs (ssh_target), removing the dependency on node DNS records entirely — a new node is reachable the moment it joins. - The two remaining hardcoded loops (containerd skew, apt-repo rewrite) now enumerate workers/all-nodes dynamically too. - Topology preserved: master-drain Job runs on the first worker; every worker-drain Job runs on the already-upgraded k8s-master (self-preemption invariant intact). - next_pending_worker returns 0 explicitly on the no-match path — the `while read … done < <(…)` loop exits 1 at EOF, which under set -e would abort the LAST worker's Job before it spawns postflight (cluster upgraded but no cleanup / in_flight reset). Caught in review. Docs (runbook + architecture + headers) updated to the dynamic topology. NOTE: nodes still need the k8s-upgrade SSH public key in authorized_keys; it was deployed to node4/5/6 by hand this session. Baking it into node provisioning (so new nodes get it automatically) is the remaining follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 16:56:02 +00:00
Viktor Barzin	037a609f27	k8s-version-upgrade: unblock 1.34.9 — skip kubeadm CoreDNS addon + busybox-date fix All checks were successful ci/woodpecker/push/default Pipeline was successful Details The 1.34.9 master upgrade hard-failed `kubeadm upgrade apply` preflight: CoreDNS is at v1.12.4 (Keel auto-bumped it 1.12.1 -> 1.12.4 on 2026-05-26 via a stale kube-system out-of-band annotation), and 1.12.4 is ahead of kubeadm 1.34.9's bundled corefile-migration table ("start version not supported"). - scripts/update_k8s.sh: master `kubeadm upgrade apply` now runs with `--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins --skip-phases=addon/coredns`. A dry-run proved --ignore ALONE would overwrite our custom split-horizon Corefile with kubeadm's default AND downgrade the image; --skip-phases leaves CoreDNS 100% untouched while the control plane upgrades. CoreDNS is pinned off Keel (keel.sh/policy=never) to stop the drift. - stacks/k8s-version-upgrade/scripts/upgrade-step.sh: fix the preflight quiet-baseline (settle-window) check, which silently no-op'd on the ghcr claude-agent-service image's busybox `date` (can't parse ISO8601). Now tries GNU then busybox `-D`, and warns+skips on parse failure (no silent fail-open). - docs: runbook + architecture document the CoreDNS handling. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:45:05 +00:00
Viktor Barzin	fb638cd8ec	k8s-version-upgrade: scope chain-fail alert to terminal reasons + sync docs Some checks failed ci/woodpecker/push/default Pipeline failed Details Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to the terminal job-condition reasons (BackoffLimitExceeded\|DeadlineExceeded). A phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every firing alert also halts kured, so a bare-count false-positive would block all OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics: the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0 for the terminal reasons. Docs updated to match the behaviour change (per the same-commit docs rule): - docs/runbooks/k8s-version-upgrade.md — new alert in the gates list; the "kill a stuck Job" recovery now leads with retry-on-failure self-heal. - docs/architecture/automated-upgrades.md — fourth Upgrade Gates alert; retry-on-failure note on the deterministic-naming paragraph. - .claude/skills/upgrade-state/SKILL.md — new "chain failed" status, legend entry, and drill-down (also copied to the active ~/.claude copy). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:10:18 +00:00
Viktor Barzin	8a2a3d9eca	Merge remote-tracking branch 'origin/master' into wizard/reconcile-mirror All checks were successful ci/woodpecker/push/postmortem-todos Pipeline was successful Details ci/woodpecker/push/default Pipeline was successful Details # Conflicts: # scripts/t3-provision-users.sh	2026-06-16 22:32:43 +00:00
Viktor Barzin	63e714782c	immich: remove one-shot anca-elements-import Job + its PVC All of Anca's photos are imported. The Job was declared as kubernetes_job_v1.anca_elements_import — meaning every `terragrunt apply` of the immich stack re-created it, despite the 2026-05-25 in-code comment saying "After successful completion: REMOVE this resource block + apply again." Nobody noticed for 22 days; the re-trigger today (2026-06-16) was the 6th IO-pressure incident — it scanned all 21,643 assets in pure read-scan mode for 51 min, saturated sdc, starved etcd, crash-looped kube-apiserver. Recovery actions taken before this commit: - Throttled nfsd 64→8 on PVE host to give apiserver headroom - `kubectl delete job -n immich anca-elements-import` + force-delete pod - Restored nfsd to 64; cluster healthy Code change here: - Remove `kubernetes_job_v1.anca_elements_import` block - Remove `module.nfs_anca_elements_host` (PVC `immich-anca-elements-host` — no live consumer; videos batch deferred per user, source dump remains on PVE at /srv/nfs/anca-elements, browseable via Nextcloud admin) - Update 2026-05-25 post-mortem: 6th-incident section + new lesson that one-shot Jobs do NOT belong in kubernetes_job_v1 (use a suspended CronJob or a runbook-captured `kubectl create job` ad-hoc invocation instead).	2026-06-16 22:11:27 +00:00
Viktor Barzin	0a6ed4b2fe	workstation: per-user playwright browser MCP for all users, reproducible from git Viktor asked that the playwright browser MCP be available for every devvm user in every directory, with each user running their own server and multiple concurrent sessions per user. Before this, playwright was hand-set-up per user (~/.config/systemd/user/ playwright-mcp.service on 8931/8932/8933) and only wizard was actually wired — emo's and anca's servers ran but their ~/.claude.json had no playwright entry, so their Claude never connected. None of it was reproducible from git (units, refresh script, and the Vault snapshot token lived only in user homes), so a devvm rebuild would silently lose it. This makes it reproducible and fixes the unwired users: - roster_engine.py: sticky per-user PLAYWRIGHT_PORT (PLAYWRIGHT_BASE_PORT=8931, allocated for every roster user incl. the admin), emitted in the derive JSON. - scripts/workstation/playwright/: system-level TEMPLATE units (playwright-mcp@.service + playwright-snapshot-refresh@.{service,timer}, User=%i — system manager, so no systemd --user / linger) + the refresh script. @playwright/mcp pinned to 0.0.76 (avoids the @latest silent-fleet-roll footgun, same rationale as T3_PIN). - setup-devvm.sh: install the templates + script (9e); stage the chrome-service snapshot bearer token from Vault to a root file (8c) — the hourly root reconcile has no Vault token, mirrors the Claude OAuth staging in 8a. - t3-provision-users.sh: install_playwright() (ALL tiers incl. admin) writes PLAYWRIGHT_PORT, seeds the token if-absent, wires the user-scope ~/.claude.json by running `claude mcp add` AS the user (clobber-proof + if-absent, so it fixes existing/new/admin without rewriting a populated config), and enable --now's the instances (idempotent, never restarts a running server). Also hardened the section-1 .env scan to skip the new playwright-.env files (no T3_PORT -> grep no-match would abort under set -e -o pipefail). - Docs: chrome-service-snapshot runbook (new Provisioning section + system-unit commands), multi-tenancy.md, and the 2026-06-07 plan Task 2.3. Supersedes the hand-made per-user --user units (one-time idle-gated migration to follow on the live host). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:33:47 +00:00
Viktor Barzin	cdd9ecd199	t3: docs for the gated nightly tracker (runbook, post-mortem, service-catalog) All checks were successful ci/woodpecker/push/postmortem-todos Pipeline was successful Details ci/woodpecker/push/default Pipeline was successful Details Phase 4 docs for the enforcer -> gated-tracker change: - runbook t3-version-bump.md: rewritten around the tracker — how each bump is gated, plus freeze/revert/pin/dry-run/manual-rollback ops. - post-mortem 2026-06-09: append the deliberate 2026-06-16 reversal and how the gates close each named root-cause/lesson (historical sections left intact). - service-catalog t3 row: "PINNED 0.0.24 enforcer" -> gated nightly tracker; replace the stale "auto-pair 401-broken on 0.0.26" note (re-verified healthy 2026-06-16, cookieless -> 302 + t3_session). - t3-provision-users.sh step 5b comment: enforcer -> tracker; note Persistent dropped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 11:33:49 +00:00
Emil Barzin	1ba453c65d	fan-control docs: sync runbook/env/service/design to the HA-actuator + anti-flap model All checks were successful ci/woodpecker/push/default Pipeline was successful Details The committed docs still described the 2026-06-04 presence-aware daemon. Bring them in line with what is actually deployed: HA computes the setpoint, the host is a thin actuator (COMMAND_ENTITY/STALE_SECS/HA_GRACE_SECS), additive bias, anti-flap hold-last, and the new HA readout sensors (command/equilibrium/ cpu_load/fan_speed_avg/fan_power_avg). Earlier doc edits were made in a clone lost in the workstation reshuffle; re-created here. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 08:11:48 +00:00
Viktor Barzin	cbca281aaa	feat(authentik): TripIt external self-signup group + forward-auth fence (ADR-0020) Viktor wants people outside the homelab to self-register to TripIt with email + a passkey (no password), kept separate from the rest of the homelab. Adds the empty, parentless 'TripIt External' Authentik group and a first-position branch in the catch-all policy that admits those users to tripit.viktorbarzin.me only and denies every other forward-auth host. Inert on apply (group empty => matches no existing user => no lockout). An adversarial review found the fence is forward-auth-only, so the runbook records the OIDC-app containment audit (every sensitive app already requires a trusted group External users won't hold), the Vault->Allow Login Users binding that closes the one open OIDC app, the SMTP prerequisite for email verification, and the before/after access-matrix verification. Flows/SMTP/Vault binding are UI steps per the runbook; the push that applies the catch-all edit must be human-watched (CI auto-applies the authentik stack). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 21:48:04 +00:00
Viktor Barzin	cf51cb45de	docs(adr-0003): keep Forgejo canonical, complete the GitHub mirror (reject swap) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Grilled the 'swap Forgejo for GitHub' idea. Root cause of the divergence pain is an incomplete push-mirror rollout (14 repos dual-pushed, push_mirrors=0), not Forgejo itself — and CONTEXT.md already documents Forgejo-canonical + one-way GitHub mirror. Decision: don't swap; finish the mirror, name the GitHub-first exceptions, reconcile infra, enforce one-remote-per-clone. Adds ADR-0003 + the GitHub-first repo glossary term + dual-push/force-overwrite warnings on Canonical repo / GitHub mirror. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-15 21:32:28 +00:00
Viktor Barzin	92c5b24975	docs: ghcr_pull_token is now a scoped read:packages PAT, not the admin alias Some checks failed ci/woodpecker/push/default Pipeline was canceled Details Minted a dedicated classic GitHub PAT scoped to read:packages and stored it in Vault secret/viktor/ghcr_pull_token (2026-06-15), replacing the previous alias of the broad admin github_pat. Propagated via targeted apply of module.kyverno.kubernetes_secret.ghcr_credentials (Kyverno re-syncs the allowlisted namespaces). Document the new cred + the manual rotation recipe. Closes: code-h2il Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-15 20:19:17 +00:00
Viktor Barzin	ef555c7e02	workstation: put ~/.local/bin on PATH so the launcher finds native claude All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor hit "~/.local/bin is not part of the PATH". Root cause: the native claude binary lives in ~/.local/bin, but the terminal launcher (start-claude.sh) runs in tmux's NON-login bash env, which doesn't source the user's shell rc where the native installer put ~/.local/bin on PATH. So `command -v claude` failed there → the launcher's bootstrap re-ran the native installer → the installer printed the PATH warning. (Interactive zsh already had ~/.local/bin via the per-user installer rc edit, and t3-serve sets PATH in its unit — so only the terminal launcher was affected.) - skel/start-claude.sh: prepend ~/.local/bin to PATH near the top (guarded/idempotent), before the launch logic — so `claude` is found, no reinstall, no warning. - setup-devvm.sh: install /etc/profile.d/10-local-bin.sh — adds ~/.local/bin to PATH for all LOGIN shells machine-wide (SSH etc.), independent of the per-user installer rc edit (fresh-user-safe). zsh login picks it up via /etc/zsh/zprofile -> /etc/profile. - docs/architecture/multi-tenancy.md: documented the three PATH-injection points. Verified: guard adds-when-missing / no-dup-when-present; all scripts pass bash -n. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:20:03 +00:00
Viktor Barzin	eecd78233b	workstation: standardize on the native claude install (drop npm-global + npx) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Question from Viktor: should claude run via the binary or npx? Answer: the native install is the recommended runtime (self-contained, self-updating ~/.local/bin/claude; installMethod=native) — and every existing user had already auto-migrated to it, leaving the npm-global copy empty and the npx fallback dead. "Leave only the recommended setup": - setup-devvm.sh: node is now installed ONLY for the t3 CLI; dropped the machine-wide `npm install -g @anthropic-ai/claude-code` (npm/npx is not the recommended runtime and just shadowed the per-user native installs). - t3-provision-users.sh: new per-user `install_user_claude_native` (runs the official https://claude.ai/install.sh AS the user, idempotent/skip-if-present) — provisions native claude for BOTH the terminal launcher and each t3-serve instance, replacing the npm bootstrap. - skel/start-claude.sh: launcher runs the native `claude` only; if missing it bootstraps via the native installer (was an `npx @anthropic-ai/claude-code` fallback). - docs/architecture/multi-tenancy.md: documented the native-only runtime model. node stays (the pinned t3 CLI is npm-global). Verified: native installer reachable + produces ~/.local/bin/claude 2.1.177; all three scripts pass bash -n + shellcheck. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:12:05 +00:00
Viktor Barzin	bb3f5f2329	workstation: stop the Claude Code onboarding wizard reappearing for terminal users All checks were successful ci/woodpecker/push/default Pipeline was successful Details emo reported being "logged out" on terminal.viktorbarzin.me: every new shell dropped him at the first-run "Choose the text style" wizard, even though he'd used many sessions and is in fact fully authenticated. Root cause is NOT a logout — ~/.claude.json is a single file that all of a user's concurrent claude processes (the ttyd terminal + their t3-serve instance + agent sessions) read-modify-write, and a stale writer periodically drops top-level keys, including hasCompletedOnboarding. That bounces the next interactive session back to onboarding; credentials are safe in the separate ~/.claude/.credentials.json (which is why T3 kept working). wizard's own ~/.claude.json showed the same key loss, so this hits any heavy multi-session user. Fix: - skel/start-claude.sh: ensure_onboarding() idempotently re-asserts hasCompletedOnboarding (+ lastOnboardingVersion) in ~/.claude.json right before launching claude. Merge-only (never clobbers other keys), runs as the user, and no-ops if jq is missing or the file is empty/corrupt. So even if the race drops the flag, the next launch restores it before claude reads it. - t3-provision-users.sh: deploy_user_launcher() re-copies skel/start-claude.sh into every non-admin home (copy-if-changed) on the hourly reconcile. /etc/skel only seeds the launcher at account creation, so without this the fix (and any future launcher edit) would never reach existing users. .tmux.conf is deliberately not re-copied — terminal-lobby appends a managed section to it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:37:59 +00:00

1 2 3 4 5 ...

378 commits