Commit graph

4319 commits

Author SHA1 Message Date
Viktor Barzin
0a6ed4b2fe workstation: per-user playwright browser MCP for all users, reproducible from git
Viktor asked that the playwright browser MCP be available for every devvm user
in every directory, with each user running their own server and multiple
concurrent sessions per user.

Before this, playwright was hand-set-up per user (~/.config/systemd/user/
playwright-mcp.service on 8931/8932/8933) and only wizard was actually wired —
emo's and anca's servers ran but their ~/.claude.json had no playwright entry,
so their Claude never connected. None of it was reproducible from git (units,
refresh script, and the Vault snapshot token lived only in user homes), so a
devvm rebuild would silently lose it.

This makes it reproducible and fixes the unwired users:

- roster_engine.py: sticky per-user PLAYWRIGHT_PORT (PLAYWRIGHT_BASE_PORT=8931,
  allocated for every roster user incl. the admin), emitted in the derive JSON.
- scripts/workstation/playwright/: system-level TEMPLATE units
  (playwright-mcp@.service + playwright-snapshot-refresh@.{service,timer},
  User=%i — system manager, so no systemd --user / linger) + the refresh script.
  @playwright/mcp pinned to 0.0.76 (avoids the @latest silent-fleet-roll
  footgun, same rationale as T3_PIN).
- setup-devvm.sh: install the templates + script (9e); stage the chrome-service
  snapshot bearer token from Vault to a root file (8c) — the hourly root
  reconcile has no Vault token, mirrors the Claude OAuth staging in 8a.
- t3-provision-users.sh: install_playwright() (ALL tiers incl. admin) writes
  PLAYWRIGHT_PORT, seeds the token if-absent, wires the user-scope ~/.claude.json
  by running `claude mcp add` AS the user (clobber-proof + if-absent, so it fixes
  existing/new/admin without rewriting a populated config), and runs `enable --now` on the
  instances (idempotent, never restarts a running server). Also hardened the
  section-1 *.env scan to skip the new playwright-*.env files (no T3_PORT -> grep
  no-match would abort under set -e -o pipefail).
- Docs: chrome-service-snapshot runbook (new Provisioning section + system-unit
  commands), multi-tenancy.md, and the 2026-06-07 plan Task 2.3.
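
A minimal sketch of the sticky per-user port allocation described in the roster_engine.py bullet above. The function name, the `existing` map and the choice to keep ports of users no longer in the roster are assumptions for illustration, not the engine's actual code; only PLAYWRIGHT_BASE_PORT=8931 comes from this commit.

```python
from typing import Dict, Iterable

PLAYWRIGHT_BASE_PORT = 8931  # base port named in the commit message


def allocate_ports(users: Iterable[str], existing: Dict[str, int]) -> Dict[str, int]:
    """Return a user -> port map, keeping any port a user already holds.

    `existing` is a hypothetical record of earlier allocations; "sticky"
    here means a user's port never changes once it has been assigned.
    """
    ports = dict(existing)          # never reassign an existing allocation
    used = set(ports.values())
    next_port = PLAYWRIGHT_BASE_PORT
    for user in users:
        if user in ports:
            continue                # sticky: the user keeps the old port
        while next_port in used:    # skip ports that are already taken
            next_port += 1
        ports[user] = next_port
        used.add(next_port)
    return ports
```

Feeding the previous result back in as `existing` is what keeps the allocation stable when the roster order changes or a user is added.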

Supersedes the hand-made per-user --user units (one-time idle-gated migration to
follow on the live host).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:33:47 +00:00
github-actions[bot]
eb47eb1d10 priority-pass: bump image_tag to 63e118c3 [ci skip]
Auto-committed by ViktorBarzin/priority-pass GHA on push to main.
Source: 63e118c334
2026-06-16 17:45:33 +00:00
github-actions[bot]
d1f2e50736 priority-pass: bump image_tag to 4ce9e8e8 [ci skip]
Auto-committed by ViktorBarzin/priority-pass GHA on push to main.
Source: 4ce9e8e894
2026-06-16 17:44:40 +00:00
github-actions[bot]
46b5f04f67 priority-pass: bump image_tag to 63e118c3 [ci skip]
Auto-committed by ViktorBarzin/priority-pass GHA on push to main.
Source: 63e118c334
2026-06-16 17:20:08 +00:00
github-actions[bot]
29ad200026 priority-pass: bump image_tag to 4ce9e8e8 [ci skip]
Auto-committed by ViktorBarzin/priority-pass GHA on push to main.
Source: 4ce9e8e894
2026-06-16 17:19:55 +00:00
Viktor Barzin
57d45d8d8f fix(authentik): pin Vault binding UUIDs as literals (provider has no authentik_application data source)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
CI pipeline 198 failed: the pinned goauthentik/authentik provider has no data "authentik_application" source, so terraform failed the whole authentik plan and applied NOTHING (state unchanged). Replace the data-source lookups with the live pbm_uuid (Vault app) and group_uuid (Allow Login Users) as literals; authentik_policy_binding is supported (used in guest.tf).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 22:01:29 +00:00
Viktor Barzin
aa461b95bc feat(authentik): bind Vault OIDC app to Allow Login Users (close ADR-0020 OIDC gap)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Audit found the Vault Authentik application had no authorization binding, so any authenticated identity (incl. a future self-enrolled TripIt External user) could complete Vault OIDC login and get a built-in default-policy token. Bind it to 'Allow Login Users' — existing homelab users inherit that group via its children (verified User.all_groups() includes the parent), parentless TripIt External users are excluded. Closes the only OIDC app the forward-auth fence does not cover.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 21:48:04 +00:00
Viktor Barzin
cbca281aaa feat(authentik): TripIt external self-signup group + forward-auth fence (ADR-0020)
Viktor wants people outside the homelab to self-register to TripIt with email + a passkey (no password), kept separate from the rest of the homelab. Adds the empty, parentless 'TripIt External' Authentik group and a first-position branch in the catch-all policy that admits those users to tripit.viktorbarzin.me only and denies every other forward-auth host. Inert on apply (group empty => matches no existing user => no lockout). An adversarial review found the fence is forward-auth-only, so the runbook records the OIDC-app containment audit (every sensitive app already requires a trusted group External users won't hold), the Vault->Allow Login Users binding that closes the one open OIDC app, the SMTP prerequisite for email verification, and the before/after access-matrix verification. Flows/SMTP/Vault binding are UI steps per the runbook; the push that applies the catch-all edit must be human-watched (CI auto-applies the authentik stack).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 21:48:04 +00:00
Viktor Barzin
cf51cb45de docs(adr-0003): keep Forgejo canonical, complete the GitHub mirror (reject swap)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Grilled the 'swap Forgejo for GitHub' idea. Root cause of the divergence pain
is an incomplete push-mirror rollout (14 repos dual-pushed, push_mirrors=0),
not Forgejo itself — and CONTEXT.md already documents Forgejo-canonical +
one-way GitHub mirror. Decision: don't swap; finish the mirror, name the
GitHub-first exceptions, reconcile infra, enforce one-remote-per-clone. Adds
ADR-0003 + the GitHub-first repo glossary term + dual-push/force-overwrite
warnings on Canonical repo / GitHub mirror.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-15 21:32:28 +00:00
Viktor Barzin
5d3a166b94 t3-afk: fix agent Bash — stop mounting into ~/.claude
Some checks failed
ci/woodpecker/push/default Pipeline failed
Root cause of "the agent never commits": the issue-implementer CLAUDE.md was
subPath-mounted at /home/node/.claude/CLAUDE.md, which made /home/node/.claude
root-owned. The agent (uid 1000) then couldn't create its Bash session-env
there, so EVERY Bash/git call failed (Write/Edit worked, so it silently edited
but never committed). Found by reading the agent transcripts from
state.sqlite -> projection_thread_messages.

Fix: don't mount anything into ~/.claude (it's not honored by T3's SDK anyway).
Behaviour is injected via the dispatch message preamble by the control plane;
files/issue-implementer-CLAUDE.md kept as the canonical source text.

Verified post-fix: a preamble-dispatched task edited README and COMMITTED
(073ab28) unattended.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 20:49:34 +00:00
Viktor Barzin
34c30ac2bf t3-afk: auto-pair dispatcher sidecar — no manual pairing
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The bare `t3 serve` behind Authentik showed the manual /pair#token screen, which
didn't connect. Mirror the devvm t3-dispatch: a small stdlib-Node sidecar fronts
t3 serve, and on a cookieless (already Authentik-gated) document load it mints a
pairing credential (`t3 auth pairing create`) and exchanges it at
/api/auth/browser-session for the t3_session cookie, then 302s back. Everything
else — including WebSocket upgrades for the live cockpit — reverse-proxies to
:3773. The Service now targets the sidecar (:8080).

Verified: cookieless GET -> 302 + Set-Cookie t3_session; cookied GET -> 200 SPA.
Matches the t3.viktorbarzin.me experience (Authentik login -> straight into the
cockpit).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 20:19:39 +00:00
Viktor Barzin
92c5b24975 docs: ghcr_pull_token is now a scoped read:packages PAT, not the admin alias
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Minted a dedicated classic GitHub PAT scoped to read:packages and stored it in
Vault secret/viktor/ghcr_pull_token (2026-06-15), replacing the previous alias
of the broad admin github_pat. Propagated via targeted apply of
module.kyverno.kubernetes_secret.ghcr_credentials (Kyverno re-syncs the
allowlisted namespaces). Document the new cred + the manual rotation recipe.

Closes: code-h2il

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-15 20:19:17 +00:00
Viktor Barzin
ef555c7e02 workstation: put ~/.local/bin on PATH so the launcher finds native claude
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor hit "~/.local/bin is not part of the PATH". Root cause: the native claude
binary lives in ~/.local/bin, but the terminal launcher (start-claude.sh) runs in
tmux's NON-login bash env, which doesn't source the user's shell rc where the native
installer put ~/.local/bin on PATH. So `command -v claude` failed there → the
launcher's bootstrap re-ran the native installer → the installer printed the PATH
warning. (Interactive zsh already had ~/.local/bin via the per-user installer rc edit,
and t3-serve sets PATH in its unit — so only the terminal launcher was affected.)

- skel/start-claude.sh: prepend ~/.local/bin to PATH near the top (guarded/idempotent),
  before the launch logic — so `claude` is found, no reinstall, no warning.
- setup-devvm.sh: install /etc/profile.d/10-local-bin.sh — adds ~/.local/bin to PATH for
  all LOGIN shells machine-wide (SSH etc.), independent of the per-user installer rc edit
  (fresh-user-safe). zsh login picks it up via /etc/zsh/zprofile -> /etc/profile.
- docs/architecture/multi-tenancy.md: documented the three PATH-injection points.
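
The guarded/idempotent prepend can be sketched as below. This is an illustration in Python of the adds-when-missing / no-dup-when-present behaviour; the real guard lives in the bash launcher, and the function name and example paths are hypothetical.

```python
def prepend_path(path_value: str, entry: str) -> str:
    """Prepend `entry` to a colon-separated PATH unless it is already present."""
    parts = [p for p in path_value.split(":") if p]
    if entry in parts:
        return path_value           # already on PATH: leave it untouched
    return ":".join([entry] + parts)
```

Calling it twice with the same entry returns the same string as calling it once, which is the property the launcher relies on when it runs on every start.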

Verified: guard adds-when-missing / no-dup-when-present; all scripts pass bash -n.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:20:03 +00:00
Viktor Barzin
eecd78233b workstation: standardize on the native claude install (drop npm-global + npx)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Question from Viktor: should claude run via the binary or npx? Answer: the native
install is the recommended runtime (self-contained, self-updating ~/.local/bin/claude;
installMethod=native) — and every existing user had already auto-migrated to it, leaving
the npm-global copy empty and the npx fallback dead. "Leave only the recommended setup":

- setup-devvm.sh: node is now installed ONLY for the t3 CLI; dropped the machine-wide
  `npm install -g @anthropic-ai/claude-code` (npm/npx is not the recommended runtime and
  just shadowed the per-user native installs).
- t3-provision-users.sh: new per-user `install_user_claude_native` (runs the official
  https://claude.ai/install.sh AS the user, idempotent/skip-if-present) — provisions native
  claude for BOTH the terminal launcher and each t3-serve instance, replacing the npm bootstrap.
- skel/start-claude.sh: launcher runs the native `claude` only; if missing it bootstraps via
  the native installer (was an `npx @anthropic-ai/claude-code` fallback).
- docs/architecture/multi-tenancy.md: documented the native-only runtime model.

node stays (the pinned t3 CLI is npm-global). Verified: native installer reachable +
produces ~/.local/bin/claude 2.1.177; all three scripts pass bash -n + shellcheck.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:12:05 +00:00
Viktor Barzin
4a48f065e9 mcp: drop project-scoped paperless from .mcp.json (paperless is now wizard-only)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Paperless is a personal tool for wizard, not shared. It was project-scoped in the
infra repo's .mcp.json (the in-cluster paperless-mcp proxy), so every user whose
~/code IS an infra clone (emo, ancamilea) auto-loaded it. Per request, paperless
should be wizard-only: wizard now runs his own direct, token-based paperless MCP in
his user-scope config (a local barryw/paperlessmcp container -> paperless-ngx).
Removing the shared entry so emo and other infra-clone users no longer get it; the
`ha` MCP stays project-scoped. emo's clone drops it on next freshen.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:03:37 +00:00
Viktor Barzin
bb3f5f2329 workstation: stop the Claude Code onboarding wizard reappearing for terminal users
All checks were successful
ci/woodpecker/push/default Pipeline was successful
emo reported being "logged out" on terminal.viktorbarzin.me: every new shell
dropped him at the first-run "Choose the text style" wizard, even though he'd
used many sessions and is in fact fully authenticated. Root cause is NOT a
logout — ~/.claude.json is a single file that all of a user's concurrent claude
processes (the ttyd terminal + their t3-serve instance + agent sessions)
read-modify-write, and a stale writer periodically drops top-level keys,
including hasCompletedOnboarding. That bounces the next interactive session back
to onboarding; credentials are safe in the separate ~/.claude/.credentials.json
(which is why T3 kept working). wizard's own ~/.claude.json showed the same key
loss, so this hits any heavy multi-session user.

Fix:
- skel/start-claude.sh: ensure_onboarding() idempotently re-asserts
  hasCompletedOnboarding (+ lastOnboardingVersion) in ~/.claude.json right before
  launching claude. Merge-only (never clobbers other keys), runs as the user, and
  no-ops if jq is missing or the file is empty/corrupt. So even if the race drops
  the flag, the next launch restores it before claude reads it.
- t3-provision-users.sh: deploy_user_launcher() re-copies skel/start-claude.sh
  into every non-admin home (copy-if-changed) on the hourly reconcile. /etc/skel
  only seeds the launcher at account creation, so without this the fix (and any
  future launcher edit) would never reach existing users. .tmux.conf is
  deliberately not re-copied — terminal-lobby appends a managed section to it.
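
A minimal sketch of the merge-only re-assertion that ensure_onboarding() performs, written in Python for illustration (the launcher itself is bash + jq). The function names and the `version` argument are assumptions; the value written to lastOnboardingVersion is not stated in this commit.

```python
import json
from pathlib import Path
from typing import Any, Dict


def merge_onboarding(data: Dict[str, Any], version: str) -> Dict[str, Any]:
    """Return a copy of `data` with the onboarding keys asserted.

    Merge-only: every other top-level key is carried over unchanged.
    """
    merged = dict(data)
    merged["hasCompletedOnboarding"] = True
    merged["lastOnboardingVersion"] = version  # hypothetical value source
    return merged


def ensure_onboarding(config_path: Path, version: str) -> bool:
    """Re-assert the keys in the file; return True only if it was rewritten."""
    try:
        data = json.loads(config_path.read_text())
    except (OSError, ValueError):
        return False                # missing, empty or corrupt: no-op
    if not isinstance(data, dict):
        return False
    merged = merge_onboarding(data, version)
    if merged == data:
        return False                # already correct: do not touch the file
    config_path.write_text(json.dumps(merged, indent=2))
    return True
```

Like the fix it illustrates, this does not remove the read-modify-write race between concurrent writers; it only restores the flag before the next interactive launch reads the file.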

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:37:59 +00:00
Viktor Barzin
82a0c5aedf t3-afk: fix crashloop — exclude from Keel at the deployment level
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Keel "patch"-downgraded the image docker.io/library/node:24 -> library/node:24.0.2,
which is below t3@0.0.27's required node >=24.10, so `t3 serve` exited silently and
the pod crash-looped (~160 restarts / 13h).

Root cause: keel.sh/policy=never was on the POD-TEMPLATE labels, but Keel reads the
policy at the DEPLOYMENT level. The cluster's Kyverno inject-keel-annotations is
opt-out, so it stamped policy=patch and Keel acted on it.

Fix: set keel.sh/policy=never as a deployment-level annotation; ignore_changes the
Kyverno-injected keel.sh/pollSchedule + keel.sh/trigger annotations; the image stays
TF-owned (apply reverted Keel's downgrade). Pod now 1/1, t3 serve 200.
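
The version arithmetic behind the crashloop, as a sketch. It assumes the floating tag `24` is read as 24.0.0, which would make 24.0.2 look like a patch bump; that reading of Keel's behaviour is a guess for illustration, not something this commit confirms. The helper names are hypothetical.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse '24', '24.10' or '24.0.2' into a (major, minor, patch) tuple."""
    parts = [int(p) for p in text.split(".")]
    parts += [0] * (3 - len(parts))     # missing components count as 0
    return (parts[0], parts[1], parts[2])


def is_patch_bump(current: str, candidate: str) -> bool:
    """True if candidate shares major.minor with current and has a higher patch."""
    cur, new = parse_version(current), parse_version(candidate)
    return new[:2] == cur[:2] and new[2] > cur[2]


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if the installed version satisfies a '>= minimum' requirement."""
    return parse_version(installed) >= parse_version(minimum)
```

Under that assumption `is_patch_bump("24", "24.0.2")` holds while `meets_minimum("24.0.2", "24.10")` does not, which matches the observed outcome: an update that is valid under the patch policy but below what t3@0.0.27 requires.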

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 10:32:38 +00:00
Viktor Barzin
214638216b fix(anisette): wait_for_rollout=false so a slow first start can't strand the deploy out of state
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The docker.io fix created the deployment, but wait_for_rollout (default true)
then hung on the OOMing pod and the apply failed — leaving the deployment in
the cluster but NOT in terraform state, so every later apply hit
'deployments.apps "anisette" already exists'. Deleted that orphan and set
wait_for_rollout=false (mirrors tts/llama-cpp slow-start services); readiness
probe still gates Service traffic.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:56:30 +00:00
Viktor Barzin
d8c60d7ab8 t3-afk: dedicated in-cluster T3 Code instance (AFK executor + cockpit)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Slice #2 of claude-agent-service PRD #1 (AFK implementation pipeline). Dedicated
in-cluster T3 Code instance the control plane dispatches issues into; runs the
issue-implementer agent in a git worktree with a live cockpit. Applied + live
2026-06-14 (9 resources).

Pilot-fast: stock docker.io/library/node:24 + install pinned t3@0.0.27 + Claude
CLI at startup onto an SSD-NFS PVC. Authentik-gated ingress. issue-implementer
behaviour ships as a user-level ~/.claude/CLAUDE.md (T3 hardcodes the system
prompt; settingSources loads it) and forbids plan-mode/clarifying-questions so
unattended threads don't stall. Keel-excluded (ADR 0003). wait_for_rollout=false
(slow first start). Image fully-qualified for the Kyverno trusted-registries
allowlist; container mem limit 4Gi (tier-aux LimitRange cap).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:06:33 +00:00
Viktor Barzin
bc7b28244f fix(anisette): raise memory limit to 512Mi — 128Mi OOMKilled at startup
Some checks failed
ci/woodpecker/push/default Pipeline failed
The pod CrashLooped with OOMKilled (exit 137): anisette downloads and
initializes Apple's CoreADI provisioning library on startup, spiking past the
128Mi limit before it can bind :6969 (empty logs, liveness 'connection
refused'). Bump request 256Mi / limit 512Mi; steady state is much lower.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:54:13 +00:00
Viktor Barzin
96addf65b4 fix(anisette): docker.io/ image prefix to pass Kyverno require-trusted-registries
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
First apply was denied at admission — a bare dadoum/anisette-v3-server@sha256
ref isn't in the trusted-registries allowlist (only enumerated DockerHub
user-repo prefixes are). docker.io/* IS allowlisted, so use the explicit
registry prefix; still pulls via the 10.0.20.10 pull-through cache.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:47:05 +00:00
Viktor Barzin
0bfa6f0774 feat(anisette): self-hosted Apple anisette server for SideStore (infra #40)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Deploy a small stateless anisette-data server so the TripIt iOS Shell can be
sideloaded with SideStore using a free Apple ID, without brokering the
Apple-ID auth dance through a public third-party anisette server (which would
see every login). SideStore points at a stable internal endpoint we control.

- Image: Dadoum/anisette-v3-server, the de-facto standard anisette-v3 server
  for SideStore/AltStore. Upstream ships only a mutable :latest (no GitHub
  releases / semver / sha tags), so pinned by manifest digest instead of a tag
  per the "never :latest" rule. Pulled from DockerHub via the registry-VM
  pull-through cache like echo/cyberchef. Diun watches :latest (notify-only) so
  a new upstream build prompts a digest re-pin.
- Stateless: emptyDir backs the provisioning-library cache dir (regenerable
  download; upstream issue #23 means it doesn't preserve client auth across
  restarts anyway) — no PVC, no Vault secret.
- Internal-only endpoint http://anisette.viktorbarzin.lan (auth=none,
  allow_local_access_only, ssl_redirect off) — SideStore is a native client
  that can't do the Authentik cookie dance, same reasoning as android-emulator's
  adb. The .lan CNAME is auto-created by technitium-ingress-dns-sync; never
  publicly exposed.

Mirrors the echo/networking-toolbox/android-emulator stack pattern. Service
catalog updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:35:57 +00:00
Viktor Barzin
fe1f8d62e7 tripit: re-apply tripit stack to land CITY_IMAGE_PROVIDER=wikipedia
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The commit that enabled real city cover photos (a69847a0,
CITY_IMAGE_PROVIDER=wikipedia, #47) was committed to master but its CI run
skipped the tripit stack apply (changed-stack diff race — same class as the
prior "re-apply after pipeline race" fixes). The env never landed in-cluster,
so the provider stayed on its fake 1x1-PNG default and every trip/stay cover
rendered blank/placeholder in prod. This comment-only touch forces CI to re-apply
the tripit stack; terraform then reconciles the drift (desired HCL already
has the env) so the deployment picks up CITY_IMAGE_PROVIDER=wikipedia.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 17:45:07 +00:00
Viktor Barzin
2df6ebf305 health: fix middleware ref namespace prefix (restore site from 404)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
My previous commit referenced the new limiter as `health-rate-limit@kubernetescrd`,
omitting the namespace prefix. Traefik CRD middleware refs are
`<namespace>-<name>@kubernetescrd`, and the Middleware lives in the `traefik` ns,
so the router couldn't resolve it — Traefik failed the whole
health.viktorbarzin.me router and returned 404 on every path (the app + pod were
healthy throughout; verified via port-forward).

Correct it to `traefik-health-rate-limit@kubernetescrd`, matching the working
traefik-tripit-rate-limit / traefik-actualbudget-rate-limit references.
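
For reference, the naming rule above as a tiny helper. The function name is hypothetical; the `<namespace>-<name>@kubernetescrd` format is the one stated in this commit.

```python
def middleware_ref(namespace: str, name: str) -> str:
    """Build a Traefik CRD middleware reference: <namespace>-<name>@kubernetescrd."""
    return f"{namespace}-{name}@kubernetescrd"
```

`middleware_ref("traefik", "health-rate-limit")` yields the corrected reference, whereas the broken one was the bare name without the `traefik-` namespace prefix.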

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 17:43:08 +00:00
Viktor Barzin
086ff85911 health: dedicated 100/1000 rate limit for the redesigned SPA
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor hit 429s browsing the redesigned health app. The default shared limiter
is 10 req/s / burst 50, but each page load is the shell (JS chunks + two
self-hosted Geist woff2) plus a 5-8 call API burst, so fast tab-to-tab
navigation from one client IP overruns burst 50 — Traefik 429s the tail and the
affected cards/pages render empty.

Give health its own limiter (average 100, burst 1000) and skip the default,
exactly as tripit/immich/actualbudget/ha-sofia already do for the same
parallel-burst pattern. Attached via the ingress_factory escape hatch
(skip_default_rate_limit + extra_middlewares).
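
Why burst 50 is overrun can be sketched with a simple token bucket. This is a simplified model for illustration, not Traefik's implementation, and the request count used in the note below is a made-up example rather than a measurement from the health app.

```python
from typing import Iterable


def count_rejected(arrival_times: Iterable[float], average: float, burst: int) -> int:
    """Count requests a token bucket would reject (HTTP 429).

    The bucket starts full with `burst` tokens and refills at `average`
    tokens per second; each accepted request consumes one token.
    """
    tokens = float(burst)
    last = None
    rejected = 0
    for t in sorted(arrival_times):
        if last is not None:
            # refill for the elapsed time, capped at the bucket size
            tokens = min(float(burst), tokens + (t - last) * average)
        last = t
        if tokens >= 1:
            tokens -= 1
        else:
            rejected += 1
    return rejected
```

With 60 hypothetical requests arriving at the same instant, the default limiter (average 10, burst 50) rejects the last 10, while the dedicated limiter (average 100, burst 1000) rejects none.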

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 13:03:51 +00:00
Viktor Barzin
6dc77f4612 uptime-kuma: add CONTEXT.md + ADR-0001 (intentionally lean; sizing/placement review)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Documents the 2026-06-13 right-sizing review: Kuma is already lean (~1 check/s, 227 monitors mostly at 300s, 77MB on shared MySQL, 30d retention); the 'scraping too much' concern traced to a fixed socket.io login-timeout incident, not load. Records the deliberate decisions (keep per-service [External] monitors over canaries; keep datastore on shared mysql.dbaas) with rejected alternatives + rationale, plus the known internal-sync no-prune gap (stale Goldilocks monitor cleaned up by hand).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 09:11:22 +00:00
Viktor Barzin
05bec26d09 health: internal test-access ingress + DEV_AUTH_EMAIL (ADR-0008)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Add health-test.viktorbarzin.lan (auth=none, allow_local_access_only,
anti-AI off) pointing at the same health deployment, plus a
DEV_AUTH_EMAIL=vbarzin@gmail.com env on the container. Lets automated
E2E / Playwright / manual screenshots reach the live app without the
Authentik SSO redirect, for testing — while the public
health.viktorbarzin.me ingress stays auth=required (forward-auth fails
closed, so the public path always carries the real X-authentik-email
header and never hits the DEV_AUTH_EMAIL fallback). LAN-only, no public
exposure. Decision recorded in health repo ADR-0008.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 04:02:34 +00:00
Viktor Barzin
e6699ed20b uptime-kuma: retry Kuma login in monitor-sync jobs (intermittent socket.io timeout)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The internal + external monitor-sync CronJobs intermittently failed with socketio.exceptions.TimeoutError on api.login(), firing JobFailed -> Slack noise (and leaving monitor sync stale). Kuma 2.3.2 itself is healthy (1/1, 30m CPU); its single Node event loop just briefly stalls under ~300 monitors so the socket.io login handshake occasionally exceeds the client timeout. Wrap connect+login in a 5-attempt / 15s-backoff retry (disconnecting the half-open client between tries) so a transient stall no longer fails the whole job. Applied to both sync scripts.
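
A minimal sketch of the retry wrapper described above. The callables are placeholders for the sync scripts' connect+login and disconnect steps, and a fixed 15s delay between attempts is an assumption (the commit only says "15s-backoff").

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(
    action: Callable[[], T],
    cleanup: Callable[[], None],
    attempts: int = 5,
    backoff_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`, retrying on any exception up to `attempts` times.

    `cleanup` runs after each failure (e.g. disconnect a half-open client).
    The last exception is re-raised once all attempts are exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            cleanup()
            if attempt == attempts:
                raise               # out of attempts: surface the failure
            sleep(backoff_s)
    raise AssertionError("unreachable")
```

Injecting `sleep` keeps the wrapper testable without waiting; in a real job the default `time.sleep` would be used.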

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 20:54:14 +00:00
Viktor Barzin
a6381b8cf8 forgejo: custom 8Gi ResourceQuota (was pegged at the 4Gi tier cap)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Yesterday's Forgejo 3Gi->4Gi OOM fix pushed its tier-3-edge namespace quota (requests.memory=4Gi) to 100%, firing KubeQuotaAlmostFull + the healthcheck resourcequota check. Forgejo is the git + OCI-registry backbone and legitimately needs ~4Gi, so the edge tier's 4Gi ceiling is too tight. Opt the namespace out of the auto tier quota (resource-governance/custom-quota=true) and define a forgejo-specific ResourceQuota at requests.memory=8Gi, so the 4Gi pod sits at ~50% with headroom. Same opt-out pattern dbaas uses. Re-tiering was rejected: tier 1-cluster is also 4Gi, and 0-core (8Gi) would over-classify Forgejo's priority/eviction.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 17:16:47 +00:00
Viktor Barzin
72982683bc docs(CLAUDE.md): k8s-portal now GHA->ghcr, not a Woodpecker build
All checks were successful
ci/woodpecker/push/default Pipeline was successful
k8s-portal was the last in-cluster image builder. Its .woodpecker/k8s-portal.yml
was deleted; it now builds on GHA (build-k8s-portal.yml) -> PRIVATE ghcr, pulled
via the Kyverno ghcr-credentials allowlist and deployed by Keel. Fix the CI/CD
section: drop k8s-portal from the Woodpecker-pipelines list (stale), move it from
'already on GHA' to the infra-owned private-ghcr images, and add it to the
PRIVATE ghcr allowlist roster. Completes the no-local-builds migration.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 16:10:56 +00:00
Viktor Barzin
25a39fd54e k8s-portal: wire private-ghcr pull (allowlist + imagePullSecrets)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
k8s-portal was the last in-cluster image build; it now builds on GHA and
pushes ghcr.io/viktorbarzin/k8s-portal:latest, which is PRIVATE (infra repo
default). To pull it: add k8s-portal to the sync-ghcr-credentials Kyverno
allowlist (clones the ghcr-credentials Secret into the namespace) and
reference that secret via imagePullSecrets on the deployment — same wiring
as tripit/recruiter-responder. Completes the no-local-builds migration so
nothing builds container images on the cluster anymore (ADR-0002).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 15:38:42 +00:00
Viktor Barzin
a7d33abec9 k8s-portal: commit package.json + lock (force; was gitignored) — unblocks GHA build
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build k8s-portal / build (push) Has been cancelled
Recovered the real manifest + resolved lockfile (lockfileVersion 3, 71 pkgs)
from the running pod. A parent .gitignore force-ignored package.json, so the
git source tree was incomplete and the image only ever built manually. Now
reproducible on GHA (ADR-0002 no-local-builds).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 15:29:27 +00:00
Viktor Barzin
a9b08c03cf fix(k8s-portal): npm install (no committed lockfile) so GHA can build
Some checks are pending
Build k8s-portal / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
package-lock.json was never committed to either lineage — npm ci needs it,
so the build only ever worked from a manual devvm build with a local lock.
npm install resolves from package.json, unblocking the GHA build (ADR-0002).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 15:26:42 +00:00
Viktor Barzin
bdfdf8db72 fix(ci): k8s-portal build context is stacks/k8s-portal/modules/k8s-portal/files (was stale platform/ path)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 15:23:46 +00:00
Viktor Barzin
b906f61ac3 k8s-portal: build off-infra GHA -> ghcr + Keel; remove Woodpecker build (no-local-builds)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
The last in-cluster image build. GHA build-k8s-portal.yml builds
ghcr.io/viktorbarzin/k8s-portal:latest+sha (path-filtered on the Dockerfile
dir); Keel (force/poll/match-tag) rolls the deployment. Stack image repointed
to ghcr (ignore_changes); .woodpecker/k8s-portal.yml deleted.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 15:21:35 +00:00
Viktor Barzin
9501da81a0 dbaas: document postgresql-backup startingDeadlineSeconds rationale
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Inline note on why the four backup CronJobs moved 10s->600s (bda1bdcb): a 10s deadline silently dropped the 2026-06-13 midnight full-backup run, firing PostgreSQLBackupStale. bda1bdcb rode in the same push as a forgejo change that failed CI on a namespace-quota error, so that pipeline failed before the dbaas apply took effect (live deadline was still 10s). This dbaas-only commit re-triggers the dbaas apply at a clean master so the 600s deadline actually goes live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 14:22:24 +00:00
Viktor Barzin
ba72621e52 forgejo: 6Gi exceeded namespace quota, set to 4Gi (quota ceiling)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
The 3Gi->6Gi bump in ff3cc44a was rejected by the forgejo namespace tier-quota (requests.memory capped at 4Gi). With Guaranteed QoS the 6Gi request exceeded quota; FailedCreate left forgejo with 0 pods for ~6 min (git remote + OCI registry outage) until I patched the live Deployment back to a schedulable 4Gi. 4Gi is the most the quota allows and is still a headroom bump over the OOM-prone 3Gi. To go higher the tier-quota must be raised in the same change. This reconciles TF to the live 4Gi so the pending/next apply is a no-op rather than reverting to the quota-busting 6Gi.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 14:13:36 +00:00
Viktor Barzin
ff3cc44a29 forgejo: raise memory limit from 3Gi to 6Gi (OOMKilled at 3Gi)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Forgejo OOMKilled twice on 2026-06-13 at the 3Gi cap (exit 137), briefly taking the git remote and OCI registry down and spiking ingress TTFB to 4.7s and the 4xx rate to 51%. Steady-state is ~2.2Gi but it spiked into the cap (true demand above 3.2Gi). The 2026-06-09 bump to 3Gi was sized for tripit buildkit registry pushes, but that driver is gone now that the Forgejo registry was frozen and emptied today (ADR-0002, images on ghcr), so the spike is git ops / the integrity-probe catalog walk / a possible leak. 6Gi gives headroom on the critical git backbone while we watch whether working-set keeps climbing (which would indicate a leak).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 14:02:55 +00:00
Viktor Barzin
bda1bdcbf3 dbaas: widen backup CronJob startingDeadlineSeconds from 10s to 600s
The daily full PostgreSQL backup silently skipped its 2026-06-13 00:00 run, leaving the last full dump 37h old and firing the critical PostgreSQLBackupStale alert. Root cause: startingDeadlineSeconds was 10s on all four dbaas backup CronJobs, so when the CronJob controller was more than 10s late to the midnight tick (many IO-heavy backups all fire at 00:00, the known etcd-starvation window) the run was dropped entirely instead of starting late. 600s lets a brief controller lag still launch the job. Applied to all four (mysql + pg, full + per-db) since they share the footgun and the midnight contention.
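
The deadline rule can be sketched as a one-line model: a scheduled run is only started if the controller gets to it within startingDeadlineSeconds of the scheduled time. This is a simplification for illustration, and the 12s lag below is a hypothetical figure, not a measured one.

```python
def run_is_started(controller_lag_s: float, starting_deadline_s: float) -> bool:
    """True if a run that the controller reaches `controller_lag_s` late still starts."""
    return controller_lag_s <= starting_deadline_s
```

With a hypothetical 12s lag at the midnight tick, `run_is_started(12, 10)` is False (the run is dropped) and `run_is_started(12, 600)` is True (the run starts late).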

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 14:02:54 +00:00
Viktor Barzin
3e82c64a76 docs: sync CI/CD docs to ADR-0002 final state (ghcr + Woodpecker deploy-only) [ci skip]
ADR-0002 is fully landed (issues #11-#32 closed): every owned image now
builds on GitHub Actions and pushes to ghcr.io/viktorbarzin/<name>, with
Woodpecker reduced to deploy-only. The Forgejo container registry is frozen
and emptied; there are no in-cluster image builds or CI test runs anywhere.
The docs still described the old hybrid topology (DockerHub builds,
Woodpecker-native owned-app builds, the per-pattern migration lists, the
tripit-only pilot framing), which would mislead future sessions and
incident response.

This brings the docs to the completed reality (closes #33):

- docs/architecture/ci-cd.md: full rewrite as the canonical CI/CD reference —
  the fleet GHA->ghcr->Woodpecker-deploy pattern, public/private ghcr package
  split, infra-owned image workflows (incl. infra-ci on ghcr), the frozen
  Forgejo registry, what Woodpecker still runs, and the #31 decommissions.
- .claude/CLAUDE.md: rewrite the "CI/CD Architecture" section to the
  fleet-wide final state; FIX the stale claim that claude-memory-mcp builds
  to DockerHub (it is GHA->ghcr); note owned images now live on ghcr and the
  Forgejo registry is frozen/break-glass near the image-registry bullet.
- .claude/reference/service-catalog.md: f1-stream is GHA->ghcr + Woodpecker
  deploy-only (was "Woodpecker-native build->deploy").
- stacks/{tuya-bridge,android-emulator}/variables.tf + stacks/terminal/main.tf:
  cosmetic description/comment updates (forgejo -> ghcr; terminal-lobby has no
  CI pipeline). Description/comment text only — no stack logic changed.

Historical records (docs/post-mortems/*, docs/plans/*) and ADR-0002 itself
are left untouched as point-in-time records.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 12:55:49 +00:00
Viktor Barzin
6e4db0ddc6 openclaw + f1-stream: last forgejo image refs -> ghcr (ADR-0002 #32 prep)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
openclaw's install-nextcloud-todos-plugin init still pulled forgejo
nextcloud-todos (would ImagePullBackOff on restart once the forgejo
registry is wiped) -> ghcr:latest. f1-stream stack base (KEEL_IGNORE'd,
live already ghcr via set-image) repointed for fresh-create correctness.
Clears the last LIVE forgejo viktor/* refs before the registry reclaim.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 12:36:10 +00:00
Viktor Barzin
3c3e6bfc95 ci: retire in-cluster infra-ci build; breakglass becomes manual ghcr pull-and-save (ADR-0002 #30)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
infra-ci now builds on GHA → ghcr and the ghcr-based apply is PROVEN
(pipeline 165 ran terragrunt apply in the ghcr image). Removing the
Woodpecker build-ci-image.yml (clean cut). The breakglass tarball is
preserved as a MANUAL Woodpecker job pulling ghcr (public) → registry VM;
infra-ci on ghcr is external + node-cached, so the Forgejo-down rationale
for the old auto-tarball is moot — this is belt-and-braces DR.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 10:07:58 +00:00
Viktor Barzin
ee25a41c74 ci: apply + drift steps run on ghcr infra-ci (ADR-0002 #30)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The terragrunt apply step (default.yml) and drift-detection now pull
ghcr.io/viktorbarzin/infra-ci:latest (GHA-built, verified toolchain:
tf 1.5.7 / tg 0.99.4 / sops / kubectl 1.34 / vault / git-crypt). ghcr is
public + proven pullable in-cluster. build-ci-image.yml (forgejo build)
KEPT as the fallback copy until this ghcr-based apply is proven, so a
revert restores the working forgejo image if needed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 10:05:34 +00:00
Viktor Barzin
23fc2bf2ec ci: GHA→ghcr build for infra-ci (ADR-0002 #30, bootstrap-safe — woodpecker build kept until proven)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 09:53:43 +00:00
Viktor Barzin
eb8b550521 chrome-service: TF-manage novnc image (ghcr:latest), drop its KEEL_IGNORE (ADR-0002 #29)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
novnc's image was ignore_changed (KEEL_IGNORE) but nothing manages its
tag (keel.sh/policy=never), so the earlier forgejo->ghcr repoint never
took. Removing container[1].image from ignore_changes lets terragrunt
own novnc=ghcr:latest and roll it. container[0]/[2] (pinned playwright)
stay ignored.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 09:49:58 +00:00
Viktor Barzin
94a3d1b870 chrome-service-novnc + android-emulator images -> ghcr (ADR-0002 #29/#30)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Both now built by GHA → public ghcr. Repoint stack image bases
forgejo→ghcr:latest (terragrunt-managed, imagePullPolicy Always picks up
rebuilds). android var default api36-v8 -> latest.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 09:43:40 +00:00
Viktor Barzin
a69847a0f3 tripit: enable Wikipedia city cover photos (CITY_IMAGE_PROVIDER=wikipedia, #47)
Flips the planning workspace's Stay cover photos from the fake provider to live Wikipedia lead-image fetches (downloaded into STORAGE_DIR, served by the backend, editable per Stay). Part of the new-trip flow feature: every picked destination city gets a banner-ready cover. HOLD-ORDER: pushed only after the tripit image containing CityImageMode.wikipedia rolled out.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 09:43:40 +00:00
Viktor Barzin
1621f0b204 ci: GHA→ghcr builds for chrome-service-novnc, android-emulator, infra CLI (ADR-0002 #29/#30)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Infra-owned rare-build images move off Woodpecker/manual to GHA (build
from the github checkout — Dockerfiles verified identical on both
remotes). chrome-service-novnc + android-emulator → public ghcr
(dispatch+path). CLI → DockerHub (kept) + ghcr; Woodpecker build-cli.yml
removed. infra-ci handled separately (bootstrap-critical).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 09:38:36 +00:00
Viktor Barzin
f61d707d75 travel_blog: remove decommissioned stack (ADR-0002 infra#31)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Service was already scaled 0/0 and unused (Viktor: 'not used anymore').
Live resources destroyed via scripts/tg destroy (10 resources: deployment,
namespace, service, anubis-travel + PDB/cm/svc/secret, ingress, TLS).
Removing the stack dir; old Woodpecker build (repo 5) deactivated
separately. The harmless legacy 'travel' CNAME->apex in config.tfvars is
left (now 404s; removing it would trigger a full-platform apply).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 09:32:39 +00:00
Viktor Barzin
90fb0685ae traefik: x402-gateway image forgejo -> ghcr + KEEL_IGNORE_IMAGE (ADR-0002 infra#28)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ci/woodpecker/push/build-cli Pipeline was successful
Formalizing x402-gateway CI (was a manual no-CI image). The deployment
lives in the traefik module; its image was NOT in ignore_changes, so a
set-image deploy would be reverted on the next traefik apply — added it
(KEEL_IGNORE_IMAGE). Base repointed to ghcr:latest; the GHA deploy
set-images the :sha8. Public ghcr package = no pull secret. Inert on the
live pod (image now ignored); rolling cutover keeps forwardAuth up.
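
The revert risk described above can be sketched as a toy diff. This is not how Terraform is implemented: `planned_changes` is a hypothetical helper, and the image strings are placeholders rather than the real references. It only shows that an attribute listed in ignore_changes is left out of the diff, so a tag written by a set-image deploy survives the next apply.

```python
def planned_changes(config: dict, live: dict, ignore_changes: tuple = ()) -> dict:
    """Return {attribute: (live_value, config_value)} for attributes an apply would overwrite."""
    return {
        attr: (live.get(attr), wanted)
        for attr, wanted in config.items()
        if attr not in ignore_changes and live.get(attr) != wanted
    }


# Placeholder values: the base in config is :latest, the deploy set a sha8 tag.
config = {"image": "example/x402-gateway:latest", "replicas": 1}
live = {"image": "example/x402-gateway:0a1b2c3d", "replicas": 1}

without_ignore = planned_changes(config, live)
with_ignore = planned_changes(config, live, ignore_changes=("image",))
print(without_ignore)  # the apply would revert the deployed tag
print(with_ignore)     # nothing to change: the deployed tag is kept
```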

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 02:42:45 +00:00