infra

Author	SHA1	Message	Date
Viktor Barzin	a0725ede57	chrome-service: stop ignoring container[0].image so TF re-asserts the pinned browser image The chrome-service container (container[0]) runs the pinned Microsoft Playwright image, which ships chromium under /ms-playwright. Its image was still listed in the deployment's lifecycle ignore_changes — a leftover KEEL_IGNORE from before ADR-0002 #29 moved the novnc container to TF management. With that field ignored, a stray clobber of container[0] to ghcr chrome-service-novnc:latest (which has no chromium there) stuck permanently: the container crash-looped ~12h on "chromium binary not found under /ms-playwright" (273 restarts) and TF could not revert it. Remove container[0].image from ignore_changes so Terraform pins it to local.image and re-asserts it on every apply. Both containers are TF-managed now (novnc since ADR-0002 #29); Keel is inert (policy=never), so nothing should fight TF here. Surfaced by /cluster-health. Live state was already restored transiently via kubectl set image; this commit makes the fix durable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 08:18:32 +00:00
Emil Barzin	1ba453c65d	fan-control docs: sync runbook/env/service/design to the HA-actuator + anti-flap model The committed docs still described the 2026-06-04 presence-aware daemon. Bring them in line with what is actually deployed: HA computes the setpoint, the host is a thin actuator (COMMAND_ENTITY/STALE_SECS/HA_GRACE_SECS), additive bias, anti-flap hold-last, and the new HA readout sensors (command/equilibrium/cpu_load/fan_speed_avg/fan_power_avg). Earlier doc edits were made in a clone lost in the workstation reshuffle; re-created here. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 08:11:48 +00:00
Emil Barzin	5bc3d27d1b	Merge remote-tracking branch 'forgejo/master' into emo/fan-control-ha-actuator	2026-06-16 08:08:27 +00:00
Emil Barzin	2cfe338419	fan-control: hold last command through transient HA losses (stop fan flapping) The actuator dumped the fans to Dell auto on every brief loss of the HA command (~14% of the time, every few minutes) — crashing them to the ~7100 rpm floor and bouncing back: the "fans surge then crash then surge" the owner reported. Causes: the command sensors last_updated going >120s old whenever CPU temp sat flat (mis-read as stale), plus occasional unavailable blips. Fix: on a missing/stale command, HOLD the last applied % for up to HA_GRACE_SECS (300s) instead of falling back, and loosen STALE_SECS 120->1800 (staleness only happens at flat temp, where the held value is still valid). The 83C CPU CEILING on our own IPMI read stays the real overheat safety. Verified live: fallback 14% -> 0% over 8h, command std 16 -> 3, no more rpm floor crashes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 08:07:52 +00:00
Viktor Barzin	57d45d8d8f	fix(authentik): pin Vault binding UUIDs as literals (provider has no authentik_application data source) CI pipeline 198 failed: the pinned goauthentik/authentik provider has no data "authentik_application" source, so terraform failed the whole authentik plan and applied NOTHING (state unchanged). Replace the data-source lookups with the live pbm_uuid (Vault app) and group_uuid (Allow Login Users) as literals; authentik_policy_binding is supported (used in guest.tf). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 22:01:29 +00:00
Viktor Barzin	aa461b95bc	feat(authentik): bind Vault OIDC app to Allow Login Users (close ADR-0020 OIDC gap) Audit found the Vault Authentik application had no authorization binding, so any authenticated identity (incl. a future self-enrolled TripIt External user) could complete Vault OIDC login and get a built-in default-policy token. Bind it to 'Allow Login Users' — existing homelab users inherit that group via its children (verified User.all_groups() includes the parent), parentless TripIt External users are excluded. Closes the only OIDC app the forward-auth fence does not cover. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 21:48:04 +00:00
Viktor Barzin	cbca281aaa	feat(authentik): TripIt external self-signup group + forward-auth fence (ADR-0020) Viktor wants people outside the homelab to self-register to TripIt with email + a passkey (no password), kept separate from the rest of the homelab. Adds the empty, parentless 'TripIt External' Authentik group and a first-position branch in the catch-all policy that admits those users to tripit.viktorbarzin.me only and denies every other forward-auth host. Inert on apply (group empty => matches no existing user => no lockout). An adversarial review found the fence is forward-auth-only, so the runbook records the OIDC-app containment audit (every sensitive app already requires a trusted group External users won't hold), the Vault->Allow Login Users binding that closes the one open OIDC app, the SMTP prerequisite for email verification, and the before/after access-matrix verification. Flows/SMTP/Vault binding are UI steps per the runbook; the push that applies the catch-all edit must be human-watched (CI auto-applies the authentik stack). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 21:48:04 +00:00
Viktor Barzin	cf51cb45de	docs(adr-0003): keep Forgejo canonical, complete the GitHub mirror (reject swap) Grilled the 'swap Forgejo for GitHub' idea. Root cause of the divergence pain is an incomplete push-mirror rollout (14 repos dual-pushed, push_mirrors=0), not Forgejo itself — and CONTEXT.md already documents Forgejo-canonical + one-way GitHub mirror. Decision: don't swap; finish the mirror, name the GitHub-first exceptions, reconcile infra, enforce one-remote-per-clone. Adds ADR-0003 + the GitHub-first repo glossary term + dual-push/force-overwrite warnings on Canonical repo / GitHub mirror. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-15 21:32:28 +00:00
Viktor Barzin	5d3a166b94	t3-afk: fix agent Bash — stop mounting into ~/.claude Root cause of "the agent never commits": the issue-implementer CLAUDE.md was subPath-mounted at /home/node/.claude/CLAUDE.md, which made /home/node/.claude root-owned. The agent (uid 1000) then couldn't create its Bash session-env there, so EVERY Bash/git call failed (Write/Edit worked, so it silently edited but never committed). Found by reading the agent transcripts from state.sqlite -> projection_thread_messages. Fix: don't mount anything into ~/.claude (it's not honored by T3's SDK anyway). Behaviour is injected via the dispatch message preamble by the control plane; files/issue-implementer-CLAUDE.md kept as the canonical source text. Verified post-fix: a preamble-dispatched task edited README and COMMITTED (073ab28) unattended. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 20:49:34 +00:00
Viktor Barzin	34c30ac2bf	t3-afk: auto-pair dispatcher sidecar — no manual pairing The bare `t3 serve` behind Authentik showed the manual /pair#token screen, which didn't connect. Mirror the devvm t3-dispatch: a small stdlib-Node sidecar fronts t3 serve, and on a cookieless (already Authentik-gated) document load it mints a pairing credential (`t3 auth pairing create`) and exchanges it at /api/auth/browser-session for the t3_session cookie, then 302s back. Everything else — including WebSocket upgrades for the live cockpit — reverse-proxies to :3773. The Service now targets the sidecar (:8080). Verified: cookieless GET -> 302 + Set-Cookie t3_session; cookied GET -> 200 SPA. Matches the t3.viktorbarzin.me experience (Authentik login -> straight into the cockpit). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 20:19:39 +00:00
Viktor Barzin	92c5b24975	docs: ghcr_pull_token is now a scoped read:packages PAT, not the admin alias Minted a dedicated classic GitHub PAT scoped to read:packages and stored it in Vault secret/viktor/ghcr_pull_token (2026-06-15), replacing the previous alias of the broad admin github_pat. Propagated via targeted apply of module.kyverno.kubernetes_secret.ghcr_credentials (Kyverno re-syncs the allowlisted namespaces). Document the new cred + the manual rotation recipe. Closes: code-h2il Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-15 20:19:17 +00:00
Viktor Barzin	ef555c7e02	workstation: put ~/.local/bin on PATH so the launcher finds native claude Viktor hit "~/.local/bin is not part of the PATH". Root cause: the native claude binary lives in ~/.local/bin, but the terminal launcher (start-claude.sh) runs in tmux's NON-login bash env, which doesn't source the user's shell rc where the native installer put ~/.local/bin on PATH. So `command -v claude` failed there → the launcher's bootstrap re-ran the native installer → the installer printed the PATH warning. (Interactive zsh already had ~/.local/bin via the per-user installer rc edit, and t3-serve sets PATH in its unit — so only the terminal launcher was affected.) - skel/start-claude.sh: prepend ~/.local/bin to PATH near the top (guarded/idempotent), before the launch logic — so `claude` is found, no reinstall, no warning. - setup-devvm.sh: install /etc/profile.d/10-local-bin.sh — adds ~/.local/bin to PATH for all LOGIN shells machine-wide (SSH etc.), independent of the per-user installer rc edit (fresh-user-safe). zsh login picks it up via /etc/zsh/zprofile -> /etc/profile. - docs/architecture/multi-tenancy.md: documented the three PATH-injection points. Verified: guard adds-when-missing / no-dup-when-present; all scripts pass bash -n. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:20:03 +00:00
Viktor Barzin	eecd78233b	workstation: standardize on the native claude install (drop npm-global + npx) Question from Viktor: should claude run via the binary or npx? Answer: the native install is the recommended runtime (self-contained, self-updating ~/.local/bin/claude; installMethod=native) — and every existing user had already auto-migrated to it, leaving the npm-global copy empty and the npx fallback dead. "Leave only the recommended setup": - setup-devvm.sh: node is now installed ONLY for the t3 CLI; dropped the machine-wide `npm install -g @anthropic-ai/claude-code` (npm/npx is not the recommended runtime and just shadowed the per-user native installs). - t3-provision-users.sh: new per-user `install_user_claude_native` (runs the official https://claude.ai/install.sh AS the user, idempotent/skip-if-present) — provisions native claude for BOTH the terminal launcher and each t3-serve instance, replacing the npm bootstrap. - skel/start-claude.sh: launcher runs the native `claude` only; if missing it bootstraps via the native installer (was an `npx @anthropic-ai/claude-code` fallback). - docs/architecture/multi-tenancy.md: documented the native-only runtime model. node stays (the pinned t3 CLI is npm-global). Verified: native installer reachable + produces ~/.local/bin/claude 2.1.177; all three scripts pass bash -n + shellcheck. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:12:05 +00:00
Viktor Barzin	4a48f065e9	mcp: drop project-scoped paperless from .mcp.json (paperless is now wizard-only) Paperless is a personal tool for wizard, not shared. It was project-scoped in the infra repo's .mcp.json (the in-cluster paperless-mcp proxy), so every user whose ~/code IS an infra clone (emo, ancamilea) auto-loaded it. Per request, paperless should be wizard-only: wizard now runs his own direct, token-based paperless MCP in his user-scope config (a local barryw/paperlessmcp container -> paperless-ngx). Removing the shared entry so emo and other infra-clone users no longer get it; the `ha` MCP stays project-scoped. emo's clone drops it on next freshen. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:03:37 +00:00
Viktor Barzin	bb3f5f2329	workstation: stop the Claude Code onboarding wizard reappearing for terminal users emo reported being "logged out" on terminal.viktorbarzin.me: every new shell dropped him at the first-run "Choose the text style" wizard, even though he'd used many sessions and is in fact fully authenticated. Root cause is NOT a logout — ~/.claude.json is a single file that all of a user's concurrent claude processes (the ttyd terminal + their t3-serve instance + agent sessions) read-modify-write, and a stale writer periodically drops top-level keys, including hasCompletedOnboarding. That bounces the next interactive session back to onboarding; credentials are safe in the separate ~/.claude/.credentials.json (which is why T3 kept working). wizard's own ~/.claude.json showed the same key loss, so this hits any heavy multi-session user. Fix: - skel/start-claude.sh: ensure_onboarding() idempotently re-asserts hasCompletedOnboarding (+ lastOnboardingVersion) in ~/.claude.json right before launching claude. Merge-only (never clobbers other keys), runs as the user, and no-ops if jq is missing or the file is empty/corrupt. So even if the race drops the flag, the next launch restores it before claude reads it. - t3-provision-users.sh: deploy_user_launcher() re-copies skel/start-claude.sh into every non-admin home (copy-if-changed) on the hourly reconcile. /etc/skel only seeds the launcher at account creation, so without this the fix (and any future launcher edit) would never reach existing users. .tmux.conf is deliberately not re-copied — terminal-lobby appends a managed section to it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:37:59 +00:00
Viktor Barzin	82a0c5aedf	t3-afk: fix crashloop — exclude from Keel at the deployment level Keel "patch"-downgraded the image docker.io/library/node:24 -> library/node:24.0.2, which is below t3@0.0.27's required node >=24.10, so `t3 serve` exited silently and the pod crash-looped (~160 restarts / 13h). Root cause: keel.sh/policy=never was on the POD-TEMPLATE labels, but Keel reads the policy at the DEPLOYMENT level. The cluster's Kyverno inject-keel-annotations is opt-out, so it stamped policy=patch and Keel acted on it. Fix: set keel.sh/policy=never as a deployment-level annotation; ignore_changes the Kyverno-injected keel.sh/pollSchedule + keel.sh/trigger annotations; the image stays TF-owned (apply reverted Keel's downgrade). Pod now 1/1, t3 serve 200. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 10:32:38 +00:00
Viktor Barzin	214638216b	fix(anisette): wait_for_rollout=false so a slow first start can't strand the deploy out of state The docker.io fix created the deployment, but wait_for_rollout (default true) then hung on the OOMing pod and the apply failed — leaving the deployment in the cluster but NOT in terraform state, so every later apply hit 'deployments.apps "anisette" already exists'. Deleted that orphan and set wait_for_rollout=false (mirrors tts/llama-cpp slow-start services); readiness probe still gates Service traffic. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 20:56:30 +00:00
Viktor Barzin	d8c60d7ab8	t3-afk: dedicated in-cluster T3 Code instance (AFK executor + cockpit) Slice #2 of claude-agent-service PRD #1 (AFK implementation pipeline). Dedicated in-cluster T3 Code instance the control plane dispatches issues into; runs the issue-implementer agent in a git worktree with a live cockpit. Applied + live 2026-06-14 (9 resources). Pilot-fast: stock docker.io/library/node:24 + install pinned t3@0.0.27 + Claude CLI at startup onto an SSD-NFS PVC. Authentik-gated ingress. issue-implementer behaviour ships as a user-level ~/.claude/CLAUDE.md (T3 hardcodes the system prompt; settingSources loads it) and forbids plan-mode/clarifying-questions so unattended threads don't stall. Keel-excluded (ADR 0003). wait_for_rollout=false (slow first start). Image fully-qualified for the Kyverno trusted-registries allowlist; container mem limit 4Gi (tier-aux LimitRange cap). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 20:06:33 +00:00
Viktor Barzin	bc7b28244f	fix(anisette): raise memory limit to 512Mi — 128Mi OOMKilled at startup The pod CrashLooped with OOMKilled (exit 137): anisette downloads and initializes Apple's CoreADI provisioning library on startup, spiking past the 128Mi limit before it can bind :6969 (empty logs, liveness 'connection refused'). Bump request 256Mi / limit 512Mi; steady state is much lower. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 19:54:13 +00:00
Viktor Barzin	96addf65b4	fix(anisette): docker.io/ image prefix to pass Kyverno require-trusted-registries First apply was denied at admission — a bare dadoum/anisette-v3-server@sha256 ref isn't in the trusted-registries allowlist (only enumerated DockerHub user-repo prefixes are). docker.io/* IS allowlisted, so use the explicit registry prefix; still pulls via the 10.0.20.10 pull-through cache. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 19:47:05 +00:00
Viktor Barzin	0bfa6f0774	feat(anisette): self-hosted Apple anisette server for SideStore (infra #40) Deploy a small stateless anisette-data server so the TripIt iOS Shell can be sideloaded with SideStore using a free Apple ID, without brokering the Apple-ID auth dance through a public third-party anisette server (which would see every login). SideStore points at a stable internal endpoint we control. - Image: Dadoum/anisette-v3-server, the de-facto standard anisette-v3 server for SideStore/AltStore. Upstream ships only a mutable :latest (no GitHub releases / semver / sha tags), so pinned by manifest digest instead of a tag per the "never :latest" rule. Pulled from DockerHub via the registry-VM pull-through cache like echo/cyberchef. Diun watches :latest (notify-only) so a new upstream build prompts a digest re-pin. - Stateless: emptyDir backs the provisioning-library cache dir (regenerable download; upstream issue #23 means it doesn't preserve client auth across restarts anyway) — no PVC, no Vault secret. - Internal-only endpoint http://anisette.viktorbarzin.lan (auth=none, allow_local_access_only, ssl_redirect off) — SideStore is a native client that can't do the Authentik cookie dance, same reasoning as android-emulator's adb. The .lan CNAME is auto-created by technitium-ingress-dns-sync; never publicly exposed. Mirrors the echo/networking-toolbox/android-emulator stack pattern. Service catalog updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 19:35:57 +00:00
Viktor Barzin	fe1f8d62e7	tripit: re-apply tripit stack to land CITY_IMAGE_PROVIDER=wikipedia The commit that enabled real city cover photos (`a69847a0`, CITY_IMAGE_PROVIDER=wikipedia, #47) was committed to master but its CI run skipped the tripit stack apply (changed-stack diff race — same class as the prior "re-apply after pipeline race" fixes). The env never landed in-cluster, so the provider stayed on its fake 1x1-PNG default and every trip/stay cover rendered blank/placeholder in prod. This comment touch forces CI to re-apply the tripit stack; terraform then reconciles the drift (desired HCL already has the env) so the deployment picks up CITY_IMAGE_PROVIDER=wikipedia. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 17:45:07 +00:00
Viktor Barzin	2df6ebf305	health: fix middleware ref namespace prefix (restore site from 404) My previous commit referenced the new limiter as `health-rate-limit@kubernetescrd`, omitting the namespace prefix. Traefik CRD middleware refs are `<namespace>-<name>@kubernetescrd`, and the Middleware lives in the `traefik` ns, so the router couldn't resolve it — Traefik failed the whole health.viktorbarzin.me router and returned 404 on every path (the app + pod were healthy throughout; verified via port-forward). Correct it to `traefik-health-rate-limit@kubernetescrd`, matching the working traefik-tripit-rate-limit / traefik-actualbudget-rate-limit references. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 17:43:08 +00:00
Viktor Barzin	086ff85911	health: dedicated 100/1000 rate limit for the redesigned SPA Viktor hit 429s browsing the redesigned health app. The default shared limiter is 10 req/s / burst 50, but each page load is the shell (JS chunks + two self-hosted Geist woff2) plus a 5-8 call API burst, so fast tab-to-tab navigation from one client IP overruns burst 50 — Traefik 429s the tail and the affected cards/pages render empty. Give health its own limiter (average 100, burst 1000) and skip the default, exactly as tripit/immich/actualbudget/ha-sofia already do for the same parallel-burst pattern. Attached via the ingress_factory escape hatch (skip_default_rate_limit + extra_middlewares). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 13:03:51 +00:00
Viktor Barzin	6dc77f4612	uptime-kuma: add CONTEXT.md + ADR-0001 (intentionally lean; sizing/placement review) Documents the 2026-06-13 right-sizing review: Kuma is already lean (~1 check/s, 227 monitors mostly at 300s, 77MB on shared MySQL, 30d retention); the 'scraping too much' concern traced to a fixed socket.io login-timeout incident, not load. Records the deliberate decisions (keep per-service [External] monitors over canaries; keep datastore on shared mysql.dbaas) with rejected alternatives + rationale, plus the known internal-sync no-prune gap (stale Goldilocks monitor cleaned up by hand). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 09:11:22 +00:00
Viktor Barzin	05bec26d09	health: internal test-access ingress + DEV_AUTH_EMAIL (ADR-0008) Add health-test.viktorbarzin.lan (auth=none, allow_local_access_only, anti-AI off) pointing at the same health deployment, plus a DEV_AUTH_EMAIL=vbarzin@gmail.com env on the container. Lets automated E2E / Playwright / manual screenshots reach the live app without the Authentik SSO redirect, for testing — while the public health.viktorbarzin.me ingress stays auth=required (forward-auth fails closed, so the public path always carries the real X-authentik-email header and never hits the DEV_AUTH_EMAIL fallback). LAN-only, no public exposure. Decision recorded in health repo ADR-0008. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 04:02:34 +00:00
Viktor Barzin	e6699ed20b	uptime-kuma: retry Kuma login in monitor-sync jobs (intermittent socket.io timeout) The internal + external monitor-sync CronJobs intermittently failed with socketio.exceptions.TimeoutError on api.login(), firing JobFailed -> Slack noise (and leaving monitor sync stale). Kuma 2.3.2 itself is healthy (1/1, 30m CPU); its single Node event loop just briefly stalls under ~300 monitors so the socket.io login handshake occasionally exceeds the client timeout. Wrap connect+login in a 5-attempt / 15s-backoff retry (disconnecting the half-open client between tries) so a transient stall no longer fails the whole job. Applied to both sync scripts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-13 20:54:14 +00:00
Viktor Barzin	a6381b8cf8	forgejo: custom 8Gi ResourceQuota (was pegged at the 4Gi tier cap) Yesterday's Forgejo 3Gi->4Gi OOM fix pushed its tier-3-edge namespace quota (requests.memory=4Gi) to 100%, firing KubeQuotaAlmostFull + the healthcheck resourcequota check. Forgejo is the git + OCI-registry backbone and legitimately needs ~4Gi, so the edge tier's 4Gi ceiling is too tight. Opt the namespace out of the auto tier quota (resource-governance/custom-quota=true) and define a forgejo-specific ResourceQuota at requests.memory=8Gi, so the 4Gi pod sits at ~50% with headroom. Same opt-out pattern dbaas uses. Re-tiering was rejected: tier 1-cluster is also 4Gi, and 0-core (8Gi) would over-classify Forgejo's priority/eviction. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-13 17:16:47 +00:00
Viktor Barzin	72982683bc	docs(CLAUDE.md): k8s-portal now GHA->ghcr, not a Woodpecker build k8s-portal was the last in-cluster image builder. Its .woodpecker/k8s-portal.yml was deleted; it now builds on GHA (build-k8s-portal.yml) -> PRIVATE ghcr, pulled via the Kyverno ghcr-credentials allowlist and deployed by Keel. Fix the CI/CD section: drop k8s-portal from the Woodpecker-pipelines list (stale), move it from 'already on GHA' to the infra-owned private-ghcr images, and add it to the PRIVATE ghcr allowlist roster. Completes the no-local-builds migration. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 16:10:56 +00:00
Viktor Barzin	25a39fd54e	k8s-portal: wire private-ghcr pull (allowlist + imagePullSecrets) k8s-portal was the last in-cluster image build; it now builds on GHA and pushes ghcr.io/viktorbarzin/k8s-portal:latest, which is PRIVATE (infra repo default). To pull it: add k8s-portal to the sync-ghcr-credentials Kyverno allowlist (clones the ghcr-credentials Secret into the namespace) and reference that secret via imagePullSecrets on the deployment — same wiring as tripit/recruiter-responder. Completes the no-local-builds migration so nothing builds container images on the cluster anymore (ADR-0002). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 15:38:42 +00:00
Viktor Barzin	a7d33abec9	k8s-portal: commit package.json + lock (force; was gitignored) — unblocks GHA build Recovered the real manifest + resolved lockfile (lockfileVersion 3, 71 pkgs) from the running pod. A parent .gitignore force-ignored package.json, so the git source tree was incomplete and the image only ever built manually. Now reproducible on GHA (ADR-0002 no-local-builds). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 15:29:27 +00:00
Viktor Barzin	a9b08c03cf	fix(k8s-portal): npm install (no committed lockfile) so GHA can build package-lock.json was never committed to either lineage — npm ci needs it, so the build only ever worked from a manual devvm build with a local lock. npm install resolves from package.json, unblocking the GHA build (ADR-0002). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 15:26:42 +00:00
Viktor Barzin	bdfdf8db72	fix(ci): k8s-portal build context is stacks/k8s-portal/modules/k8s-portal/files (was stale platform/ path) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 15:23:46 +00:00
Viktor Barzin	b906f61ac3	k8s-portal: build off-infra GHA -> ghcr + Keel; remove Woodpecker build (no-local-builds) The last in-cluster image build. GHA build-k8s-portal.yml builds ghcr.io/viktorbarzin/k8s-portal:latest+sha (path-filtered on the Dockerfile dir); Keel (force/poll/match-tag) rolls the deployment. Stack image repointed to ghcr (ignore_changed); .woodpecker/k8s-portal.yml deleted. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 15:21:35 +00:00
Viktor Barzin	9501da81a0	dbaas: document postgresql-backup startingDeadlineSeconds rationale Inline note on why the four backup CronJobs moved 10s->600s (`bda1bdcb`): a 10s deadline silently dropped the 2026-06-13 midnight full-backup run, firing PostgreSQLBackupStale. `bda1bdcb` rode in the same push as a forgejo change that failed CI on a namespace-quota error, so that pipeline failed before the dbaas apply took effect (live deadline was still 10s). This dbaas-only commit re-triggers the dbaas apply at a clean master so the 600s deadline actually goes live. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-13 14:22:24 +00:00
Viktor Barzin	ba72621e52	forgejo: 6Gi exceeded namespace quota, set to 4Gi (quota ceiling) The 3Gi->6Gi bump in `ff3cc44a` was rejected by the forgejo namespace tier-quota (requests.memory capped at 4Gi). With Guaranteed QoS the 6Gi request exceeded quota; FailedCreate left forgejo with 0 pods for ~6 min (git remote + OCI registry outage) until I patched the live Deployment back to a schedulable 4Gi. 4Gi is the most the quota allows and is still a headroom bump over the OOM-prone 3Gi. To go higher the tier-quota must be raised in the same change. This reconciles TF to the live 4Gi so the pending/next apply is a no-op rather than reverting to the quota-busting 6Gi. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-13 14:13:36 +00:00
Viktor Barzin	ff3cc44a29	forgejo: raise memory limit from 3Gi to 6Gi (OOMKilled at 3Gi) Forgejo OOMKilled twice on 2026-06-13 at the 3Gi cap (exit 137), briefly taking the git remote and OCI registry down and spiking ingress TTFB to 4.7s and the 4xx rate to 51%. Steady-state is ~2.2Gi but it spiked into the cap (true demand above 3.2Gi). The 2026-06-09 bump to 3Gi was sized for tripit buildkit registry pushes, but that driver is gone now that the Forgejo registry was frozen and emptied today (ADR-0002, images on ghcr), so the spike is git ops / the integrity-probe catalog walk / a possible leak. 6Gi gives headroom on the critical git backbone while we watch whether working-set keeps climbing (which would indicate a leak). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-13 14:02:55 +00:00
Viktor Barzin	bda1bdcbf3	dbaas: widen backup CronJob startingDeadlineSeconds from 10s to 600s The daily full PostgreSQL backup silently skipped its 2026-06-13 00:00 run, leaving the last full dump 37h old and firing the critical PostgreSQLBackupStale alert. Root cause: startingDeadlineSeconds was 10s on all four dbaas backup CronJobs, so when the CronJob controller was more than 10s late to the midnight tick (many IO-heavy backups all fire at 00:00, the known etcd-starvation window) the run was dropped entirely instead of starting late. 600s lets a brief controller lag still launch the job. Applied to all four (mysql + pg, full + per-db) since they share the footgun and the midnight contention. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-13 14:02:54 +00:00
Emil Barzin	082bdfcc77	fan-control: thin actuator — HA computes the setpoint, host only applies it The R730 fan-control logic now lives entirely in Home Assistant: the curve thresholds, duty %, bias and asymmetric deadband, plus manual/lock, are set on the dashboard and published as sensor.r730_fan_command_pct. The host daemon is reduced to a thin actuator — it reads that one number each loop, validates it (numeric + not older than STALE_SECS) and applies it over IPMI. Removed the presence-aware two-curve logic and the garage-door coupling. Safety stays independent on the host: CPU>=CEILING, repeated IPMI failures, or HA unreachable/stale all hand the fans back to Dell auto. RPM telemetry now averages all 6 chassis fans. Deployed and verified live on pve (applies the HA command; fans follow). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-13 12:59:57 +00:00
Viktor Barzin	3e82c64a76	docs: sync CI/CD docs to ADR-0002 final state (ghcr + Woodpecker deploy-only) [ci skip] ADR-0002 is fully landed (issues #11-#32 closed): every owned image now builds on GitHub Actions and pushes to ghcr.io/viktorbarzin/<name>, with Woodpecker reduced to deploy-only. The Forgejo container registry is frozen and emptied; there are no in-cluster image builds or CI test runs anywhere. The docs still described the old hybrid topology (DockerHub builds, Woodpecker-native owned-app builds, the per-pattern migration lists, the tripit-only pilot framing), which would mislead future sessions and incident response. This brings the docs to the completed reality (closes #33): - docs/architecture/ci-cd.md: full rewrite as the canonical CI/CD reference — the fleet GHA->ghcr->Woodpecker-deploy pattern, public/private ghcr package split, infra-owned image workflows (incl. infra-ci on ghcr), the frozen Forgejo registry, what Woodpecker still runs, and the #31 decommissions. - .claude/CLAUDE.md: rewrite the "CI/CD Architecture" section to the fleet-wide final state; FIX the stale claim that claude-memory-mcp builds to DockerHub (it is GHA->ghcr); note owned images now live on ghcr and the Forgejo registry is frozen/break-glass near the image-registry bullet. - .claude/reference/service-catalog.md: f1-stream is GHA->ghcr + Woodpecker deploy-only (was "Woodpecker-native build->deploy"). - stacks/{tuya-bridge,android-emulator}/variables.tf + stacks/terminal/main.tf: cosmetic description/comment updates (forgejo -> ghcr; terminal-lobby has no CI pipeline). Description/comment text only — no stack logic changed. Historical records (docs/post-mortems/, docs/plans/) and ADR-0002 itself are left untouched as point-in-time records. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 12:55:49 +00:00
Viktor Barzin	6e4db0ddc6	openclaw + f1-stream: last forgejo image refs -> ghcr (ADR-0002 #32 prep) openclaw's install-nextcloud-todos-plugin init still pulled forgejo nextcloud-todos (would ImagePullBackOff on restart once the forgejo registry is wiped) -> ghcr:latest. f1-stream stack base (KEEL_IGNORE'd, live already ghcr via set-image) repointed for fresh-create correctness. Clears the last LIVE forgejo viktor/* refs before the registry reclaim. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 12:36:10 +00:00
Viktor Barzin	3c3e6bfc95	ci: retire in-cluster infra-ci build; breakglass becomes manual ghcr pull-and-save (ADR-0002 #30) infra-ci now builds on GHA → ghcr and the ghcr-based apply is PROVEN (pipeline 165 ran terragrunt apply in the ghcr image). Removing the Woodpecker build-ci-image.yml (clean cut). The breakglass tarball is preserved as a MANUAL Woodpecker job pulling ghcr (public) → registry VM; infra-ci on ghcr is external + node-cached, so the Forgejo-down rationale for the old auto-tarball is moot; this is belt-and-braces DR. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 10:07:58 +00:00
Viktor Barzin	ee25a41c74	ci: apply + drift steps run on ghcr infra-ci (ADR-0002 #30) The terragrunt apply step (default.yml) and drift-detection now pull ghcr.io/viktorbarzin/infra-ci:latest (GHA-built, verified toolchain: tf 1.5.7 / tg 0.99.4 / sops / kubectl 1.34 / vault / git-crypt). ghcr is public + proven pullable in-cluster. build-ci-image.yml (forgejo build) KEPT as the fallback copy until this ghcr-based apply is proven, so a revert restores the working forgejo image if needed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 10:05:34 +00:00
Viktor Barzin	23fc2bf2ec	ci: GHA→ghcr build for infra-ci (ADR-0002 #30, bootstrap-safe; woodpecker build kept until proven) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 09:53:43 +00:00
Viktor Barzin	eb8b550521	chrome-service: TF-manage novnc image (ghcr:latest), drop its KEEL_IGNORE (ADR-0002 #29) novnc's image was ignore_changed (KEEL_IGNORE) but nothing manages its tag (keel.sh/policy=never), so the earlier forgejo->ghcr repoint never took. Removing container[1].image from ignore_changes lets terragrunt own novnc=ghcr:latest and roll it. container[0]/[2] (pinned playwright) stay ignored. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 09:49:58 +00:00
Viktor Barzin	94a3d1b870	chrome-service-novnc + android-emulator images -> ghcr (ADR-0002 #29/#30) Both now built by GHA → public ghcr. Repoint stack image bases forgejo→ghcr:latest (terragrunt-managed, imagePullPolicy Always picks up rebuilds). android var default api36-v8 -> latest. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 09:43:40 +00:00
Viktor Barzin	a69847a0f3	tripit: enable Wikipedia city cover photos (CITY_IMAGE_PROVIDER=wikipedia, #47) Flips the planning workspace's Stay cover photos from the fake provider to live Wikipedia lead-image fetches (downloaded into STORAGE_DIR, served by the backend, editable per Stay). Part of the new-trip flow feature: every picked destination city gets a banner-ready cover. HOLD-ORDER: pushed only after the tripit image containing CityImageMode.wikipedia rolled out. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-13 09:43:40 +00:00
Viktor Barzin	1621f0b204	ci: GHA→ghcr builds for chrome-service-novnc, android-emulator, infra CLI (ADR-0002 #29/#30) Infra-owned rare-build images move off Woodpecker/manual to GHA (build from the github checkout; Dockerfiles verified identical on both remotes). chrome-service-novnc + android-emulator → public ghcr (dispatch+path). CLI → DockerHub (kept) + ghcr; Woodpecker build-cli.yml removed. infra-ci handled separately (bootstrap-critical). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 09:38:36 +00:00
Viktor Barzin	f61d707d75	travel_blog: remove decommissioned stack (ADR-0002 infra#31) Service was already scaled 0/0 and unused (Viktor: 'not used anymore'). Live resources destroyed via scripts/tg destroy (10 resources: deployment, namespace, service, anubis-travel + PDB/cm/svc/secret, ingress, TLS). Removing the stack dir; old Woodpecker build (repo 5) deactivated separately. The harmless legacy 'travel' CNAME->apex in config.tfvars is left (now 404s; removing it would trigger a full-platform apply). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 09:32:39 +00:00
Viktor Barzin	90fb0685ae	traefik: x402-gateway image forgejo -> ghcr + KEEL_IGNORE_IMAGE (ADR-0002 infra#28) Formalizing x402-gateway CI (was a manual no-CI image). The deployment lives in the traefik module; its image was NOT in ignore_changes, so a set-image deploy would be reverted on the next traefik apply; added it (KEEL_IGNORE_IMAGE). Base repointed to ghcr:latest; the GHA deploy set-images the :sha8. Public ghcr package = no pull secret. Inert on the live pod (image now ignored); rolling cutover keeps forwardAuth up. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 02:42:45 +00:00

4319 commits