infra

Author	SHA1	Message	Date
Viktor Barzin	90c944a265	woodpecker: disable partial clone (partial: false) — fix intermittent git exit-128 Infra pipelines were failing intermittently across all authors (e.g. #241-244, #247) with the git clone step exiting 128: git fetch --depth=1 --filter=tree:0 ... (partial/treeless clone) git reset --hard <sha> fatal: could not fetch <tree-sha> from promisor remote remote: 404 page not found The plugin-git clone defaulted to a partial (treeless) clone. The initial ref fetch carries credentials, but the lazy promisor object fetch triggered by `git reset --hard` hits the PRIVATE Forgejo repo without creds -> 404 -> exit 128. Whether it fired was luck-of-the-draw, hence the ~50% intermittent failures fleet-wide (not specific to any commit). Fix: set `partial: false` on every clone block so all objects for the (still shallow) commit are fetched upfront with creds — no fragile lazy promisor fetch. Diagnosed against the woodpecker Postgres DB (steps/log_entries) since the Woodpecker HTTP API was itself flapping. Earlier "permission for ViktorBarzin" log lines were an unrelated cross-forge red herring. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 09:06:44 +00:00
Viktor Barzin	fd77c0dc4f	monitoring: RpiSofiaUndervoltage alerts on new brown-out, not until reboot The rpi-sofia under-voltage alert keyed off the sticky firmware bit (rpi_under_voltage_occurred == 1), which latches on the first brown-out and stays 1 until the Pi reboots. With alert-on-change routing it re-paged on every boot cycle and sat firing for ~211h of the last 14d — Viktor reported "getting a few of these lately" — and it disagreed with the HA-sofia dashboard, which shows the live state and reads OK once voltage recovers. Can't just switch to the live bit: rpi_under_voltage_now never registered once in 14d (brown-outs are sub-second and fall between the 1-min textfile-collector samples), so the sticky bit is the only reliable detector. Fix: edge-trigger on a NEW latch via increase(rpi_under_voltage_occurred[1h]) > 0. Fires once per brown-out and auto-resolves ~1h later (~2h active over the same 14d instead of ~211h); counter-reset handling makes a clean reboot a no-op. Both real brown-out events in the window are still caught. Docs updated in the same commit (monitoring.md). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 08:45:39 +00:00
Viktor Barzin	fbf6f11038	feat(tripit): #96 cutover — /api self-authenticates (remove forward-auth, add strip-auth-headers) ADR-0028 #96 (website half): /api drops Authentik forward-auth so the browser can carry a TripIt session cookie (the outpost 302'd cookie-only requests). The app self-authenticates (TripIt-session-first in get_current_user); no session -> 401 -> SPA landing. strip-auth-headers is REQUIRED now: with forward-auth gone, the hybrid forward-auth arm would otherwise trust a client-injected X-authentik-email — stripping inbound X-authentik-* closes that. /metrics split into its own still-gated ingress. Shell keeps Authentik bearers on tripit-api.* until #94; full AUTH_MODE collapse follows then. Verified live: no-session->401, valid TripIt cookie->200, injected header->401, Shell->200. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 08:27:39 +00:00
Viktor Barzin	8559c4574a	fix(tripit): pin Authentik invalidation_flow literal (data source flakes null in CI under provider skew) Pipeline 244 failed: data.authentik_flow.default_provider_invalidation resolved null in CI (goauthentik 2024.x provider vs 2026.2 server), silently blocking every tripit-stack apply incl. the ADR-0028 #90 signing-key + redirect-URI delivery. Pin the literal UUID (what the slug resolves to) — matches the data-source-skew workaround used for the Vault binding. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 08:10:25 +00:00
Viktor Barzin	e5bb16e02a	feat(tripit): activate TripIt-native session auth — signing key + Authentik web redirect (ADR-0028 #90) Adds SESSION_SIGNING_KEY (Vault secret/tripit -> tripit-secrets ExternalSecret -> env_from) so TripIt's own session JWTs are signed with a real key (the app fails closed under the dev default until this lands), and adds the website OIDC redirect URI https://tripit.viktorbarzin.me/api/auth/callback/authentik to the public tripit-app provider so 'Log in with Authentik' works. Reuses the Shell's existing public OAuth2 app. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 06:06:43 +00:00
Viktor Barzin	077ac97df5	k8s-version-upgrade: auto-restore apiserver OIDC after control-plane bumps kubeadm upgrade apply regenerates the apiserver static-pod manifest and drops the --authentication-config flag, silently breaking SSO (kubectl/kubelogin + the k8s dashboard) until someone manually re-applied the rbac stack. That manual step ran after every control-plane upgrade — the one thing keeping autonomous patch upgrades from being truly hands-off (it bit us this cycle: an earlier master bump left SSO broken until we noticed). Automate it: the rbac stack now publishes its existing OIDC restore script (the same one its null_resource runs) to a kube-system/apiserver-oidc-restore ConfigMap, and the upgrade chain's phase_master re-runs it on master right after the kubeadm upgrade — while tigera-operator is still quiesced so the flag-add apiserver restart can't crashloop it. The script is idempotent and health-gates /livez with auto-rollback; the step is non-fatal (a failure only lags SSO until the next rbac apply, it won't abort the upgrade). phase_master already self-skips when master is at target, so this only fires when master was actually upgraded. The chain SA gets a name-scoped get on that one ConfigMap. Runbook updated: the manual restore is now a documented fallback (command corrected — it needs -replace, since the null_resource trigger hash never changes). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 06:04:30 +00:00
Viktor Barzin	48b63ffa6f	homelab: add memory verb-group (v0.3.0) — direct claude-memory HTTP client Lets agents search/navigate memory via the CLI, as the first step toward deprecating the memory MCP. claude-memory is a FastAPI service (the MCP is just one frontend); homelab memory is a thin Bearer-auth HTTP client over the same API, using the env the hooks already set (CLAUDE_MEMORY_API_URL/KEY). It works even when the MCP frontend is down — the recurring disconnect that took the MCP offline for this whole session. Verbs: recall (server-side semantic search), list, categories, tags, stats, secret (read); store, update, delete (write). Validated against the live API including a store→recall→delete round-trip — full data-plane parity with the MCP. The deprecation itself (rewiring the per-prompt auto-recall + auto-learn hooks to the CLI, then uninstalling the MCP) is a deliberate follow-up, sequenced after the CLI is proven in the hooks — see docs/adr/0008. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 05:56:25 +00:00
Viktor Barzin	3594485f77	homelab: v0.2.0 — docs + version for the k8s verb-group Bump cli/VERSION to v0.2.0; document the k8s verbs (README table + resolver note), add docs/adr/0007 (resolver, read/write split, config-mutation stays raw, db dbaas pattern), and extend the AGENTS.md discovery pointer with the Kubernetes surface. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 22:30:41 +00:00
Viktor Barzin	1f7438bb18	homelab: add k8s verb-group (v0.2) — the biggest remaining surface Mining the post-v0.1 corpus showed kubectl is the dominant remaining domain by far: 11,291 commands across 243 sessions (more than everything else combined). This adds the full k8s verb-group built on an app→namespace→pod resolver (most namespaces hold one app, so <app> defaults to the namespace and the target defaults to deploy/<app>, letting kubectl resolve the pod; -n/--pod/-c/-l/--tty override). Read: status (pods + non-Normal events), get, logs, describe, debug (one-shot triage), pf, rollout-status. Write/operational: db (the dbaas psql/mysql exec pattern — PG via pg-cluster-rw -c postgres, MySQL via mysql-standalone-0 with the env-password bash wrapper, never inline), exec, rm-pod (pods/jobs ONLY), restart. Config-mutation verbs (apply/edit/patch/scale/create) are deliberately NOT exposed — they stay raw per the Terraform-only policy. Smoke-verified read verbs against the live cluster (get/logs/rollout-status); write verbs are unit-tested (resolver, db-plan, shell-quoting) but not fired at live state. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 22:29:51 +00:00
Viktor Barzin	66caa0bf7f	homelab: v0.1 docs, distribution wiring, and version Completes v0.1: documentation, build/install path, and version stamping. - cli/VERSION (v0.1.0) stamped into the binary via ldflags. - cli/README.md rewritten as the homelab overview (verbs + tiers, manifest, build, the preserved legacy webhook use-cases). - docs/adr/0004-0006: why homelab exists (grown in place from infra/cli, not a separate repo), v0.1 scope + everything-allowed/tiers-recorded, and the work/tf behaviour (native worktree entry, verification-gated auto-land, presence-coupled apply). - setup-devvm.sh builds cli/ -> /usr/local/bin/homelab each provisioning run (t3-dispatch pattern), so every devvm user gets the current binary. - AGENTS.md: discovery pointer under Common Operations. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 19:25:51 +00:00
Viktor Barzin	087b415f73	homelab: add work verbs (start/land/clean) with a land verification gate Completes the infra-loop verb surface. work start creates .worktrees/<topic> on <user>/<topic> off <remote>/master (git-crypt-aware, ensures .worktrees is ignored) and prints the path for native EnterWorktree entry. work land fetches, merges master in, verifies, pushes HEAD:master with non-fast-forward retry, and falls back to pushing the feature branch for a PR when the direct push is rejected (branch protection). work clean removes the worktree + branch. Safety: work land REFUSES to push when it cannot verify (no --verify-cmd and no auto-detected suite) unless --no-verify is passed. This was added after an accidental smoke-test invocation pushed unverified WIP to master (benign — the infra CI applied 0 stacks since the diff was cli/-only — but the gate makes an unverified land a deliberate choice, not the default). Known v0.1 limitation: land does not yet block on CI to green; that arrives with the ci/deploy watch verbs. It prints a reminder to follow the pipeline manually. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 19:24:08 +00:00
Viktor Barzin	36d562c15c	homelab: add tf verbs + stack/git-crypt substrate Adds the tf verb-group and the resolver substrate beneath it, continuing the v0.1 infra-loop build. - substrate: findInfraRoot (walk up to terragrunt.hcl + stacks/), stack→dir resolver, and repo/remote/git-crypt detection (preferRemote forgejo>origin, hasGitCryptAttr, gitCryptFlags) — the last is for `work` next. - tf plan/validate/fmt/force-unlock/apply, resolving the stack from cwd and delegating to scripts/tg (which owns state decrypt/encrypt, the Vault lock, and the ingress auth-comment check) rather than calling terragrunt directly. - tf apply is presence-coupled: claims stack:<name>, ALWAYS releases on exit (normal, error, or SIGINT/SIGTERM via sync.Once + signal handler) — fixing the documented ~200-claim leak — and prints an out-of-band reminder since CI applies canonically on push. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 19:16:33 +00:00
Viktor Barzin	ed6f22fd53	homelab: scaffold unified CLI (registry, manifest, claim/release) in infra/cli Begin evolving the existing infra/cli into the agent-facing "homelab" CLI decided in the design/grilling session: one composable, JSON-capable surface for the operations agents run over and over (mined from 51k commands across 2,225 past sessions; the infra inner-loop is ~29% of them). v0.1 targets that loop — work/tf/claim — and ships here, in place, in infra/cli. This first slice: - command registry + dispatcher (longest-prefix verb matching) and a `manifest`/`manifest --json` progressive-discovery entrypoint; every verb declares a read|write tier so write-gating can be added later (everything is allowed for now). - claim/release verbs wrapping the existing presence script (not reimplemented), with label-taxonomy validation. - main() front-dispatches the homelab verb surface but falls through to the legacy webhook -use-case path verbatim, so the in-cluster infra-cli image is unaffected. - fix a pre-existing vet error (glog.Infof missing format directive) that blocked `go test`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 19:12:57 +00:00
Viktor Barzin	70e217db24	k8s-version-upgrade: preflight skips kubeadm-plan gate when master already at target The autonomous 1.34.9 version-upgrade chain has been failing its preflight every night. A prior run left k8s-master + k8s-node1 on 1.34.9 while node2-6 stayed on 1.34.8, and preflight's gate-4 runs `kubeadm upgrade plan` on master. On an already-at-target master, kubeadm prints no "kubeadm upgrade apply vX.Y.Z" line, so the parsed target came back empty and the `!= requested` check aborted the whole chain before any worker was touched. Deterministic — it self-cleaned and re-failed identically each night, so it would have failed again tonight, leaving node2-6 stuck on the old patch. Fix: skip the kubeadm-plan-target gate when master is already on TARGET_VERSION — the same at-target self-skip that phase_master and phase_worker already do. The remaining workers are still validated by their own per-node phases, and the detector already confirmed the target is installable via apt-cache. This lets tonight's unattended chain resume and finish node2-6 -> 1.34.9. Runbook updated: node count 5 -> 7, the gate skip note, and a Past Incidents writeup (incl. the collateral apiserver OIDC wipe, restored via the rbac stack). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 09:17:46 +00:00
Viktor Barzin	8787d361dc	claude-memory: HA (replicas 2 + PDB) to stop recurring MCP disconnects The claude-memory MCP backend ran as a single replica with no PDB, so every voluntary disruption took it to zero for ~30-90s — which surfaced as the memory MCP "keeps getting disconnected" problem. Disruption sources hitting the lone pod: the descheduler (every-5-min CronJob, LowNodeUtilization — caught evicting it live), Keel image bumps, Reloader restarts on the 7-day DB-password rotation, node drains, and CI deploys. The local stdio MCP subprocess itself was proven healthy (fast non-blocking startup, stderr suppressed, graceful degradation), so the fault was purely backend availability, not the MCP plumbing. Fix: run 2 replicas (the backend is stateless FastAPI over shared CNPG Postgres and already has hostname anti-affinity) + restore the PDB at minAvailable=1 (safe now — the drain deadlock that justified removing it only existed at 1 replica) + descheduler evict=false to stop the needless 5-min churn. All five disruption sources become zero-downtime rolling events. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 09:13:36 +00:00
Viktor Barzin	48b7be3b14	feat(tripit): live lodging-price scrape — LODGING_PROVIDER=playwright Viktor asked to turn lodging prices on and stop using the fake provider. Mirrors the existing FARE_PROVIDER wiring: point the Booking.com/Airbnb lodging scraper at the shared chrome-service browser over CDP (the namespace is already admitted through chrome-service's NetworkPolicy for the fare scrape). The lodging code (ADR-0025, tripit #78) is live in tripit 03973b5, so the env lands after that rollout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 06:53:19 +00:00
Viktor Barzin	d709d338c6	service-catalog: add paperless-ai (RAG semantic search + auto-tagging) Document the new paperless-ai service and the two non-obvious operational facts: runtime config lives in the PVC .env (not TF env, which would shadow it), and Qwen3 needs /no_think for parseable tagging output. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 06:44:00 +00:00
Viktor Barzin	4977153dfb	paperless-ai: make the PVC .env the single source of config truth Auto-tagging silently no-op'd: the container env vars set in the deployment shadowed the app's own /app/data/.env, because paperless-ai's dotenv loader does not override process.env. A stale PROCESS_PREDEFINED_DOCUMENTS=yes (with no TAGS) made the scan select zero documents. Strip the wizard-owned behavioural config (Paperless URL, AI provider, model, scan interval, tagging flags) from the container env, keeping only infrastructural env (PUID/PGID/port/RAG/HF cache) and the Vault-sourced secret refs. The app's setup-written .env on the PVC is now authoritative, so processing runs and tags all documents. Qwen3 thinking is disabled via SYSTEM_PROMPT=/no_think in that .env to keep the model's JSON output parseable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 06:41:29 +00:00
Viktor Barzin	aeee0d02e2	paperless-ai: deploy clusterzx/paperless-ai for semantic doc search + AI tagging Viktor wanted real semantic search over his ~300 Paperless documents and preferred a ready-made solution over building one. paperless-ai provides local-embedding RAG (ChromaDB + sentence-transformers, GPU-free) plus LLM-driven auto-analysis/tagging. Wiring: - LLM (chat answers + tagging) -> in-cluster llama-swap qwen3-8b (OpenAI-compatible); embeddings + vector store are local on the PVC. - Reads Paperless over the internal service via a dedicated `paperless-ai` superuser token (Vault secret/paperless-ai); app-admin creds also in Vault. - Encrypted PVC for /app/data (SQLite + ChromaDB + model cache). - Ingress paperless-ai.viktorbarzin.me behind Authentik (auth=required). - Third-party image pinned (docker.io/clusterzx/paperless-ai:3.0.9), no Keel. Runtime config persists to the PVC .env via the app's one-time setup; the deployment env vars are pre-fill/documentation only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 06:23:00 +00:00
Viktor Barzin	605cf99a1b	portal-tts: docker.io/ prefix on edge-tts image (Kyverno trusted-registries) The edge-tts apply was blocked by the require-trusted-registries Kyverno policy — a bare `travisvn/openai-edge-tts` isn't in the allowlist. The policy blanket-trusts `docker.io/*`, so prefixing the image with `docker.io/` passes admission with no policy change. Verified live: bg synth round-trips through Whisper verbatim and a full gateway /v1/talk bg turn returns a coherent spoken Bulgarian reply ("Добър ден! Добре съм, благодаря!..."). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 21:24:34 +00:00
Viktor Barzin	ab55cb5dcd	portal-stt: drop setup_tls_secret module (ClusterIP-only, no fullchain.pem) The landed portal-stt source still declared the setup_tls_secret module + tls_secret_name variable, which file()-reads secrets/fullchain.pem — a file this stack does not ship. portal-stt is ClusterIP-only (no ingress; the Gateway is the sole externally-exposed component, ADR-0001), so it needs no TLS secret. The live deployment never had it (removed during the original apply); this aligns the source with reality so CI applies cleanly. Fixes the pipeline-229 portal-stt apply failure. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 20:29:31 +00:00
Viktor Barzin	e7b9a74756	portal-assistant: land voice stacks + switch TTS to edge-tts (intelligible Bulgarian) The portal-assistant voice-assistant stacks (portal-tts, portal-stt, portal-assistant) were applied to the live cluster from feature branches but never landed on master — the GitOps source of truth. This lands all three and, in portal-tts, fixes Bulgarian speech. Bulgarian was unintelligible: the local Piper voice (bg_BG-dimitar-medium via espeak-ng) mangles Bulgarian consonants — a synth->Whisper round-trip turned "Добър ден" into "Обърден", and a user heard pure gibberish. English was fine. portal-tts now runs openai-edge-tts (Microsoft edge-tts neural voices) for BOTH languages instead of Piper — ADR-0003 always named edge-tts as the online Bulgarian-quality fallback. Validated before landing: edge bg round-trips through Whisper verbatim ("Добър ден! Как сте днес? ..."). The gateway maps detected language bg/en to the edge voice names via new TTS_VOICE_BG / TTS_VOICE_EN env (bg-BG-KalinaNeural / en-US-AvaNeural). No GPU, no NFS model store, no secrets — edge fetches voices from Microsoft per request (egress verified). The assistant already needs the internet for the Claude brain, so an online TTS adds no new failure mode. The brain stays Sonnet with no extended thinking (already the default — a live turn answers directly in ~3.4s), per the latency-over-smartness ask. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 20:25:29 +00:00
Viktor Barzin	677a181d49	reverse-proxy: dedicated rate limit for ha-london; bump ha-sofia (cold-client 429s) New, empty-cache clients (the repurposed Meta Portal running the HA companion app) cold-load the whole HA frontend at once - dozens of frontend_latest/*.js + MDI icon chunks. ha-london had no per-service rate limit, so it fell back to the global 10/s burst 50 and 429'd those chunks, leaving every dashboard blank (Settings, which loads less, worked). Give ha-london its own 200/500 middleware (skip_global_rate_limit, mirroring ha-sofia, with depends_on to avoid the dangling-middleware 404 window) and bump ha-sofia 100/200 -> 200/500 so a cold Portal load of Sofia doesn't hit the same wall. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 19:53:47 +00:00
Viktor Barzin	9565ff1ce5	state(infra): update encrypted state	2026-06-17 19:50:30 +00:00
Viktor Barzin	6518e54154	create-template-vm: add k8s-upgrade pipeline SSH key to node cloud-init New k8s nodes were only getting the personal `wizard` key in authorized_keys — not the automated k8s-version-upgrade pipeline's key (Vault secret/k8s-upgrade/ssh_key_pub). So a freshly provisioned node is invisible to the upgrade chain (it SSHes in as `wizard` to drain+upgrade): node4/5/6 all hit "Permission denied (publickey)" on 2026-06-17 and had to have the key pushed by hand. Bake the public key into the cloud-init template so every new node gets it on first boot. (unattended-upgrades is already in this template — node4/node5 missed it only because the LIVE PVE cloud-init snippet lagged this source: it deploys via a Tier-0 `stacks/infra` apply that hadn't run since before their 2026-05-26 provision. Same lesson applies to THIS change — it reaches new nodes only after `stacks/infra` is applied to refresh the snippet on the PVE host.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 18:59:59 +00:00
Viktor Barzin	aac7121ccc	t3-afk: scale to 0 — park the in-cluster T3 AFK executor (no current plans) Viktor has no near-term plans to use the autonomous AFK pipeline's in-cluster T3 cockpit/executor, so stop its pod to free node resources while keeping it trivially revivable. Only the deployment replica count changes (1 -> 0); the SSD PVC (state.sqlite + repo checkouts), Service, Ingress, and ExternalSecret are all left in place — reviving is just setting replicas back to 1 and applying. Already applied live via scripts/tg (PG state now 0 replicas, pod terminated); this commit syncs git so drift-detection / the next apply won't re-scale it up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 18:55:35 +00:00
Viktor Barzin	b931d9fb20	k8s-version-upgrade: make tigera-operator restore crash-safe (EXIT trap) phase_master quiesces tigera-operator (Calico's config reconciler) to 0 around the master upgrade so it can't crashloop during the apiserver blip + I/O-storm kubeadm's static-pod-hash watch (which would roll the upgrade back). The restore was a plain line at the end of the phase, so any abort AFTER quiescing left the operator at 0 — and the idempotent retry then skipped the already-on-target master phase and never restored it. Observed 2026-06-17: a post-upgrade gate aborted the master attempt; the operator sat scaled to 0 for ~1.5h (data plane fine — calico-node keeps running — but no Calico reconciliation). Fix: - Drain first (drain doesn't blip the apiserver), THEN quiesce right before `kubeadm upgrade apply`, and install an EXIT trap that restores the operator no matter how the phase exits (gate abort, set -e on ssh/kubeadm, success). Trap is set AFTER drain_node so its own EXIT trap can't clobber it; cleared after the explicit happy-path restore. - postflight also force-restores replicas=1 as a final guarantee (covers the skip-on-retry path that never quiesces or restores). Long-term fix remains HA control plane (apiserver never goes down) — bead code-n0ow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 18:25:54 +00:00
Viktor Barzin	c04efa3d3a	k8s-version-upgrade: move detection to nightly 23:00 UTC (overnight upgrades) Disruptive node drains should run when the cluster is idle. Move the k8s-version-check detection CronJob from 12:00 UTC (noon) to 23:00 UTC (00:00 London) — overnight, low usage, and clear of the kured OS-reboot window (01:00-05:00 UTC) so the two drain pipelines never overlap. (Viktor, 2026-06-17.) - stacks/k8s-version-upgrade/main.tf: var.schedule default 0 12 → 0 23 * * *. - scripts/upgrade_state.sh: next_scheduled_run_utc now computes the 23:00 slot (was next_daily_noon_utc). - docs (runbook, architecture) + upgrade-state SKILL: schedule references updated to 23:00 UTC nightly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 18:16:32 +00:00
Viktor Barzin	ed53b34bf4	k8s-version-upgrade: dynamic worker enumeration + IP-based SSH (auto-cover all/new nodes) The chain hardcoded master→node4→node3→node2→node1→postflight and SSHed by FQDN. It silently SKIPPED node5/node6 (added 2026-05-26) — postflight would have failed even if reachable — and node5/node6 had no .viktorbarzin.lan DNS records, so the chain couldn't SSH to them at all. Refactor (upgrade-step.sh): - Worker set + order derived live from `kubectl get nodes` (worker_nodes / next_pending_worker), so EVERY worker still off-target is upgraded and a newly-joined node is covered with zero script change. - SSH targets are node InternalIPs (ssh_target), removing the dependency on node DNS records entirely — a new node is reachable the moment it joins. - The two remaining hardcoded loops (containerd skew, apt-repo rewrite) now enumerate workers/all-nodes dynamically too. - Topology preserved: master-drain Job runs on the first worker; every worker-drain Job runs on the already-upgraded k8s-master (self-preemption invariant intact). - next_pending_worker returns 0 explicitly on the no-match path — the `while read … done < <(…)` loop exits 1 at EOF, which under set -e would abort the LAST worker's Job before it spawns postflight (cluster upgraded but no cleanup / in_flight reset). Caught in review. Docs (runbook + architecture + headers) updated to the dynamic topology. NOTE: nodes still need the k8s-upgrade SSH public key in authorized_keys; it was deployed to node4/5/6 by hand this session. Baking it into node provisioning (so new nodes get it automatically) is the remaining follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 16:56:02 +00:00
Viktor Barzin	0c5a9b5f44	k8s-version-upgrade: grant pods/log so preflight can verify the etcd snapshot Preflight step 6 confirms the pre-upgrade etcd snapshot is non-empty by parsing the backup Job's log (`kubectl -n default logs job/pre-upgrade-etcd-...`). The k8s-upgrade-job ClusterRole granted `pods` get/list/delete but NOT the `pods/log` subresource, so the read failed with Forbidden in the default ns and aborted preflight — after step 5 had already set k8s_upgrade_in_flight=1. A stale out-of-band grant had masked this until a `terragrunt apply` in this session reconciled the role back to its TF definition. Codify pods/log:get. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:52:52 +00:00
Viktor Barzin	bfb86e653f	k8s-version-upgrade: ignore CoreDNS preflight on `kubeadm upgrade plan` too The prior commit added the CoreDNS ignore/skip flags only to `kubeadm upgrade apply`, but `kubeadm upgrade plan` runs the SAME CoreDNS preflight. Once master's kubeadm binary is on the target version (the first attempt's apt step already bumps it), both plan calls fail on the Keel-drifted CoreDNS 1.12.4 under set -euo pipefail and abort: - preflight Job step 4 (upgrade-step.sh) — `plan` output is grepped for the target version; the failing pipeline killed the whole preflight. - update_k8s.sh master path line 85 — bare `plan` before the apply. Both now pass --ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins. Verified read-only on master: plan exits 0 and still emits "kubeadm upgrade apply v1.34.9". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:49:06 +00:00
Viktor Barzin	037a609f27	k8s-version-upgrade: unblock 1.34.9 — skip kubeadm CoreDNS addon + busybox-date fix The 1.34.9 master upgrade hard-failed `kubeadm upgrade apply` preflight: CoreDNS is at v1.12.4 (Keel auto-bumped it 1.12.1 -> 1.12.4 on 2026-05-26 via a stale kube-system out-of-band annotation), and 1.12.4 is ahead of kubeadm 1.34.9's bundled corefile-migration table ("start version not supported"). - scripts/update_k8s.sh: master `kubeadm upgrade apply` now runs with `--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins --skip-phases=addon/coredns`. A dry-run proved --ignore ALONE would overwrite our custom split-horizon Corefile with kubeadm's default AND downgrade the image; --skip-phases leaves CoreDNS 100% untouched while the control plane upgrades. CoreDNS is pinned off Keel (keel.sh/policy=never) to stop the drift. - stacks/k8s-version-upgrade/scripts/upgrade-step.sh: fix the preflight quiet-baseline (settle-window) check, which silently no-op'd on the ghcr claude-agent-service image's busybox `date` (can't parse ISO8601). Now tries GNU then busybox `-D`, and warns+skips on parse failure (no silent fail-open). - docs: runbook + architecture document the CoreDNS handling. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:45:05 +00:00
Viktor Barzin	042d1ce1ac	k8s-version-upgrade: CI-retrigger to apply D1 (missed by two-commit diff-base) All checks were successful ci/woodpecker/push/default Pipeline was successful Details `fb638cd8` landed as two commits; the apply pipeline diffed against HEAD~1 (the monitoring-only commit) and never applied stacks/k8s-version-upgrade, so the retry-on-failure logic isn't live yet. This single-commit retrigger forces it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:28:58 +00:00
Viktor Barzin	fb638cd8ec	k8s-version-upgrade: scope chain-fail alert to terminal reasons + sync docs Some checks failed ci/woodpecker/push/default Pipeline failed Details Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to the terminal job-condition reasons (BackoffLimitExceeded\|DeadlineExceeded). A phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every firing alert also halts kured, so a bare-count false-positive would block all OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics: the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0 for the terminal reasons. Docs updated to match the behaviour change (per the same-commit docs rule): - docs/runbooks/k8s-version-upgrade.md — new alert in the gates list; the "kill a stuck Job" recovery now leads with retry-on-failure self-heal. - docs/architecture/automated-upgrades.md — fourth Upgrade Gates alert; retry-on-failure note on the deterministic-naming paragraph. - .claude/skills/upgrade-state/SKILL.md — new "chain failed" status, legend entry, and drill-down (also copied to the active ~/.claude copy). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:10:18 +00:00
Viktor Barzin	dfa1a12a86	k8s-version-upgrade: retry failed phases + surface wedged chain (fix 5-day silent stall) The 1.34.9 patch auto-upgrade sat stuck for 5 days without anyone knowing. On 2026-06-12 a transient critical alert (the ttyd web-terminal probe on the devvm) was firing when the daily detection ran; the preflight's "halt on any critical alert" gate aborted it, so the preflight Job Failed (backoffLimit=1). Two design gaps then turned that blip into a multi-day wedge: * the detection guard and spawn_next only checked whether the phase Job EXISTED, not whether it succeeded — and the Failed Job lingers 7 days via ttlSecondsAfterFinished, so every daily run skipped re-spawning it; * the abort happens before the in-flight metric is pushed, so neither K8sUpgradeStalled nor upgrade_state.sh could see it — the pipeline reported "never ran" while actually being stuck. Fixes: D1 retry-on-failure: detection CronJob (main.tf) and spawn_next (upgrade-step.sh) now delete + re-spawn a terminally-Failed phase Job instead of skipping it, so a transient gate self-corrects next cycle rather than wedging the pipeline for a week. D2 WebterminalTtydUnreachable critical -> warning: a devvm developer web-terminal is not cluster infrastructure and must not block upgrades. D3 observability: new K8sUpgradeChainJobFailed alert (kube_job_status_failed in k8s-upgrade ns) and upgrade_state.sh now flags a Failed chain Job as "chain failed" — closing the pre-in-flight blind spot so a wedge is visible immediately. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:07:36 +00:00
Viktor Barzin	7e7e41cbef	fix(authentik): derive username from email in tripit-enrollment (user_write needs it) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The passwordless enrollment prompt collects only email+name, so user_write aborted with 'Aborting write to empty username' (ak-stage-access-denied). Add an expression policy on the user_write binding (evaluate_on_plan=false + re_evaluate_policies=true, like guest.tf) that sets prompt_data['username'] = the entered email before the write. Verified the failure live via the flow executor API. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 07:35:23 +00:00
Viktor Barzin	e4512f3566	fix(authentik): deliver tripit email-verify stages via blueprint (provider token_expiry too old) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Pipeline 214 failed: the pinned goauthentik 2024.x provider models EmailStage.token_expiry as an integer, but the live 2026.2.x server requires a duration string ('hours=24') and 400s any number (even the provider default 30). Bumping the provider is a global terragrunt.hcl change re-applying every platform stack + breaking 3 other authentik-using stacks' lockfiles — disproportionate. Instead the two email-verification stages + their flow bindings move into an Authentik blueprint (tripit-email-stages.yaml) applied server-side via authentik_blueprint; the server parses token_expiry natively. Validated on the live server + terraform validate. Restores the ADR-0020 email-verification security gate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 07:30:05 +00:00
Viktor Barzin	89eb090be3	feat(authentik): tripit-enrollment + tripit-recovery flows (passwordless signup, ADR-0020) Some checks failed ci/woodpecker/push/default Pipeline failed Details Makes the WebLanding 'Sign up' button work (it was 404ing — the tripit-enrollment flow didn't exist). Open passwordless registration: prompt(email,name) -> user_write(INACTIVE, external, group 'TripIt External') -> email verification (activates) -> passkey -> login. The inactive-until-verified gate is the security boundary: tripit trusts X-authentik-email, so activation must require proving inbox ownership. Passwordless login already works via the built-in webauthn flow. tripit-recovery (email -> new passkey) is built but intentionally NOT wired into the global brand recovery, so admin recovery is unchanged. Schema validated with terraform validate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 07:20:11 +00:00
Viktor Barzin	4bf3f504ea	fix(authentik): SMTP host = mail.viktorbarzin.me (svc name fails wildcard-cert verify) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The in-cluster svc name mailserver.mailserver.svc.cluster.local fails Authentik's strict STARTTLS hostname verification (CERTIFICATE_VERIFY_FAILED): the mailserver serves the *.viktorbarzin.me wildcard cert, which doesn't cover the svc DNS name. Use the public name mail.viktorbarzin.me, which resolves in-cluster (10.0.20.1) and matches the cert. Verified end-to-end from an authentik pod (verified TLS + SASL auth + send) before this change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 07:13:53 +00:00
Viktor Barzin	c3d0c121bb	feat(authentik): wire SMTP (noreply@) for TripIt signup verification + recovery email (ADR-0020) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Authentik email was unconfigured (localhost), so the TripIt enrollment flow's email-verification stage couldn't send. Add AUTHENTIK_EMAIL__* to server.env + worker.env pointing at the in-cluster mailserver as noreply@viktorbarzin.me (587/STARTTLS), with the SASL password synced from Vault secret/authentik.smtp_password via a new authentik-email ExternalSecret (reloader-annotated). Image pin unchanged (2026.2.4 == live). Prereq for the tripit-enrollment flow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 07:04:52 +00:00
Viktor Barzin	8a2a3d9eca	Merge remote-tracking branch 'origin/master' into wizard/reconcile-mirror All checks were successful ci/woodpecker/push/postmortem-todos Pipeline was successful Details ci/woodpecker/push/default Pipeline was successful Details # Conflicts: # scripts/t3-provision-users.sh	2026-06-16 22:32:43 +00:00
Viktor Barzin	63e714782c	immich: remove one-shot anca-elements-import Job + its PVC All of Anca's photos are imported. The Job was declared as kubernetes_job_v1.anca_elements_import — meaning every `terragrunt apply` of the immich stack re-created it, despite the 2026-05-25 in-code comment saying "After successful completion: REMOVE this resource block + apply again." Nobody noticed for 22 days; the re-trigger today (2026-06-16) was the 6th IO-pressure incident — it scanned all 21,643 assets in pure read-scan mode for 51 min, saturated sdc, starved etcd, crash-looped kube-apiserver. Recovery actions taken before this commit: - Throttled nfsd 64→8 on PVE host to give apiserver headroom - `kubectl delete job -n immich anca-elements-import` + force-delete pod - Restored nfsd to 64; cluster healthy Code change here: - Remove `kubernetes_job_v1.anca_elements_import` block - Remove `module.nfs_anca_elements_host` (PVC `immich-anca-elements-host` — no live consumer; videos batch deferred per user, source dump remains on PVE at /srv/nfs/anca-elements, browseable via Nextcloud admin) - Update 2026-05-25 post-mortem: 6th-incident section + new lesson that one-shot Jobs do NOT belong in kubernetes_job_v1 (use a suspended CronJob or a runbook-captured `kubectl create job` ad-hoc invocation instead).	2026-06-16 22:11:27 +00:00
Viktor Barzin	88717c61fd	immich-frame: whole library (last 2y), Ken Burns, weather, 30s interval All checks were successful ci/woodpecker/push/default Pipeline was successful Details Per Viktor: show the whole Immich library from the last 2 years instead of the single 'china' album, enable Ken Burns pan/zoom, slow the interval to 30s, and add the weather overlay (London, metric). OpenWeatherMap key is read from Vault (secret/immich -> frame_weather_api_key), not hardcoded. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 21:07:39 +00:00
Viktor Barzin	cffa32fae3	Merge remote-tracking branch 'forgejo/master' into wizard/tripit-ingest-model All checks were successful ci/woodpecker/push/default Pipeline was successful Details	2026-06-16 20:39:30 +00:00
Viktor Barzin	14476bfbd7	tripit: mail-ingest extracts with the qwen3-8b text model, not the vision model Forwarded schedule-change emails were being parsed by qwen3vl-4b (a 4B vision model) for text extraction, which reliably dropped the flight number — so the matcher had no key to link on and a forwarded flight update created a duplicate instead of amending the existing segment. Point the ingest-plans CronJob's text extraction at qwen3-8b (verified live: it emits flight_number + a clean PNR, 3/3 on the failing email) and keep qwen3vl-4b for boarding-pass image attachments (LLM_VISION_MODEL). llama-swap loads each on demand; the GPU swap cost is accepted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:39:29 +00:00
Viktor Barzin	0a6ed4b2fe	workstation: per-user playwright browser MCP for all users, reproducible from git Viktor asked that the playwright browser MCP be available for every devvm user in every directory, with each user running their own server and multiple concurrent sessions per user. Before this, playwright was hand-set-up per user (~/.config/systemd/user/playwright-mcp.service on 8931/8932/8933) and only wizard was actually wired: emo's and anca's servers ran but their ~/.claude.json had no playwright entry, so their Claude never connected. None of it was reproducible from git (units, refresh script, and the Vault snapshot token lived only in user homes), so a devvm rebuild would silently lose it. This makes it reproducible and fixes the unwired users: - roster_engine.py: sticky per-user PLAYWRIGHT_PORT (PLAYWRIGHT_BASE_PORT=8931, allocated for every roster user incl. the admin), emitted in the derive JSON. - scripts/workstation/playwright/: system-level TEMPLATE units (playwright-mcp@.service + playwright-snapshot-refresh@.{service,timer}, User=%i — system manager, so no systemd --user / linger) + the refresh script. @playwright/mcp pinned to 0.0.76 (avoids the @latest silent-fleet-roll footgun, same rationale as T3_PIN). - setup-devvm.sh: install the templates + script (9e); stage the chrome-service snapshot bearer token from Vault to a root file (8c) — the hourly root reconcile has no Vault token, mirrors the Claude OAuth staging in 8a. - t3-provision-users.sh: install_playwright() (ALL tiers incl. admin) writes PLAYWRIGHT_PORT, seeds the token if-absent, wires the user-scope ~/.claude.json by running `claude mcp add` AS the user (clobber-proof + if-absent, so it fixes existing/new/admin without rewriting a populated config), and enable --now's the instances (idempotent, never restarts a running server). Also hardened the section-1 .env scan to skip the new playwright-.env files (no T3_PORT -> grep no-match would abort under set -e -o pipefail). - Docs: chrome-service-snapshot runbook (new Provisioning section + system-unit commands), multi-tenancy.md, and the 2026-06-07 plan Task 2.3. Supersedes the hand-made per-user --user units (one-time idle-gated migration to follow on the live host). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:33:47 +00:00
Viktor Barzin	c6a5cbe227	feat(tripit): serve the SPA publicly, keep /api + /metrics forward-auth-gated (ADR-0020 landing) Some checks failed ci/woodpecker/push/default Pipeline was canceled Details The website 302'd unauthenticated visitors straight to Authentik. Split the tripit.viktorbarzin.me ingress: the SPA shell (everything else) becomes auth=none so the app shows its own Log in / Sign up landing page, while a new tripit-app-api ingress keeps /api + /metrics behind forward-auth — the security boundary, since /api trusts the outpost-injected X-authentik-email. The public SPA gets strip-auth-headers (no spoofed headers can reach the backend) and anti_ai_scraping=false (it's an installable PWA). The existing auth=none carve-outs (calendar, emails/confirm, planner/slack) are longer prefixes and keep winning. Pairs with the tripit landing-page deploy (commit 3fe4da1). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 19:30:58 +00:00
github-actions[bot]	eb47eb1d10	priority-pass: bump image_tag to 63e118c3 [ci skip] Auto-committed by ViktorBarzin/priority-pass GHA on push to main. Source: `63e118c334`	2026-06-16 17:45:33 +00:00
github-actions[bot]	d1f2e50736	priority-pass: bump image_tag to 4ce9e8e8 [ci skip] Auto-committed by ViktorBarzin/priority-pass GHA on push to main. Source: `4ce9e8e894`	2026-06-16 17:44:40 +00:00
github-actions[bot]	46b5f04f67	priority-pass: bump image_tag to 63e118c3 [ci skip] Auto-committed by ViktorBarzin/priority-pass GHA on push to main. Source: `63e118c334`	2026-06-16 17:20:08 +00:00
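
The retry-on-failure behaviour described in dfa1a12a86 (D1) and the terminal-reason scoping in fb638cd8ec can be sketched as a small decision function. This is a minimal illustration only, not the code in main.tf or upgrade-step.sh: the helper names `is_terminally_failed` and `decide_phase_action` and the return labels are hypothetical. It assumes the Job is passed as the parsed JSON of a Kubernetes Job object, for example from `kubectl get job -o json`, and that a missing Job is represented as `None`.

```python
from typing import Optional

# Terminal failure reasons named in commit fb638cd8ec.
TERMINAL_REASONS = {"BackoffLimitExceeded", "DeadlineExceeded"}


def is_terminally_failed(job: dict) -> bool:
    """Return True if the Job carries a Failed condition with a terminal reason."""
    conditions = job.get("status", {}).get("conditions") or []
    for cond in conditions:
        if (
            cond.get("type") == "Failed"
            and cond.get("status") == "True"
            and cond.get("reason") in TERMINAL_REASONS
        ):
            return True
    return False


def decide_phase_action(job: Optional[dict]) -> str:
    """Hypothetical helper: choose what a detection run does with a phase Job.

    "spawn"   - no Job exists yet, so create it.
    "respawn" - the Job failed terminally, so delete it and create it again
                instead of waiting out ttlSecondsAfterFinished.
    "skip"    - the Job is running or complete, so leave it alone.
    """
    if job is None:
        return "spawn"
    if is_terminally_failed(job):
        return "respawn"
    return "skip"
```

The pre-fix behaviour corresponds to returning "skip" for any existing Job, which is why a Failed Job lingering for its 7-day TTL blocked every daily run.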


4379 commits