Commit graph

1443 commits

Author SHA1 Message Date
Viktor Barzin
0c5a9b5f44 k8s-version-upgrade: grant pods/log so preflight can verify the etcd snapshot
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Preflight step 6 confirms the pre-upgrade etcd snapshot is non-empty by parsing
the backup Job's log (`kubectl -n default logs job/pre-upgrade-etcd-...`). The
k8s-upgrade-job ClusterRole granted `pods` get/list/delete but NOT the `pods/log`
subresource, so the read failed with Forbidden in the default ns and aborted
preflight — after step 5 had already set k8s_upgrade_in_flight=1. A stale
out-of-band grant had masked this until a `terragrunt apply` in this session
reconciled the role back to its TF definition. Codify pods/log:get.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:52:52 +00:00
Viktor Barzin
bfb86e653f k8s-version-upgrade: ignore CoreDNS preflight on kubeadm upgrade plan too
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The prior commit added the CoreDNS ignore/skip flags only to `kubeadm upgrade
apply`, but `kubeadm upgrade plan` runs the SAME CoreDNS preflight. Once master's
kubeadm binary is on the target version (the first attempt's apt step already
bumps it), both plan calls fail on the Keel-drifted CoreDNS 1.12.4 under
set -euo pipefail and abort:
  - preflight Job step 4 (upgrade-step.sh) — `plan` output is grepped for the
    target version; the failing pipeline killed the whole preflight.
  - update_k8s.sh master path line 85 — bare `plan` before the apply.

Both now pass --ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins.
Verified read-only on master: plan exits 0 and still emits
"kubeadm upgrade apply v1.34.9".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:49:06 +00:00
Viktor Barzin
037a609f27 k8s-version-upgrade: unblock 1.34.9 — skip kubeadm CoreDNS addon + busybox-date fix
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The 1.34.9 master upgrade hard-failed `kubeadm upgrade apply` preflight: CoreDNS
is at v1.12.4 (Keel auto-bumped it 1.12.1 -> 1.12.4 on 2026-05-26 via a stale
kube-system out-of-band annotation), and 1.12.4 is ahead of kubeadm 1.34.9's
bundled corefile-migration table ("start version not supported").

- scripts/update_k8s.sh: master `kubeadm upgrade apply` now runs with
  `--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins
  --skip-phases=addon/coredns`. A dry-run proved --ignore ALONE would overwrite
  our custom split-horizon Corefile with kubeadm's default AND downgrade the
  image; --skip-phases leaves CoreDNS 100% untouched while the control plane
  upgrades. CoreDNS is pinned off Keel (keel.sh/policy=never) to stop the drift.
- stacks/k8s-version-upgrade/scripts/upgrade-step.sh: fix the preflight
  quiet-baseline (settle-window) check, which silently no-op'd on the ghcr
  claude-agent-service image's busybox `date` (can't parse ISO8601). Now tries
  GNU then busybox `-D`, and warns+skips on parse failure (no silent fail-open).
- docs: runbook + architecture document the CoreDNS handling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:45:05 +00:00
Viktor Barzin
042d1ce1ac k8s-version-upgrade: CI-retrigger to apply D1 (missed by two-commit diff-base)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
fb638cd8 landed as two commits; the apply pipeline diffed against HEAD~1 (the
monitoring-only commit) and never applied stacks/k8s-version-upgrade, so the
retry-on-failure logic isn't live yet. This single-commit retrigger forces it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:28:58 +00:00
Viktor Barzin
fb638cd8ec k8s-version-upgrade: scope chain-fail alert to terminal reasons + sync docs
Some checks failed
ci/woodpecker/push/default Pipeline failed
Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to
the terminal job-condition reasons (BackoffLimitExceeded|DeadlineExceeded). A
phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every
firing alert also halts kured, so a bare-count false-positive would block all
OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics:
the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0
for the terminal reasons.

Docs updated to match the behaviour change (per the same-commit docs rule):
  - docs/runbooks/k8s-version-upgrade.md — new alert in the gates list; the
    "kill a stuck Job" recovery now leads with retry-on-failure self-heal.
  - docs/architecture/automated-upgrades.md — fourth Upgrade Gates alert;
    retry-on-failure note on the deterministic-naming paragraph.
  - .claude/skills/upgrade-state/SKILL.md — new "chain failed" status, legend
    entry, and drill-down (also copied to the active ~/.claude copy).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:10:18 +00:00
Viktor Barzin
dfa1a12a86 k8s-version-upgrade: retry failed phases + surface wedged chain (fix 5-day silent stall)
The 1.34.9 patch auto-upgrade sat stuck for 5 days without anyone knowing.
On 2026-06-12 a transient critical alert (the ttyd web-terminal probe on the
devvm) was firing when the daily detection ran; the preflight's "halt on any
critical alert" gate aborted it, so the preflight Job Failed (backoffLimit=1).
Two design gaps then turned that blip into a multi-day wedge:

  * the detection guard and spawn_next only checked whether the phase Job
    EXISTED, not whether it succeeded — and the Failed Job lingers 7 days via
    ttlSecondsAfterFinished, so every daily run skipped re-spawning it;
  * the abort happens before the in-flight metric is pushed, so neither
    K8sUpgradeStalled nor upgrade_state.sh could see it — the pipeline reported
    "never ran" while actually being stuck.

Fixes:
  D1 retry-on-failure: detection CronJob (main.tf) and spawn_next
     (upgrade-step.sh) now delete + re-spawn a terminally-Failed phase Job
     instead of skipping it, so a transient gate self-corrects next cycle
     rather than wedging the pipeline for a week.
  D2 WebterminalTtydUnreachable critical -> warning: a devvm developer
     web-terminal is not cluster infrastructure and must not block upgrades.
  D3 observability: new K8sUpgradeChainJobFailed alert
     (kube_job_status_failed in k8s-upgrade ns) and upgrade_state.sh now flags
     a Failed chain Job as "chain failed" — closing the pre-in-flight blind
     spot so a wedge is visible immediately.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:07:36 +00:00
Viktor Barzin
7e7e41cbef fix(authentik): derive username from email in tripit-enrollment (user_write needs it)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The passwordless enrollment prompt collects only email+name, so user_write aborted with 'Aborting write to empty username' (ak-stage-access-denied). Add an expression policy on the user_write binding (evaluate_on_plan=false + re_evaluate_policies=true, like guest.tf) that sets prompt_data['username'] = the entered email before the write. Verified the failure live via the flow executor API.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 07:35:23 +00:00
Viktor Barzin
e4512f3566 fix(authentik): deliver tripit email-verify stages via blueprint (provider token_expiry too old)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Pipeline 214 failed: the pinned goauthentik 2024.x provider models EmailStage.token_expiry as an integer, but the live 2026.2.x server requires a duration string ('hours=24') and 400s any number (even the provider default 30). Bumping the provider is a global terragrunt.hcl change re-applying every platform stack + breaking 3 other authentik-using stacks' lockfiles — disproportionate. Instead the two email-verification stages + their flow bindings move into an Authentik blueprint (tripit-email-stages.yaml) applied server-side via authentik_blueprint; the server parses token_expiry natively. Validated on the live server + terraform validate. Restores the ADR-0020 email-verification security gate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 07:30:05 +00:00
Viktor Barzin
89eb090be3 feat(authentik): tripit-enrollment + tripit-recovery flows (passwordless signup, ADR-0020)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Makes the WebLanding 'Sign up' button work (it was 404ing — the tripit-enrollment flow didn't exist). Open passwordless registration: prompt(email,name) -> user_write(INACTIVE, external, group 'TripIt External') -> email verification (activates) -> passkey -> login. The inactive-until-verified gate is the security boundary: tripit trusts X-authentik-email, so activation must require proving inbox ownership. Passwordless login already works via the built-in webauthn flow. tripit-recovery (email -> new passkey) is built but intentionally NOT wired into the global brand recovery, so admin recovery is unchanged. Schema validated with terraform validate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 07:20:11 +00:00
Viktor Barzin
4bf3f504ea fix(authentik): SMTP host = mail.viktorbarzin.me (svc name fails wildcard-cert verify)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The in-cluster svc name mailserver.mailserver.svc.cluster.local fails Authentik's strict STARTTLS hostname verification (CERTIFICATE_VERIFY_FAILED): the mailserver serves the *.viktorbarzin.me wildcard cert, which doesn't cover the svc DNS name. Use the public name mail.viktorbarzin.me, which resolves in-cluster (10.0.20.1) and matches the cert. Verified end-to-end from an authentik pod (verified TLS + SASL auth + send) before this change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 07:13:53 +00:00
Viktor Barzin
c3d0c121bb feat(authentik): wire SMTP (noreply@) for TripIt signup verification + recovery email (ADR-0020)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Authentik email was unconfigured (localhost), so the TripIt enrollment flow's email-verification stage couldn't send. Add AUTHENTIK_EMAIL__* to server.env + worker.env pointing at the in-cluster mailserver as noreply@viktorbarzin.me (587/STARTTLS), with the SASL password synced from Vault secret/authentik.smtp_password via a new authentik-email ExternalSecret (reloader-annotated). Image pin unchanged (2026.2.4 == live). Prereq for the tripit-enrollment flow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 07:04:52 +00:00
Viktor Barzin
8a2a3d9eca Merge remote-tracking branch 'origin/master' into wizard/reconcile-mirror
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
# Conflicts:
#	scripts/t3-provision-users.sh
2026-06-16 22:32:43 +00:00
Viktor Barzin
63e714782c immich: remove one-shot anca-elements-import Job + its PVC
All of Anca's photos are imported. The Job was declared as
kubernetes_job_v1.anca_elements_import — meaning every `terragrunt apply` of
the immich stack re-created it, despite the 2026-05-25 in-code comment saying
"After successful completion: REMOVE this resource block + apply again."
Nobody noticed for 22 days; the re-trigger today (2026-06-16) was the
6th IO-pressure incident — it scanned all 21,643 assets in pure read-scan
mode for 51 min, saturated sdc, starved etcd, crash-looped kube-apiserver.

Recovery actions taken before this commit:
  - Throttled nfsd 64→8 on PVE host to give apiserver headroom
  - `kubectl delete job -n immich anca-elements-import` + force-delete pod
  - Restored nfsd to 64; cluster healthy

Code change here:
  - Remove `kubernetes_job_v1.anca_elements_import` block
  - Remove `module.nfs_anca_elements_host` (PVC `immich-anca-elements-host` —
    no live consumer; videos batch deferred per user, source dump remains on
    PVE at /srv/nfs/anca-elements, browseable via Nextcloud admin)
  - Update 2026-05-25 post-mortem: 6th-incident section + new lesson that
    one-shot Jobs do NOT belong in kubernetes_job_v1 (use a suspended CronJob
    or a runbook-captured `kubectl create job` ad-hoc invocation instead).
2026-06-16 22:11:27 +00:00
Viktor Barzin
88717c61fd immich-frame: whole library (last 2y), Ken Burns, weather, 30s interval
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Per Viktor: show the whole Immich library from the last 2 years instead of the
single 'china' album, enable Ken Burns pan/zoom, slow the interval to 30s, and
add the weather overlay (London, metric). OpenWeatherMap key is read from Vault
(secret/immich -> frame_weather_api_key), not hardcoded.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 21:07:39 +00:00
Viktor Barzin
cffa32fae3 Merge remote-tracking branch 'forgejo/master' into wizard/tripit-ingest-model
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-16 20:39:30 +00:00
Viktor Barzin
14476bfbd7 tripit: mail-ingest extracts with the qwen3-8b text model, not the vision model
Forwarded schedule-change emails were being parsed by qwen3vl-4b (a 4B *vision*
model) for text extraction, which reliably dropped the flight number — so the
matcher had no key to link on and a forwarded flight update created a duplicate
instead of amending the existing segment.

Point the ingest-plans CronJob's text extraction at qwen3-8b (verified live: it
emits flight_number + a clean PNR, 3/3 on the failing email) and keep qwen3vl-4b
for boarding-pass image attachments (LLM_VISION_MODEL). llama-swap loads each on
demand; the GPU swap cost is accepted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:39:29 +00:00
Viktor Barzin
c6a5cbe227 feat(tripit): serve the SPA publicly, keep /api + /metrics forward-auth-gated (ADR-0020 landing)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
The website 302'd unauthenticated visitors straight to Authentik. Split the tripit.viktorbarzin.me ingress: the SPA shell (everything else) becomes auth=none so the app shows its own Log in / Sign up landing page, while a new tripit-app-api ingress keeps /api + /metrics behind forward-auth — the security boundary, since /api trusts the outpost-injected X-authentik-email. The public SPA gets strip-auth-headers (no spoofed headers can reach the backend) and anti_ai_scraping=false (it's an installable PWA). The existing auth=none carve-outs (calendar, emails/confirm, planner/slack) are longer prefixes and keep winning. Pairs with the tripit landing-page deploy (commit 3fe4da1).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:30:58 +00:00
github-actions[bot]
eb47eb1d10 priority-pass: bump image_tag to 63e118c3 [ci skip]
Auto-committed by ViktorBarzin/priority-pass GHA on push to main.
Source: 63e118c334
2026-06-16 17:45:33 +00:00
github-actions[bot]
d1f2e50736 priority-pass: bump image_tag to 4ce9e8e8 [ci skip]
Auto-committed by ViktorBarzin/priority-pass GHA on push to main.
Source: 4ce9e8e894
2026-06-16 17:44:40 +00:00
github-actions[bot]
46b5f04f67 priority-pass: bump image_tag to 63e118c3 [ci skip]
Auto-committed by ViktorBarzin/priority-pass GHA on push to main.
Source: 63e118c334
2026-06-16 17:20:08 +00:00
github-actions[bot]
29ad200026 priority-pass: bump image_tag to 4ce9e8e8 [ci skip]
Auto-committed by ViktorBarzin/priority-pass GHA on push to main.
Source: 4ce9e8e894
2026-06-16 17:19:55 +00:00
Viktor Barzin
f4f7705127 monitoring: adopt orphaned alert-digest resources into TF state (unblocks apply)
The monitoring stack apply was create-failing on every push with `configmaps
"alert-digest-script" already exists` + `secrets "alert-digest" already exists`
(modules/monitoring/alert_digest.tf) — both resources exist in-cluster but fell
out of Terraform state, so apply tried to CREATE them and errored. Pre-existing
(failed on pipelines 203 AND 204, NOT caused by the t3 alert-rules change). Add
import {} blocks (TF 1.5+ adoption per AGENTS.md) so apply imports + reconciles
instead of failing. Idempotent once imported; safe to remove after a green apply.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 11:31:17 +00:00
Viktor Barzin
994d305d04 t3: session-auth detection for the gated nightly tracker (dispatch fallback logging + Loki alerts)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Before auto-tracking t3 nightly builds (Viktor's call, risk accepted), stand up
the detection that was missing on 2026-06-09 — when an auto-pulled nightly broke
pairing for ALL users and nothing alerted. Viktor's explicit requirement: make
sure session auth keeps working and revert if the pairing fallback/failure rate
climbs. This is phase 0 (detection) of that work.

- t3-dispatch: exchangeCredential now reports WHICH pairing endpoint answered,
  and autoPair logs every outcome (paired user=.. endpoint=.. fallback=..) — so
  the real-user browser-session->bootstrap fallback rate is observable. A
  non-zero rate flags that a build moved the pairing API (the 2026-06-09 class).
- Loki ruler alerts (devvm journal -> Alertmanager -> Slack): T3PairingBroken
  (real users failing to pair), T3PairFallbackHigh (build moved the pairing API),
  T3AutoUpdateRolledBack / RollbackFailed / Frozen (enforcer outcomes). Closes
  the post-mortem's open "nothing monitors end-to-end pairing" detection gap.

The existing t3-probe only checks GET /api/auth/session==200, which stays 200
even when pairing is dead, so it never caught the outage class.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:56:55 +00:00
Viktor Barzin
e783cae2cb chrome-service + mam-farming: doc clarifications (+ re-trigger CI apply missed earlier)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Two small doc additions that also re-include these stacks in Woodpecker's
changed-stack detection. The earlier 2-commit push left chrome-service out of the
HEAD~1..HEAD diff so its ignore_changes fix never applied; the monitoring apply was
separately blocked by a stuck prometheus pending-upgrade (now cleared).

- chrome-service: note the live pod's container order had drifted from this file's
  order, so a TF apply reorders them (containers[0] differs live-vs-TF until the
  apply lands) -- documents the confusion this caused during diagnosis.
- mam-farming: cross-ref the grabber script that emits mam_grabber_last_run_timestamp.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:34:23 +00:00
Viktor Barzin
2479560fa2 mam-farming: make MAMFarmingStuck a grabber heartbeat, not a grab-count check
Some checks failed
ci/woodpecker/push/default Pipeline failed
MAMFarmingStuck fired whenever the freeleech grabber added 0 torrents in 4h, but
grabbing 0 is normal: the grabber searches a random catalogue offset each run and
legitimately finds nothing when freeleech is dry (account ratio was a healthy
37.5; the alert even misreported it as "0.00" because $value was the grabbed
count, not the ratio). The alert's real intent was to catch the grabber not
running at all (CronJob Forbid-blocked / wedged), but increase(grabbed[4h])==0
cannot distinguish "didn't run" from "ran, nothing to grab" since Pushgateway
serves the last pushed value forever.

The grabber now heartbeats mam_grabber_last_run_timestamp on every completed run
(main success, ratio/mouse skip, and qBittorrent-unreachable paths). The alert
fires only when that heartbeat is >4h stale — the true stuck condition. Cookie
expiry and qBittorrent-down keep their own dedicated alerts.

Surfaced by /cluster-health as a false-firing alert.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 08:18:33 +00:00
Viktor Barzin
a0725ede57 chrome-service: stop ignoring container[0].image so TF re-asserts the pinned browser image
The chrome-service container (container[0]) runs the pinned Microsoft Playwright
image, which ships chromium under /ms-playwright. Its image was still listed in
the deployment's lifecycle ignore_changes — a leftover KEEL_IGNORE from before
ADR-0002 #29 moved the novnc container to TF management. With that field ignored,
a stray clobber of container[0] to ghcr chrome-service-novnc:latest (which has no
chromium there) stuck permanently: the container crash-looped ~12h on "chromium
binary not found under /ms-playwright" (273 restarts) and TF could not revert it.

Remove container[0].image from ignore_changes so Terraform pins it to local.image
and re-asserts it on every apply. Both containers are TF-managed now (novnc since
ADR-0002 #29); Keel is inert (policy=never), so nothing should fight TF here.

Surfaced by /cluster-health. Live state was already restored transiently via
kubectl set image; this commit makes the fix durable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 08:18:32 +00:00
Viktor Barzin
57d45d8d8f fix(authentik): pin Vault binding UUIDs as literals (provider has no authentik_application data source)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
CI pipeline 198 failed: the pinned goauthentik/authentik provider has no data "authentik_application" source, so terraform failed the whole authentik plan and applied NOTHING (state unchanged). Replace the data-source lookups with the live pbm_uuid (Vault app) and group_uuid (Allow Login Users) as literals; authentik_policy_binding is supported (used in guest.tf).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 22:01:29 +00:00
Viktor Barzin
aa461b95bc feat(authentik): bind Vault OIDC app to Allow Login Users (close ADR-0020 OIDC gap)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Audit found the Vault Authentik application had no authorization binding, so any authenticated identity (incl. a future self-enrolled TripIt External user) could complete Vault OIDC login and get a built-in default-policy token. Bind it to 'Allow Login Users' — existing homelab users inherit that group via its children (verified User.all_groups() includes the parent), parentless TripIt External users are excluded. Closes the only OIDC app the forward-auth fence does not cover.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 21:48:04 +00:00
Viktor Barzin
cbca281aaa feat(authentik): TripIt external self-signup group + forward-auth fence (ADR-0020)
Viktor wants people outside the homelab to self-register to TripIt with email + a passkey (no password), kept separate from the rest of the homelab. Adds the empty, parentless 'TripIt External' Authentik group and a first-position branch in the catch-all policy that admits those users to tripit.viktorbarzin.me only and denies every other forward-auth host. Inert on apply (group empty => matches no existing user => no lockout). An adversarial review found the fence is forward-auth-only, so the runbook records the OIDC-app containment audit (every sensitive app already requires a trusted group External users won't hold), the Vault->Allow Login Users binding that closes the one open OIDC app, the SMTP prerequisite for email verification, and the before/after access-matrix verification. Flows/SMTP/Vault binding are UI steps per the runbook; the push that applies the catch-all edit must be human-watched (CI auto-applies the authentik stack).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 21:48:04 +00:00
Viktor Barzin
5d3a166b94 t3-afk: fix agent Bash — stop mounting into ~/.claude
Some checks failed
ci/woodpecker/push/default Pipeline failed
Root cause of "the agent never commits": the issue-implementer CLAUDE.md was
subPath-mounted at /home/node/.claude/CLAUDE.md, which made /home/node/.claude
root-owned. The agent (uid 1000) then couldn't create its Bash session-env
there, so EVERY Bash/git call failed (Write/Edit worked, so it silently edited
but never committed). Found by reading the agent transcripts from
state.sqlite -> projection_thread_messages.

Fix: don't mount anything into ~/.claude (it's not honored by T3's SDK anyway).
Behaviour is injected via the dispatch message preamble by the control plane;
files/issue-implementer-CLAUDE.md kept as the canonical source text.

Verified post-fix: a preamble-dispatched task edited README and COMMITTED
(073ab28) unattended.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 20:49:34 +00:00
Viktor Barzin
34c30ac2bf t3-afk: auto-pair dispatcher sidecar — no manual pairing
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The bare `t3 serve` behind Authentik showed the manual /pair#token screen, which
didn't connect. Mirror the devvm t3-dispatch: a small stdlib-Node sidecar fronts
t3 serve, and on a cookieless (already Authentik-gated) document load it mints a
pairing credential (`t3 auth pairing create`) and exchanges it at
/api/auth/browser-session for the t3_session cookie, then 302s back. Everything
else — including WebSocket upgrades for the live cockpit — reverse-proxies to
:3773. The Service now targets the sidecar (:8080).

Verified: cookieless GET -> 302 + Set-Cookie t3_session; cookied GET -> 200 SPA.
Matches the t3.viktorbarzin.me experience (Authentik login -> straight into the
cockpit).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 20:19:39 +00:00
Viktor Barzin
82a0c5aedf t3-afk: fix crashloop — exclude from Keel at the deployment level
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Keel "patch"-downgraded the image docker.io/library/node:24 -> library/node:24.0.2,
which is below t3@0.0.27's required node >=24.10, so `t3 serve` exited silently and
the pod crash-looped (~160 restarts / 13h).

Root cause: keel.sh/policy=never was on the POD-TEMPLATE labels, but Keel reads the
policy at the DEPLOYMENT level. The cluster's Kyverno inject-keel-annotations is
opt-out, so it stamped policy=patch and Keel acted on it.

Fix: set keel.sh/policy=never as a deployment-level annotation; ignore_changes the
Kyverno-injected keel.sh/pollSchedule + keel.sh/trigger annotations; the image stays
TF-owned (apply reverted Keel's downgrade). Pod now 1/1, t3 serve 200.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 10:32:38 +00:00
Viktor Barzin
214638216b fix(anisette): wait_for_rollout=false so a slow first start can't strand the deploy out of state
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The docker.io fix created the deployment, but wait_for_rollout (default true)
then hung on the OOMing pod and the apply failed — leaving the deployment in
the cluster but NOT in terraform state, so every later apply hit
'deployments.apps "anisette" already exists'. Deleted that orphan and set
wait_for_rollout=false (mirrors tts/llama-cpp slow-start services); readiness
probe still gates Service traffic.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:56:30 +00:00
Viktor Barzin
d8c60d7ab8 t3-afk: dedicated in-cluster T3 Code instance (AFK executor + cockpit)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Slice #2 of claude-agent-service PRD #1 (AFK implementation pipeline). Dedicated
in-cluster T3 Code instance the control plane dispatches issues into; runs the
issue-implementer agent in a git worktree with a live cockpit. Applied + live
2026-06-14 (9 resources).

Pilot-fast: stock docker.io/library/node:24 + install pinned t3@0.0.27 + Claude
CLI at startup onto an SSD-NFS PVC. Authentik-gated ingress. issue-implementer
behaviour ships as a user-level ~/.claude/CLAUDE.md (T3 hardcodes the system
prompt; settingSources loads it) and forbids plan-mode/clarifying-questions so
unattended threads don't stall. Keel-excluded (ADR 0003). wait_for_rollout=false
(slow first start). Image fully-qualified for the Kyverno trusted-registries
allowlist; container mem limit 4Gi (tier-aux LimitRange cap).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:06:33 +00:00
Viktor Barzin
bc7b28244f fix(anisette): raise memory limit to 512Mi — 128Mi OOMKilled at startup
Some checks failed
ci/woodpecker/push/default Pipeline failed
The pod CrashLooped with OOMKilled (exit 137): anisette downloads and
initializes Apple's CoreADI provisioning library on startup, spiking past the
128Mi limit before it can bind :6969 (empty logs, liveness 'connection
refused'). Bump request 256Mi / limit 512Mi; steady state is much lower.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:54:13 +00:00
Viktor Barzin
96addf65b4 fix(anisette): docker.io/ image prefix to pass Kyverno require-trusted-registries
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
First apply was denied at admission — a bare dadoum/anisette-v3-server@sha256
ref isn't in the trusted-registries allowlist (only enumerated DockerHub
user-repo prefixes are). docker.io/* IS allowlisted, so use the explicit
registry prefix; still pulls via the 10.0.20.10 pull-through cache.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:47:05 +00:00
Viktor Barzin
0bfa6f0774 feat(anisette): self-hosted Apple anisette server for SideStore (infra #40)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Deploy a small stateless anisette-data server so the TripIt iOS Shell can be
sideloaded with SideStore using a free Apple ID, without brokering the
Apple-ID auth dance through a public third-party anisette server (which would
see every login). SideStore points at a stable internal endpoint we control.

- Image: Dadoum/anisette-v3-server, the de-facto standard anisette-v3 server
  for SideStore/AltStore. Upstream ships only a mutable :latest (no GitHub
  releases / semver / sha tags), so pinned by manifest digest instead of a tag
  per the "never :latest" rule. Pulled from DockerHub via the registry-VM
  pull-through cache like echo/cyberchef. Diun watches :latest (notify-only) so
  a new upstream build prompts a digest re-pin.
- Stateless: emptyDir backs the provisioning-library cache dir (regenerable
  download; upstream issue #23 means it doesn't preserve client auth across
  restarts anyway) — no PVC, no Vault secret.
- Internal-only endpoint http://anisette.viktorbarzin.lan (auth=none,
  allow_local_access_only, ssl_redirect off) — SideStore is a native client
  that can't do the Authentik cookie dance, same reasoning as android-emulator's
  adb. The .lan CNAME is auto-created by technitium-ingress-dns-sync; never
  publicly exposed.

Mirrors the echo/networking-toolbox/android-emulator stack pattern. Service
catalog updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:35:57 +00:00
Viktor Barzin
fe1f8d62e7 tripit: re-apply tripit stack to land CITY_IMAGE_PROVIDER=wikipedia
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The commit that enabled real city cover photos (a69847a0,
CITY_IMAGE_PROVIDER=wikipedia, #47) was committed to master but its CI run
skipped the tripit stack apply (changed-stack diff race — same class as the
prior "re-apply after pipeline race" fixes). The env never landed in-cluster,
so the provider stayed on its fake 1x1-PNG default and every trip/stay cover
rendered blank/placeholder in prod. This comment touch forces CI to re-apply
the tripit stack; terraform then reconciles the drift (desired HCL already
has the env) so the deployment picks up CITY_IMAGE_PROVIDER=wikipedia.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 17:45:07 +00:00
Viktor Barzin
2df6ebf305 health: fix middleware ref namespace prefix (restore site from 404)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
My previous commit referenced the new limiter as `health-rate-limit@kubernetescrd`,
omitting the namespace prefix. Traefik CRD middleware refs are
`<namespace>-<name>@kubernetescrd`, and the Middleware lives in the `traefik` ns,
so the router couldn't resolve it — Traefik failed the whole
health.viktorbarzin.me router and returned 404 on every path (the app + pod were
healthy throughout; verified via port-forward).

Correct it to `traefik-health-rate-limit@kubernetescrd`, matching the working
traefik-tripit-rate-limit / traefik-actualbudget-rate-limit references.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 17:43:08 +00:00
Viktor Barzin
086ff85911 health: dedicated 100/1000 rate limit for the redesigned SPA
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor hit 429s browsing the redesigned health app. The default shared limiter
is 10 req/s / burst 50, but each page load is the shell (JS chunks + two
self-hosted Geist woff2) plus a 5-8 call API burst, so fast tab-to-tab
navigation from one client IP overruns burst 50 — Traefik 429s the tail and the
affected cards/pages render empty.

Give health its own limiter (average 100, burst 1000) and skip the default,
exactly as tripit/immich/actualbudget/ha-sofia already do for the same
parallel-burst pattern. Attached via the ingress_factory escape hatch
(skip_default_rate_limit + extra_middlewares).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 13:03:51 +00:00
Viktor Barzin
6dc77f4612 uptime-kuma: add CONTEXT.md + ADR-0001 (intentionally lean; sizing/placement review)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Documents the 2026-06-13 right-sizing review: Kuma is already lean (~1 check/s, 227 monitors mostly at 300s, 77MB on shared MySQL, 30d retention); the 'scraping too much' concern traced to a fixed socket.io login-timeout incident, not load. Records the deliberate decisions (keep per-service [External] monitors over canaries; keep datastore on shared mysql.dbaas) with rejected alternatives + rationale, plus the known internal-sync no-prune gap (stale Goldilocks monitor cleaned up by hand).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 09:11:22 +00:00
Viktor Barzin
05bec26d09 health: internal test-access ingress + DEV_AUTH_EMAIL (ADR-0008)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Add health-test.viktorbarzin.lan (auth=none, allow_local_access_only,
anti-AI off) pointing at the same health deployment, plus a
DEV_AUTH_EMAIL=vbarzin@gmail.com env on the container. Lets automated
E2E / Playwright / manual screenshots reach the live app without the
Authentik SSO redirect, for testing — while the public
health.viktorbarzin.me ingress stays auth=required (forward-auth fails
closed, so the public path always carries the real X-authentik-email
header and never hits the DEV_AUTH_EMAIL fallback). LAN-only, no public
exposure. Decision recorded in health repo ADR-0008.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 04:02:34 +00:00
Viktor Barzin
e6699ed20b uptime-kuma: retry Kuma login in monitor-sync jobs (intermittent socket.io timeout)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The internal + external monitor-sync CronJobs intermittently failed with socketio.exceptions.TimeoutError on api.login(), firing JobFailed -> Slack noise (and leaving monitor sync stale). Kuma 2.3.2 itself is healthy (1/1, 30m CPU); its single Node event loop just briefly stalls under ~300 monitors so the socket.io login handshake occasionally exceeds the client timeout. Wrap connect+login in a 5-attempt / 15s-backoff retry (disconnecting the half-open client between tries) so a transient stall no longer fails the whole job. Applied to both sync scripts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 20:54:14 +00:00
Viktor Barzin
a6381b8cf8 forgejo: custom 8Gi ResourceQuota (was pegged at the 4Gi tier cap)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Yesterday's Forgejo 3Gi->4Gi OOM fix pushed its tier-3-edge namespace quota (requests.memory=4Gi) to 100%, firing KubeQuotaAlmostFull + the healthcheck resourcequota check. Forgejo is the git + OCI-registry backbone and legitimately needs ~4Gi, so the edge tier's 4Gi ceiling is too tight. Opt the namespace out of the auto tier quota (resource-governance/custom-quota=true) and define a forgejo-specific ResourceQuota at requests.memory=8Gi, so the 4Gi pod sits at ~50% with headroom. Same opt-out pattern dbaas uses. Re-tiering was rejected: tier 1-cluster is also 4Gi, and 0-core (8Gi) would over-classify Forgejo's priority/eviction.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 17:16:47 +00:00
Viktor Barzin
25a39fd54e k8s-portal: wire private-ghcr pull (allowlist + imagePullSecrets)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
k8s-portal was the last in-cluster image build; it now builds on GHA and
pushes ghcr.io/viktorbarzin/k8s-portal:latest, which is PRIVATE (infra repo
default). To pull it: add k8s-portal to the sync-ghcr-credentials Kyverno
allowlist (clones the ghcr-credentials Secret into the namespace) and
reference that secret via imagePullSecrets on the deployment — same wiring
as tripit/recruiter-responder. Completes the no-local-builds migration so
nothing builds container images on the cluster anymore (ADR-0002).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 15:38:42 +00:00
Viktor Barzin
a7d33abec9 k8s-portal: commit package.json + lock (force; was gitignored) — unblocks GHA build
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build k8s-portal / build (push) Has been cancelled
Recovered the real manifest + resolved lockfile (lockfileVersion 3, 71 pkgs)
from the running pod. A parent .gitignore force-ignored package.json, so the
git source tree was incomplete and the image only ever built manually. Now
reproducible on GHA (ADR-0002 no-local-builds).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 15:29:27 +00:00
Viktor Barzin
a9b08c03cf fix(k8s-portal): npm install (no committed lockfile) so GHA can build
Some checks are pending
Build k8s-portal / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
package-lock.json was never committed to either lineage — npm ci needs it,
so the build only ever worked from a manual devvm build with a local lock.
npm install resolves from package.json, unblocking the GHA build (ADR-0002).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 15:26:42 +00:00
Viktor Barzin
b906f61ac3 k8s-portal: build off-infra GHA -> ghcr + Keel; remove Woodpecker build (no-local-builds)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
The last in-cluster image build. GHA build-k8s-portal.yml builds
ghcr.io/viktorbarzin/k8s-portal:latest+sha (path-filtered on the Dockerfile
dir); Keel (force/poll/match-tag) rolls the deployment. Stack image repointed
to ghcr (ignore_changed); .woodpecker/k8s-portal.yml deleted.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 15:21:35 +00:00
Viktor Barzin
9501da81a0 dbaas: document postgresql-backup startingDeadlineSeconds rationale
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Inline note on why the four backup CronJobs moved 10s->600s (bda1bdcb): a 10s deadline silently dropped the 2026-06-13 midnight full-backup run, firing PostgreSQLBackupStale. bda1bdcb rode in the same push as a forgejo change that failed CI on a namespace-quota error, so that pipeline failed before the dbaas apply took effect (live deadline was still 10s). This dbaas-only commit re-triggers the dbaas apply at a clean master so the 600s deadline actually goes live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 14:22:24 +00:00
Viktor Barzin
ba72621e52 forgejo: 6Gi exceeded namespace quota, set to 4Gi (quota ceiling)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
The 3Gi->6Gi bump in ff3cc44a was rejected by the forgejo namespace tier-quota (requests.memory capped at 4Gi). With Guaranteed QoS the 6Gi request exceeded quota; FailedCreate left forgejo with 0 pods for ~6 min (git remote + OCI registry outage) until I patched the live Deployment back to a schedulable 4Gi. 4Gi is the most the quota allows and is still a headroom bump over the OOM-prone 3Gi. To go higher the tier-quota must be raised in the same change. This reconciles TF to the live 4Gi so the pending/next apply is a no-op rather than reverting to the quota-busting 6Gi.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 14:13:36 +00:00