Commit graph

1449 commits

Author SHA1 Message Date
Viktor Barzin
e7b9a74756 portal-assistant: land voice stacks + switch TTS to edge-tts (intelligible Bulgarian)
Some checks failed
ci/woodpecker/push/default Pipeline failed
The portal-assistant voice-assistant stacks (portal-tts, portal-stt,
portal-assistant) were applied to the live cluster from feature branches but
never landed on master — the GitOps source of truth. This lands all three and,
in portal-tts, fixes Bulgarian speech.

Bulgarian was unintelligible: the local Piper voice (bg_BG-dimitar-medium via
espeak-ng) mangles Bulgarian consonants — a synth->Whisper round-trip turned
"Добър ден" into "Обърден", and a user heard pure gibberish. English was fine.

portal-tts now runs openai-edge-tts (Microsoft edge-tts neural voices) for BOTH
languages instead of Piper — ADR-0003 always named edge-tts as the online
Bulgarian-quality fallback. Validated before landing: edge bg round-trips
through Whisper verbatim ("Добър ден! Как сте днес? ..."). The gateway maps
detected language bg/en to the edge voice names via new TTS_VOICE_BG /
TTS_VOICE_EN env (bg-BG-KalinaNeural / en-US-AvaNeural). No GPU, no NFS model
store, no secrets — edge fetches voices from Microsoft per request (egress
verified). The assistant already needs the internet for the Claude brain, so an
online TTS adds no new failure mode.

The brain stays Sonnet with no extended thinking (already the default — a live
turn answers directly in ~3.4s), per the latency-over-smartness ask.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 20:25:29 +00:00
Viktor Barzin
677a181d49 reverse-proxy: dedicated rate limit for ha-london; bump ha-sofia (cold-client 429s)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
New, empty-cache clients (the repurposed Meta Portal running the HA companion
app) cold-load the whole HA frontend at once - dozens of frontend_latest/*.js +
MDI icon chunks. ha-london had no per-service rate limit, so it fell back to the
global 10/s burst 50 and 429'd those chunks, leaving every dashboard blank
(Settings, which loads less, worked). Give ha-london its own 200/500 middleware
(skip_global_rate_limit, mirroring ha-sofia, with depends_on to avoid the
dangling-middleware 404 window) and bump ha-sofia 100/200 -> 200/500 so a cold
Portal load of Sofia doesn't hit the same wall.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 19:53:47 +00:00
Viktor Barzin
aac7121ccc t3-afk: scale to 0 — park the in-cluster T3 AFK executor (no current plans)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor has no near-term plans to use the autonomous AFK pipeline's in-cluster T3
cockpit/executor, so stop its pod to free node resources while keeping it
trivially revivable. Only the deployment replica count changes (1 -> 0); the SSD
PVC (state.sqlite + repo checkouts), Service, Ingress, and ExternalSecret are all
left in place — reviving is just setting replicas back to 1 and applying.

Already applied live via scripts/tg (PG state now 0 replicas, pod terminated);
this commit syncs git so drift-detection / the next apply won't re-scale it up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 18:55:35 +00:00
Viktor Barzin
b931d9fb20 k8s-version-upgrade: make tigera-operator restore crash-safe (EXIT trap)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
phase_master quiesces tigera-operator (Calico's config reconciler) to 0 around
the master upgrade so it can't crashloop during the apiserver blip + I/O-storm
kubeadm's static-pod-hash watch (which would roll the upgrade back). The restore
was a plain line at the end of the phase, so any abort AFTER quiescing left the
operator at 0 — and the idempotent retry then skipped the already-on-target
master phase and never restored it. Observed 2026-06-17: a post-upgrade gate
aborted the master attempt; the operator sat scaled to 0 for ~1.5h (data plane
fine — calico-node keeps running — but no Calico reconciliation).

Fix:
  - Drain first (drain doesn't blip the apiserver), THEN quiesce right before
    `kubeadm upgrade apply`, and install an EXIT trap that restores the operator
    no matter how the phase exits (gate abort, set -e on ssh/kubeadm, success).
    Trap is set AFTER drain_node so its own EXIT trap can't clobber it; cleared
    after the explicit happy-path restore.
  - postflight also force-restores replicas=1 as a final guarantee (covers the
    skip-on-retry path that never quiesces or restores).

Long-term fix remains HA control plane (apiserver never goes down) — bead code-n0ow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 18:25:54 +00:00
Viktor Barzin
c04efa3d3a k8s-version-upgrade: move detection to nightly 23:00 UTC (overnight upgrades)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Disruptive node drains should run when the cluster is idle. Move the
k8s-version-check detection CronJob from 12:00 UTC (noon) to 23:00 UTC
(00:00 London) — overnight, low usage, and clear of the kured OS-reboot window
(01:00-05:00 UTC) so the two drain pipelines never overlap. (Viktor, 2026-06-17.)

  - stacks/k8s-version-upgrade/main.tf: var.schedule default 0 12 → 0 23 * * *.
  - scripts/upgrade_state.sh: next_scheduled_run_utc now computes the 23:00 slot
    (was next_daily_noon_utc).
  - docs (runbook, architecture) + upgrade-state SKILL: schedule references
    updated to 23:00 UTC nightly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 18:16:32 +00:00
Viktor Barzin
ed53b34bf4 k8s-version-upgrade: dynamic worker enumeration + IP-based SSH (auto-cover all/new nodes)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The chain hardcoded master→node4→node3→node2→node1→postflight and SSHed by
FQDN. It silently SKIPPED node5/node6 (added 2026-05-26) — postflight would
have failed even if reachable — and node5/node6 had no .viktorbarzin.lan DNS
records, so the chain couldn't SSH to them at all.

Refactor (upgrade-step.sh):
  - Worker set + order derived live from `kubectl get nodes` (worker_nodes /
    next_pending_worker), so EVERY worker still off-target is upgraded and a
    newly-joined node is covered with zero script change.
  - SSH targets are node InternalIPs (ssh_target), removing the dependency on
    node DNS records entirely — a new node is reachable the moment it joins.
  - The two remaining hardcoded loops (containerd skew, apt-repo rewrite) now
    enumerate workers/all-nodes dynamically too.
  - Topology preserved: master-drain Job runs on the first worker; every
    worker-drain Job runs on the already-upgraded k8s-master (self-preemption
    invariant intact).
  - next_pending_worker returns 0 explicitly on the no-match path — the
    `while read … done < <(…)` loop exits 1 at EOF, which under set -e would
    abort the LAST worker's Job before it spawns postflight (cluster upgraded
    but no cleanup / in_flight reset). Caught in review.

Docs (runbook + architecture + headers) updated to the dynamic topology.

NOTE: nodes still need the k8s-upgrade SSH public key in authorized_keys; it was
deployed to node4/5/6 by hand this session. Baking it into node provisioning
(so new nodes get it automatically) is the remaining follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 16:56:02 +00:00
Viktor Barzin
0c5a9b5f44 k8s-version-upgrade: grant pods/log so preflight can verify the etcd snapshot
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Preflight step 6 confirms the pre-upgrade etcd snapshot is non-empty by parsing
the backup Job's log (`kubectl -n default logs job/pre-upgrade-etcd-...`). The
k8s-upgrade-job ClusterRole granted `pods` get/list/delete but NOT the `pods/log`
subresource, so the read failed with Forbidden in the default ns and aborted
preflight — after step 5 had already set k8s_upgrade_in_flight=1. A stale
out-of-band grant had masked this until a `terragrunt apply` in this session
reconciled the role back to its TF definition. Codify pods/log:get.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:52:52 +00:00
Viktor Barzin
bfb86e653f k8s-version-upgrade: ignore CoreDNS preflight on kubeadm upgrade plan too
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The prior commit added the CoreDNS ignore/skip flags only to `kubeadm upgrade
apply`, but `kubeadm upgrade plan` runs the SAME CoreDNS preflight. Once master's
kubeadm binary is on the target version (the first attempt's apt step already
bumps it), both plan calls fail on the Keel-drifted CoreDNS 1.12.4 under
set -euo pipefail and abort:
  - preflight Job step 4 (upgrade-step.sh) — `plan` output is grepped for the
    target version; the failing pipeline killed the whole preflight.
  - update_k8s.sh master path line 85 — bare `plan` before the apply.

Both now pass --ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins.
Verified read-only on master: plan exits 0 and still emits
"kubeadm upgrade apply v1.34.9".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:49:06 +00:00
Viktor Barzin
037a609f27 k8s-version-upgrade: unblock 1.34.9 — skip kubeadm CoreDNS addon + busybox-date fix
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The 1.34.9 master upgrade hard-failed `kubeadm upgrade apply` preflight: CoreDNS
is at v1.12.4 (Keel auto-bumped it 1.12.1 -> 1.12.4 on 2026-05-26 via a stale
kube-system out-of-band annotation), and 1.12.4 is ahead of kubeadm 1.34.9's
bundled corefile-migration table ("start version not supported").

- scripts/update_k8s.sh: master `kubeadm upgrade apply` now runs with
  `--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins
  --skip-phases=addon/coredns`. A dry-run proved --ignore ALONE would overwrite
  our custom split-horizon Corefile with kubeadm's default AND downgrade the
  image; --skip-phases leaves CoreDNS 100% untouched while the control plane
  upgrades. CoreDNS is pinned off Keel (keel.sh/policy=never) to stop the drift.
- stacks/k8s-version-upgrade/scripts/upgrade-step.sh: fix the preflight
  quiet-baseline (settle-window) check, which silently no-op'd on the ghcr
  claude-agent-service image's busybox `date` (can't parse ISO8601). Now tries
  GNU then busybox `-D`, and warns+skips on parse failure (no silent fail-open).
- docs: runbook + architecture document the CoreDNS handling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:45:05 +00:00
Viktor Barzin
042d1ce1ac k8s-version-upgrade: CI-retrigger to apply D1 (missed by two-commit diff-base)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
fb638cd8 landed as two commits; the apply pipeline diffed against HEAD~1 (the
monitoring-only commit) and never applied stacks/k8s-version-upgrade, so the
retry-on-failure logic isn't live yet. This single-commit retrigger forces it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:28:58 +00:00
Viktor Barzin
fb638cd8ec k8s-version-upgrade: scope chain-fail alert to terminal reasons + sync docs
Some checks failed
ci/woodpecker/push/default Pipeline failed
Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to
the terminal job-condition reasons (BackoffLimitExceeded|DeadlineExceeded). A
phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every
firing alert also halts kured, so a bare-count false-positive would block all
OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics:
the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0
for the terminal reasons.

Docs updated to match the behaviour change (per the same-commit docs rule):
  - docs/runbooks/k8s-version-upgrade.md — new alert in the gates list; the
    "kill a stuck Job" recovery now leads with retry-on-failure self-heal.
  - docs/architecture/automated-upgrades.md — fourth Upgrade Gates alert;
    retry-on-failure note on the deterministic-naming paragraph.
  - .claude/skills/upgrade-state/SKILL.md — new "chain failed" status, legend
    entry, and drill-down (also copied to the active ~/.claude copy).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:10:18 +00:00
Viktor Barzin
dfa1a12a86 k8s-version-upgrade: retry failed phases + surface wedged chain (fix 5-day silent stall)
The 1.34.9 patch auto-upgrade sat stuck for 5 days without anyone knowing.
On 2026-06-12 a transient critical alert (the ttyd web-terminal probe on the
devvm) was firing when the daily detection ran; the preflight's "halt on any
critical alert" gate aborted it, so the preflight Job Failed (backoffLimit=1).
Two design gaps then turned that blip into a multi-day wedge:

  * the detection guard and spawn_next only checked whether the phase Job
    EXISTED, not whether it succeeded — and the Failed Job lingers 7 days via
    ttlSecondsAfterFinished, so every daily run skipped re-spawning it;
  * the abort happens before the in-flight metric is pushed, so neither
    K8sUpgradeStalled nor upgrade_state.sh could see it — the pipeline reported
    "never ran" while actually being stuck.

Fixes:
  D1 retry-on-failure: detection CronJob (main.tf) and spawn_next
     (upgrade-step.sh) now delete + re-spawn a terminally-Failed phase Job
     instead of skipping it, so a transient gate self-corrects next cycle
     rather than wedging the pipeline for a week.
  D2 WebterminalTtydUnreachable critical -> warning: a devvm developer
     web-terminal is not cluster infrastructure and must not block upgrades.
  D3 observability: new K8sUpgradeChainJobFailed alert
     (kube_job_status_failed in k8s-upgrade ns) and upgrade_state.sh now flags
     a Failed chain Job as "chain failed" — closing the pre-in-flight blind
     spot so a wedge is visible immediately.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:07:36 +00:00
Viktor Barzin
7e7e41cbef fix(authentik): derive username from email in tripit-enrollment (user_write needs it)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The passwordless enrollment prompt collects only email+name, so user_write aborted with 'Aborting write to empty username' (ak-stage-access-denied). Add an expression policy on the user_write binding (evaluate_on_plan=false + re_evaluate_policies=true, like guest.tf) that sets prompt_data['username'] = the entered email before the write. Verified the failure live via the flow executor API.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 07:35:23 +00:00
Viktor Barzin
e4512f3566 fix(authentik): deliver tripit email-verify stages via blueprint (provider token_expiry too old)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Pipeline 214 failed: the pinned goauthentik 2024.x provider models EmailStage.token_expiry as an integer, but the live 2026.2.x server requires a duration string ('hours=24') and 400s any number (even the provider default 30). Bumping the provider is a global terragrunt.hcl change re-applying every platform stack + breaking 3 other authentik-using stacks' lockfiles — disproportionate. Instead the two email-verification stages + their flow bindings move into an Authentik blueprint (tripit-email-stages.yaml) applied server-side via authentik_blueprint; the server parses token_expiry natively. Validated on the live server + terraform validate. Restores the ADR-0020 email-verification security gate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 07:30:05 +00:00
Viktor Barzin
89eb090be3 feat(authentik): tripit-enrollment + tripit-recovery flows (passwordless signup, ADR-0020)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Makes the WebLanding 'Sign up' button work (it was 404ing — the tripit-enrollment flow didn't exist). Open passwordless registration: prompt(email,name) -> user_write(INACTIVE, external, group 'TripIt External') -> email verification (activates) -> passkey -> login. The inactive-until-verified gate is the security boundary: tripit trusts X-authentik-email, so activation must require proving inbox ownership. Passwordless login already works via the built-in webauthn flow. tripit-recovery (email -> new passkey) is built but intentionally NOT wired into the global brand recovery, so admin recovery is unchanged. Schema validated with terraform validate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 07:20:11 +00:00
Viktor Barzin
4bf3f504ea fix(authentik): SMTP host = mail.viktorbarzin.me (svc name fails wildcard-cert verify)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The in-cluster svc name mailserver.mailserver.svc.cluster.local fails Authentik's strict STARTTLS hostname verification (CERTIFICATE_VERIFY_FAILED): the mailserver serves the *.viktorbarzin.me wildcard cert, which doesn't cover the svc DNS name. Use the public name mail.viktorbarzin.me, which resolves in-cluster (10.0.20.1) and matches the cert. Verified end-to-end from an authentik pod (verified TLS + SASL auth + send) before this change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 07:13:53 +00:00
Viktor Barzin
c3d0c121bb feat(authentik): wire SMTP (noreply@) for TripIt signup verification + recovery email (ADR-0020)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Authentik email was unconfigured (localhost), so the TripIt enrollment flow's email-verification stage couldn't send. Add AUTHENTIK_EMAIL__* to server.env + worker.env pointing at the in-cluster mailserver as noreply@viktorbarzin.me (587/STARTTLS), with the SASL password synced from Vault secret/authentik.smtp_password via a new authentik-email ExternalSecret (reloader-annotated). Image pin unchanged (2026.2.4 == live). Prereq for the tripit-enrollment flow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 07:04:52 +00:00
Viktor Barzin
8a2a3d9eca Merge remote-tracking branch 'origin/master' into wizard/reconcile-mirror
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
# Conflicts:
#	scripts/t3-provision-users.sh
2026-06-16 22:32:43 +00:00
Viktor Barzin
63e714782c immich: remove one-shot anca-elements-import Job + its PVC
All of Anca's photos are imported. The Job was declared as
kubernetes_job_v1.anca_elements_import — meaning every `terragrunt apply` of
the immich stack re-created it, despite the 2026-05-25 in-code comment saying
"After successful completion: REMOVE this resource block + apply again."
Nobody noticed for 22 days; the re-trigger today (2026-06-16) was the
6th IO-pressure incident — it scanned all 21,643 assets in pure read-scan
mode for 51 min, saturated sdc, starved etcd, crash-looped kube-apiserver.

Recovery actions taken before this commit:
  - Throttled nfsd 64→8 on PVE host to give apiserver headroom
  - `kubectl delete job -n immich anca-elements-import` + force-delete pod
  - Restored nfsd to 64; cluster healthy

Code change here:
  - Remove `kubernetes_job_v1.anca_elements_import` block
  - Remove `module.nfs_anca_elements_host` (PVC `immich-anca-elements-host` —
    no live consumer; videos batch deferred per user, source dump remains on
    PVE at /srv/nfs/anca-elements, browseable via Nextcloud admin)
  - Update 2026-05-25 post-mortem: 6th-incident section + new lesson that
    one-shot Jobs do NOT belong in kubernetes_job_v1 (use a suspended CronJob
    or a runbook-captured `kubectl create job` ad-hoc invocation instead).
2026-06-16 22:11:27 +00:00
Viktor Barzin
88717c61fd immich-frame: whole library (last 2y), Ken Burns, weather, 30s interval
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Per Viktor: show the whole Immich library from the last 2 years instead of the
single 'china' album, enable Ken Burns pan/zoom, slow the interval to 30s, and
add the weather overlay (London, metric). OpenWeatherMap key is read from Vault
(secret/immich -> frame_weather_api_key), not hardcoded.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 21:07:39 +00:00
Viktor Barzin
cffa32fae3 Merge remote-tracking branch 'forgejo/master' into wizard/tripit-ingest-model
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-16 20:39:30 +00:00
Viktor Barzin
14476bfbd7 tripit: mail-ingest extracts with the qwen3-8b text model, not the vision model
Forwarded schedule-change emails were being parsed by qwen3vl-4b (a 4B *vision*
model) for text extraction, which reliably dropped the flight number — so the
matcher had no key to link on and a forwarded flight update created a duplicate
instead of amending the existing segment.

Point the ingest-plans CronJob's text extraction at qwen3-8b (verified live: it
emits flight_number + a clean PNR, 3/3 on the failing email) and keep qwen3vl-4b
for boarding-pass image attachments (LLM_VISION_MODEL). llama-swap loads each on
demand; the GPU swap cost is accepted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:39:29 +00:00
Viktor Barzin
c6a5cbe227 feat(tripit): serve the SPA publicly, keep /api + /metrics forward-auth-gated (ADR-0020 landing)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
The website 302'd unauthenticated visitors straight to Authentik. Split the tripit.viktorbarzin.me ingress: the SPA shell (everything else) becomes auth=none so the app shows its own Log in / Sign up landing page, while a new tripit-app-api ingress keeps /api + /metrics behind forward-auth — the security boundary, since /api trusts the outpost-injected X-authentik-email. The public SPA gets strip-auth-headers (no spoofed headers can reach the backend) and anti_ai_scraping=false (it's an installable PWA). The existing auth=none carve-outs (calendar, emails/confirm, planner/slack) are longer prefixes and keep winning. Pairs with the tripit landing-page deploy (commit 3fe4da1).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:30:58 +00:00
github-actions[bot]
eb47eb1d10 priority-pass: bump image_tag to 63e118c3 [ci skip]
Auto-committed by ViktorBarzin/priority-pass GHA on push to main.
Source: 63e118c334
2026-06-16 17:45:33 +00:00
github-actions[bot]
d1f2e50736 priority-pass: bump image_tag to 4ce9e8e8 [ci skip]
Auto-committed by ViktorBarzin/priority-pass GHA on push to main.
Source: 4ce9e8e894
2026-06-16 17:44:40 +00:00
github-actions[bot]
46b5f04f67 priority-pass: bump image_tag to 63e118c3 [ci skip]
Auto-committed by ViktorBarzin/priority-pass GHA on push to main.
Source: 63e118c334
2026-06-16 17:20:08 +00:00
github-actions[bot]
29ad200026 priority-pass: bump image_tag to 4ce9e8e8 [ci skip]
Auto-committed by ViktorBarzin/priority-pass GHA on push to main.
Source: 4ce9e8e894
2026-06-16 17:19:55 +00:00
Viktor Barzin
f4f7705127 monitoring: adopt orphaned alert-digest resources into TF state (unblocks apply)
The monitoring stack apply was create-failing on every push with `configmaps
"alert-digest-script" already exists` + `secrets "alert-digest" already exists`
(modules/monitoring/alert_digest.tf) — both resources exist in-cluster but fell
out of Terraform state, so apply tried to CREATE them and errored. Pre-existing
(failed on pipelines 203 AND 204, NOT caused by the t3 alert-rules change). Add
import {} blocks (TF 1.5+ adoption per AGENTS.md) so apply imports + reconciles
instead of failing. Idempotent once imported; safe to remove after a green apply.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 11:31:17 +00:00
Viktor Barzin
994d305d04 t3: session-auth detection for the gated nightly tracker (dispatch fallback logging + Loki alerts)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Before auto-tracking t3 nightly builds (Viktor's call, risk accepted), stand up
the detection that was missing on 2026-06-09 — when an auto-pulled nightly broke
pairing for ALL users and nothing alerted. Viktor's explicit requirement: make
sure session auth keeps working and revert if the pairing fallback/failure rate
climbs. This is phase 0 (detection) of that work.

- t3-dispatch: exchangeCredential now reports WHICH pairing endpoint answered,
  and autoPair logs every outcome (paired user=.. endpoint=.. fallback=..) — so
  the real-user browser-session->bootstrap fallback rate is observable. A
  non-zero rate flags that a build moved the pairing API (the 2026-06-09 class).
- Loki ruler alerts (devvm journal -> Alertmanager -> Slack): T3PairingBroken
  (real users failing to pair), T3PairFallbackHigh (build moved the pairing API),
  T3AutoUpdateRolledBack / RollbackFailed / Frozen (enforcer outcomes). Closes
  the post-mortem's open "nothing monitors end-to-end pairing" detection gap.

The existing t3-probe only checks GET /api/auth/session==200, which stays 200
even when pairing is dead, so it never caught the outage class.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:56:55 +00:00
Viktor Barzin
e783cae2cb chrome-service + mam-farming: doc clarifications (+ re-trigger CI apply missed earlier)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Two small doc additions that also re-include these stacks in Woodpecker's
changed-stack detection. The earlier 2-commit push left chrome-service out of the
HEAD~1..HEAD diff so its ignore_changes fix never applied; the monitoring apply was
separately blocked by a stuck prometheus pending-upgrade (now cleared).

- chrome-service: note the live pod's container order had drifted from this file's
  order, so a TF apply reorders them (containers[0] differs live-vs-TF until the
  apply lands) -- documents the confusion this caused during diagnosis.
- mam-farming: cross-ref the grabber script that emits mam_grabber_last_run_timestamp.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:34:23 +00:00
Viktor Barzin
2479560fa2 mam-farming: make MAMFarmingStuck a grabber heartbeat, not a grab-count check
Some checks failed
ci/woodpecker/push/default Pipeline failed
MAMFarmingStuck fired whenever the freeleech grabber added 0 torrents in 4h, but
grabbing 0 is normal: the grabber searches a random catalogue offset each run and
legitimately finds nothing when freeleech is dry (account ratio was a healthy
37.5; the alert even misreported it as "0.00" because $value was the grabbed
count, not the ratio). The alert's real intent was to catch the grabber not
running at all (CronJob Forbid-blocked / wedged), but increase(grabbed[4h])==0
cannot distinguish "didn't run" from "ran, nothing to grab" since Pushgateway
serves the last pushed value forever.

The grabber now heartbeats mam_grabber_last_run_timestamp on every completed run
(main success, ratio/mouse skip, and qBittorrent-unreachable paths). The alert
fires only when that heartbeat is >4h stale — the true stuck condition. Cookie
expiry and qBittorrent-down keep their own dedicated alerts.

Surfaced by /cluster-health as a false-firing alert.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 08:18:33 +00:00
Viktor Barzin
a0725ede57 chrome-service: stop ignoring container[0].image so TF re-asserts the pinned browser image
The chrome-service container (container[0]) runs the pinned Microsoft Playwright
image, which ships chromium under /ms-playwright. Its image was still listed in
the deployment's lifecycle ignore_changes — a leftover KEEL_IGNORE from before
ADR-0002 #29 moved the novnc container to TF management. With that field ignored,
a stray clobber of container[0] to ghcr chrome-service-novnc:latest (which has no
chromium there) stuck permanently: the container crash-looped ~12h on "chromium
binary not found under /ms-playwright" (273 restarts) and TF could not revert it.

Remove container[0].image from ignore_changes so Terraform pins it to local.image
and re-asserts it on every apply. Both containers are TF-managed now (novnc since
ADR-0002 #29); Keel is inert (policy=never), so nothing should fight TF here.

Surfaced by /cluster-health. Live state was already restored transiently via
kubectl set image; this commit makes the fix durable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 08:18:32 +00:00
Viktor Barzin
57d45d8d8f fix(authentik): pin Vault binding UUIDs as literals (provider has no authentik_application data source)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
CI pipeline 198 failed: the pinned goauthentik/authentik provider has no data "authentik_application" source, so terraform failed the whole authentik plan and applied NOTHING (state unchanged). Replace the data-source lookups with the live pbm_uuid (Vault app) and group_uuid (Allow Login Users) as literals; authentik_policy_binding is supported (used in guest.tf).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 22:01:29 +00:00
Viktor Barzin
aa461b95bc feat(authentik): bind Vault OIDC app to Allow Login Users (close ADR-0020 OIDC gap)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Audit found the Vault Authentik application had no authorization binding, so any authenticated identity (incl. a future self-enrolled TripIt External user) could complete Vault OIDC login and get a built-in default-policy token. Bind it to 'Allow Login Users' — existing homelab users inherit that group via its children (verified User.all_groups() includes the parent), parentless TripIt External users are excluded. Closes the only OIDC app the forward-auth fence does not cover.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 21:48:04 +00:00
Viktor Barzin
cbca281aaa feat(authentik): TripIt external self-signup group + forward-auth fence (ADR-0020)
Viktor wants people outside the homelab to self-register to TripIt with email + a passkey (no password), kept separate from the rest of the homelab. Adds the empty, parentless 'TripIt External' Authentik group and a first-position branch in the catch-all policy that admits those users to tripit.viktorbarzin.me only and denies every other forward-auth host. Inert on apply (group empty => matches no existing user => no lockout). An adversarial review found the fence is forward-auth-only, so the runbook records the OIDC-app containment audit (every sensitive app already requires a trusted group External users won't hold), the Vault->Allow Login Users binding that closes the one open OIDC app, the SMTP prerequisite for email verification, and the before/after access-matrix verification. Flows/SMTP/Vault binding are UI steps per the runbook; the push that applies the catch-all edit must be human-watched (CI auto-applies the authentik stack).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 21:48:04 +00:00
Viktor Barzin
5d3a166b94 t3-afk: fix agent Bash — stop mounting into ~/.claude
Some checks failed
ci/woodpecker/push/default Pipeline failed
Root cause of "the agent never commits": the issue-implementer CLAUDE.md was
subPath-mounted at /home/node/.claude/CLAUDE.md, which made /home/node/.claude
root-owned. The agent (uid 1000) then couldn't create its Bash session-env
there, so EVERY Bash/git call failed (Write/Edit worked, so it silently edited
but never committed). Found by reading the agent transcripts from
state.sqlite -> projection_thread_messages.

Fix: don't mount anything into ~/.claude (it's not honored by T3's SDK anyway).
Behaviour is injected via the dispatch message preamble by the control plane;
files/issue-implementer-CLAUDE.md kept as the canonical source text.

Verified post-fix: a preamble-dispatched task edited README and COMMITTED
(073ab28) unattended.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 20:49:34 +00:00
Viktor Barzin
34c30ac2bf t3-afk: auto-pair dispatcher sidecar — no manual pairing
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The bare `t3 serve` behind Authentik showed the manual /pair#token screen, which
didn't connect. Mirror the devvm t3-dispatch: a small stdlib-Node sidecar fronts
t3 serve, and on a cookieless (already Authentik-gated) document load it mints a
pairing credential (`t3 auth pairing create`) and exchanges it at
/api/auth/browser-session for the t3_session cookie, then 302s back. Everything
else — including WebSocket upgrades for the live cockpit — reverse-proxies to
:3773. The Service now targets the sidecar (:8080).

Verified: cookieless GET -> 302 + Set-Cookie t3_session; cookied GET -> 200 SPA.
Matches the t3.viktorbarzin.me experience (Authentik login -> straight into the
cockpit).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 20:19:39 +00:00
Viktor Barzin
82a0c5aedf t3-afk: fix crashloop — exclude from Keel at the deployment level
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Keel "patch"-downgraded the image docker.io/library/node:24 -> library/node:24.0.2,
which is below t3@0.0.27's required node >=24.10, so `t3 serve` exited silently and
the pod crash-looped (~160 restarts / 13h).

Root cause: keel.sh/policy=never was on the POD-TEMPLATE labels, but Keel reads the
policy at the DEPLOYMENT level. The cluster's Kyverno inject-keel-annotations is
opt-out, so it stamped policy=patch and Keel acted on it.

Fix: set keel.sh/policy=never as a deployment-level annotation; ignore_changes the
Kyverno-injected keel.sh/pollSchedule + keel.sh/trigger annotations; the image stays
TF-owned (apply reverted Keel's downgrade). Pod now 1/1, t3 serve 200.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 10:32:38 +00:00
Viktor Barzin
214638216b fix(anisette): wait_for_rollout=false so a slow first start can't strand the deploy out of state
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The docker.io fix created the deployment, but wait_for_rollout (default true)
then hung on the OOMing pod and the apply failed — leaving the deployment in
the cluster but NOT in terraform state, so every later apply hit
'deployments.apps "anisette" already exists'. Deleted that orphan and set
wait_for_rollout=false (mirrors tts/llama-cpp slow-start services); readiness
probe still gates Service traffic.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:56:30 +00:00
Viktor Barzin
d8c60d7ab8 t3-afk: dedicated in-cluster T3 Code instance (AFK executor + cockpit)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Slice #2 of claude-agent-service PRD #1 (AFK implementation pipeline). Dedicated
in-cluster T3 Code instance the control plane dispatches issues into; runs the
issue-implementer agent in a git worktree with a live cockpit. Applied + live
2026-06-14 (9 resources).

Pilot-fast: stock docker.io/library/node:24 + install pinned t3@0.0.27 + Claude
CLI at startup onto an SSD-NFS PVC. Authentik-gated ingress. issue-implementer
behaviour ships as a user-level ~/.claude/CLAUDE.md (T3 hardcodes the system
prompt; settingSources loads it) and forbids plan-mode/clarifying-questions so
unattended threads don't stall. Keel-excluded (ADR 0003). wait_for_rollout=false
(slow first start). Image fully-qualified for the Kyverno trusted-registries
allowlist; container mem limit 4Gi (tier-aux LimitRange cap).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 20:06:33 +00:00
Viktor Barzin
bc7b28244f fix(anisette): raise memory limit to 512Mi — 128Mi OOMKilled at startup
Some checks failed
ci/woodpecker/push/default Pipeline failed
The pod CrashLooped with OOMKilled (exit 137): anisette downloads and
initializes Apple's CoreADI provisioning library on startup, spiking past the
128Mi limit before it can bind :6969 (empty logs, liveness 'connection
refused'). Bump request 256Mi / limit 512Mi; steady state is much lower.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:54:13 +00:00
Viktor Barzin
96addf65b4 fix(anisette): docker.io/ image prefix to pass Kyverno require-trusted-registries
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
First apply was denied at admission — a bare dadoum/anisette-v3-server@sha256
ref isn't in the trusted-registries allowlist (only enumerated DockerHub
user-repo prefixes are). docker.io/* IS allowlisted, so use the explicit
registry prefix; still pulls via the 10.0.20.10 pull-through cache.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:47:05 +00:00
Viktor Barzin
0bfa6f0774 feat(anisette): self-hosted Apple anisette server for SideStore (infra #40)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Deploy a small stateless anisette-data server so the TripIt iOS Shell can be
sideloaded with SideStore using a free Apple ID, without brokering the
Apple-ID auth dance through a public third-party anisette server (which would
see every login). SideStore points at a stable internal endpoint we control.

- Image: Dadoum/anisette-v3-server, the de-facto standard anisette-v3 server
  for SideStore/AltStore. Upstream ships only a mutable :latest (no GitHub
  releases / semver / sha tags), so pinned by manifest digest instead of a tag
  per the "never :latest" rule. Pulled from DockerHub via the registry-VM
  pull-through cache like echo/cyberchef. Diun watches :latest (notify-only) so
  a new upstream build prompts a digest re-pin.
- Stateless: emptyDir backs the provisioning-library cache dir (regenerable
  download; upstream issue #23 means it doesn't preserve client auth across
  restarts anyway) — no PVC, no Vault secret.
- Internal-only endpoint http://anisette.viktorbarzin.lan (auth=none,
  allow_local_access_only, ssl_redirect off) — SideStore is a native client
  that can't do the Authentik cookie dance, same reasoning as android-emulator's
  adb. The .lan CNAME is auto-created by technitium-ingress-dns-sync; never
  publicly exposed.

Mirrors the echo/networking-toolbox/android-emulator stack pattern. Service
catalog updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:35:57 +00:00
Viktor Barzin
fe1f8d62e7 tripit: re-apply tripit stack to land CITY_IMAGE_PROVIDER=wikipedia
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The commit that enabled real city cover photos (a69847a0,
CITY_IMAGE_PROVIDER=wikipedia, #47) was committed to master but its CI run
skipped the tripit stack apply (changed-stack diff race — same class as the
prior "re-apply after pipeline race" fixes). The env never landed in-cluster,
so the provider stayed on its fake 1x1-PNG default and every trip/stay cover
rendered blank/placeholder in prod. This comment touch forces CI to re-apply
the tripit stack; terraform then reconciles the drift (desired HCL already
has the env) so the deployment picks up CITY_IMAGE_PROVIDER=wikipedia.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 17:45:07 +00:00
Viktor Barzin
2df6ebf305 health: fix middleware ref namespace prefix (restore site from 404)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
My previous commit referenced the new limiter as `health-rate-limit@kubernetescrd`,
omitting the namespace prefix. Traefik CRD middleware refs are
`<namespace>-<name>@kubernetescrd`, and the Middleware lives in the `traefik` ns,
so the router couldn't resolve it — Traefik failed the whole
health.viktorbarzin.me router and returned 404 on every path (the app + pod were
healthy throughout; verified via port-forward).

Correct it to `traefik-health-rate-limit@kubernetescrd`, matching the working
traefik-tripit-rate-limit / traefik-actualbudget-rate-limit references.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 17:43:08 +00:00
Viktor Barzin
086ff85911 health: dedicated 100/1000 rate limit for the redesigned SPA
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor hit 429s browsing the redesigned health app. The default shared limiter
is 10 req/s / burst 50, but each page load is the shell (JS chunks + two
self-hosted Geist woff2) plus a 5-8 call API burst, so fast tab-to-tab
navigation from one client IP overruns burst 50 — Traefik 429s the tail and the
affected cards/pages render empty.

Give health its own limiter (average 100, burst 1000) and skip the default,
exactly as tripit/immich/actualbudget/ha-sofia already do for the same
parallel-burst pattern. Attached via the ingress_factory escape hatch
(skip_default_rate_limit + extra_middlewares).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 13:03:51 +00:00
Viktor Barzin
6dc77f4612 uptime-kuma: add CONTEXT.md + ADR-0001 (intentionally lean; sizing/placement review)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Documents the 2026-06-13 right-sizing review: Kuma is already lean (~1 check/s, 227 monitors mostly at 300s, 77MB on shared MySQL, 30d retention); the 'scraping too much' concern traced to a fixed socket.io login-timeout incident, not load. Records the deliberate decisions (keep per-service [External] monitors over canaries; keep datastore on shared mysql.dbaas) with rejected alternatives + rationale, plus the known internal-sync no-prune gap (stale Goldilocks monitor cleaned up by hand).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 09:11:22 +00:00
Viktor Barzin
05bec26d09 health: internal test-access ingress + DEV_AUTH_EMAIL (ADR-0008)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Add health-test.viktorbarzin.lan (auth=none, allow_local_access_only,
anti-AI off) pointing at the same health deployment, plus a
DEV_AUTH_EMAIL=vbarzin@gmail.com env on the container. Lets automated
E2E / Playwright / manual screenshots reach the live app without the
Authentik SSO redirect, for testing — while the public
health.viktorbarzin.me ingress stays auth=required (forward-auth fails
closed, so the public path always carries the real X-authentik-email
header and never hits the DEV_AUTH_EMAIL fallback). LAN-only, no public
exposure. Decision recorded in health repo ADR-0008.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 04:02:34 +00:00
Viktor Barzin
e6699ed20b uptime-kuma: retry Kuma login in monitor-sync jobs (intermittent socket.io timeout)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The internal + external monitor-sync CronJobs intermittently failed with socketio.exceptions.TimeoutError on api.login(), firing JobFailed -> Slack noise (and leaving monitor sync stale). Kuma 2.3.2 itself is healthy (1/1, 30m CPU); its single Node event loop just briefly stalls under ~300 monitors so the socket.io login handshake occasionally exceeds the client timeout. Wrap connect+login in a 5-attempt / 15s-backoff retry (disconnecting the half-open client between tries) so a transient stall no longer fails the whole job. Applied to both sync scripts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 20:54:14 +00:00
Viktor Barzin
a6381b8cf8 forgejo: custom 8Gi ResourceQuota (was pegged at the 4Gi tier cap)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Yesterday's Forgejo 3Gi->4Gi OOM fix pushed its tier-3-edge namespace quota (requests.memory=4Gi) to 100%, firing KubeQuotaAlmostFull + the healthcheck resourcequota check. Forgejo is the git + OCI-registry backbone and legitimately needs ~4Gi, so the edge tier's 4Gi ceiling is too tight. Opt the namespace out of the auto tier quota (resource-governance/custom-quota=true) and define a forgejo-specific ResourceQuota at requests.memory=8Gi, so the 4Gi pod sits at ~50% with headroom. Same opt-out pattern dbaas uses. Re-tiering was rejected: tier 1-cluster is also 4Gi, and 0-core (8Gi) would over-classify Forgejo's priority/eviction.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 17:16:47 +00:00