Viktor's tour-guide redo (tripit#30/#31): narration audio is baked-audio-only
now — the fill-tour-audio worker synthesizes the queued (story, telling,
voice) audio while the tts stack's off-peak window (02:00-06:00) has
Chatterbox scaled up. Two idempotent passes: 02:20 after scale-up + model
load, 04:30 insurance against a skipped window or guard yield. Daytime runs
record tts_unreachable and exit quietly by design.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
v2's marker fix proved the install completes, but avdmanager still saw
no system images: it IGNORES ANDROID_SDK_ROOT (and has no --sdk_root),
deriving the SDK root from its own toolsdir — /opt/android in our image,
while packages live on the PVC at /sdk. v3 seeds cmdline-tools into
/sdk/cmdline-tools/latest once and runs avdmanager from there, so it
resolves the PVC as the SDK root.
First boot crashed mid-SDK-install, and the dir-existence check then
skipped reinstall forever: avdmanager saw the partial tree and died with
'Valid system image paths are: null' (CrashLoopBackOff). v2 tracks
install completion with a marker file written only after sdkmanager
succeeds + package.xml exists, wipes partial system-image trees before
reinstalling, and retries sdkmanager 3x.
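The v2 completion check can be sketched roughly as below. This is a minimal Python illustration, not the entrypoint script itself: the marker name `.install-complete`, the function names, and the `install` callable standing in for sdkmanager are all assumptions made for the example.

```python
import shutil
from pathlib import Path

MARKER_NAME = ".install-complete"  # hypothetical marker file name


def install_is_complete(image_dir: Path) -> bool:
    """True only if both the marker and package.xml exist."""
    return (image_dir / MARKER_NAME).is_file() and (image_dir / "package.xml").is_file()


def ensure_installed(image_dir: Path, install, retries: int = 3) -> bool:
    """Install into image_dir unless a previous run finished cleanly.

    `install` is a caller-supplied stand-in for sdkmanager; it returns
    True on success.
    """
    if install_is_complete(image_dir):
        return True
    # A directory without the marker is treated as a partial tree: wipe it.
    shutil.rmtree(image_dir, ignore_errors=True)
    for _ in range(retries):
        image_dir.mkdir(parents=True, exist_ok=True)
        if install(image_dir) and (image_dir / "package.xml").is_file():
            # The marker is written last, so a crash mid-install never leaves it behind.
            (image_dir / MARKER_NAME).write_text("ok\n")
            return True
        shutil.rmtree(image_dir, ignore_errors=True)
    return False
```

The point of the design is that directory existence alone is never trusted; only the marker, written after everything else, counts as done.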
The five imports from the last recovery commit are in state now (verified
serial 4: everything except the deployment). The deployment kept falling
out of state between runs, so instead of a third import round the broken
0-replica deployment object was deleted live (transient recovery step,
presence-claimed) and this apply recreates it Terraform-owned with the
quota-fitting 3Gi requests. Import stanzas must go because TF 1.5 errors
on importing already-managed addresses.
Pipeline 88 imported the namespace but its refresh dropped the PVC, both
services, the ingress and the tls secret from state (PG-backend state
races on this new stack's first applies), so the apply again died on
'already exists' conflicts. State now holds namespace+deployment; adopt
the missing five with import blocks (TF 1.5 errors on importing
already-managed addresses, so only the missing set is listed). Stanzas
come out once applied.
Pipeline 85 created the namespace but a Terraform pg-backend
workspace-creation lock race (new stack schema initializing while other
stacks applied concurrently) left it out of the recorded state — every
later apply then died with 'namespaces android-emulator already exists'.
Adopt it with an import block per the house recovery pattern; stanza
gets removed once it has applied.
First deploy hit 'exceeded quota: tier-quota, requested requests.memory=8Gi,
limited 4Gi' — the generated tier-1 quota caps memory REQUESTS at 4Gi but
allows 32Gi of limits, so go Burstable (requests 3Gi, limits 8Gi) like
tiers 3/4 do, instead of opting the namespace out via custom-quota.
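A quick sanity check of the numbers above, as a hedged sketch: it assumes this pod is the only memory consumer in the namespace (a real ResourceQuota sums across all pods), and the function name is made up for illustration.

```python
def fits_quota(requests_gi, limits_gi, quota_requests_gi=4, quota_limits_gi=32):
    """True if a single pod's memory requests and limits fit the tier quota.

    Defaults are the tier-1 figures quoted in this message: 4Gi of
    requests, 32Gi of limits.
    """
    return requests_gi <= quota_requests_gi and limits_gi <= quota_limits_gi


# First deploy: requests 8Gi, rejected on requests.memory.
first_deploy_ok = fits_quota(requests_gi=8, limits_gi=8)
# Burstable shape: requests 3Gi, limits 8Gi.
burstable_ok = fits_quota(requests_gi=3, limits_gi=8)
```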
Viktor is setting up an Android app development pipeline (tripit is the
first app) and wants agents to natively test changes on Android before
shipping. This adds the testing environment: an API-36 Google emulator
under KVM as a privileged pod (namespace joins the Kyverno exclude list),
SDK/system-image/AVD on a proxmox-lvm PVC, adb on the shared MetalLB IP
10.0.20.200:5555 (LAN only), noVNC screen view at
android-emulator.viktorbarzin.lan. Image is built manually from the
stack's docker/ dir (rare rebuilds; off-infra-CI rule targets repeated
builds). First infra ADR records the trade-offs (devvm/VM/redroid/budtmo
rejected).
The Woodpecker default pipeline selects stacks via git diff HEAD~1 HEAD;
on a merge commit that is the first-parent diff, which contained only the
concurrently-landed files — stacks/tts never got applied (namespace still
absent) and the kyverno re-trigger push got no pipeline at all. Single
non-merge commit touching both stacks so the detector sees them; the
sorted loop applies kyverno before tts, the order tripit#26 requires.
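The selection logic described above can be approximated as follows. This is a sketch only: the real detector is a pipeline step working on `git diff` output, and the function name and example file paths here are hypothetical.

```python
def changed_stacks(paths):
    """Return the sorted stack names for changed files under stacks/<name>/."""
    stacks = set()
    for path in paths:
        parts = path.split("/")
        # Only files inside a stack directory count, e.g. stacks/tts/main.tf.
        if len(parts) >= 3 and parts[0] == "stacks":
            stacks.add(parts[1])
    return sorted(stacks)


# A single non-merge commit touching both stacks (paths are illustrative).
apply_order = changed_stacks([
    "stacks/tts/main.tf",
    "stacks/kyverno/modules/policy.tf",
    "README.md",
])
```

Because the result is sorted, `kyverno` comes before `tts` without any explicit ordering rule.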
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor's tour-guide redo (tripit#26): 87702bdc committed this stack with
[ci skip] so it was never applied — prod tripit has been pointing at a
nonexistent chatterbox-tts service since. This commit triggers the apply
and fixes the voices path: config pointed predefined_voices_path at the
NFS PVC (/data/voices), which nobody can seed without NFS-host shell
access and which would leave /v1/audio/voices empty (it gates readiness).
Use the 28 voices bundled in the image at /app/voices instead; /data
keeps reference audio (future cloning) and the HF model cache.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor's tour-guide redo (tripit#26): the Chatterbox TTS go-live commit
87702bdc carried [ci skip], so CI never applied the kyverno change that
keeps the tts namespace out of low-GPU-priority injection. This comment-only
commit makes CI apply the already-committed change — step 1 of the
kyverno -> tts -> tripit apply order.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor's tour-guide redo (tripit#24, slice tripit#25): the feature shipped
dark on 2026-06-08 because these three env vars were never set, so prod ran
the fake test-fixture providers — the only sight users ever saw was the
placeholder 'Sight 1' narrated by browser TTS. Flips discovery to Wikipedia
GeoSearch, story material to the five real web sources, and script-writing
to claude-agent-service (token already present in tripit-secrets).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
CALENDAR_CONFLICT_PROVIDER=nextcloud + CalDAV base/user on the deployment, and the read-only app-password via tripit-secrets (seeded in Vault secret/tripit). Lets the planning workspace's calendar_check column flag date clashes against the owner's Nextcloud calendar. Same image-first hold-order as the fare scrape — pushed only after the #19 image is live.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The fare scrape's first E2E test was blocked by chrome-service-ws-ingress (9222 admits only namespaces labeled chrome-service.viktorbarzin.me/client=true). Label the tripit namespace per that policy's opt-in design so the planning workspace's live fare fetches reach the shared browser.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sets FARE_PROVIDER=playwright + FARE_CDP_URL on the tripit deployment so the planning workspace's flight_fare cells auto-fetch live Google Flights quotes through the existing in-cluster headed browser (tripit issue #18, ADR-0007 — rate-limited, cached, degrades to manual entry). Viktor asked to complete the trip-planning tickets; this is the infra leg of the fare-scrape slice. Docs: chrome-service architecture + service catalog updated (tripit is now the second active CDP caller; catalog's legacy :3000 WS pool line corrected to CDP :9222). HOLD-ORDER NOTE: pushed only after the tripit image containing FareMode.playwright rolled out (older images crash-loop on the unknown enum).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Keel flip-flops the pgbouncer container's imagePullPolicy, so the
declared Always kept re-diffing on every plan. Ignore it like the
image tag (KEEL_IGNORE pattern) — plan-to-zero restored.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The first apply of the signin-speedup change triggered a ~50min authentik
outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2)
silently DOWNGRADED the Keel-managed live image (2026.2.4) against an
already-migrated DB, default liveness probes kill-looped pods queuing on
authentik's migration advisory lock, and kills mid-migration left ghost
idle-in-transaction sessions holding that lock. Full analysis in
docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md.
Hardening (all root causes):
- values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4)
so helm applies can never downgrade under Keel again
- values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s)
- values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode
pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits)
- pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders;
pgbouncer.tf gets a config-checksum annotation so ini changes roll pods
- authentik_provider.tf: drop the completed import stanza (adoption rule)
- traefik: suppress pre-existing keel.sh annotation/tier-label drift on
auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1
pattern) so applies stop stripping live Keel state
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked to review Authentik and the web tier and make first-time
signin to apps faster. Review found the slowness is screens and round
trips, not server time. Changes:
- values.yaml: the authentik.* Helm values (gunicorn workers, cache
timeouts, conn_max_age) were silently INERT because existingSecret
skips chart env rendering — pods ran defaults (2 workers, 300s
caches, no persistent DB conns). Moved all tuning into
server.env/worker.env, which actually reach the pods.
- authentik_provider.tf: adopt the identification stage and pin
password_stage so username+password render on ONE screen (the
separate order-20 password binding is deleted via API — authentik
requires that when embedding). Outpost log_level trace->info and
1->2 replicas (it is on the hot path of every forward-auth request;
PG-backed sessions make 2 replicas safe).
- authentik module: /static ingress carve-out with immutable
Cache-Control (assets are version-fingerprinted but served with no
max-age — internal split-horizon users got zero caching).
- traefik auth-proxy nginx: upstream keepalive 32 + HTTP/1.1 (was
opening a fresh TCP connection to the outpost per subrequest) +
config-checksum annotation so config changes roll the pods.
- docs: authentication.md + authentik-state.md updated; fixed stale
'postgresql.dbaas has no endpoints' claim in CLAUDE.md/CONTEXT.md
(it is a live CNPG primary-selector compatibility service).
Done via API in the same change (UI-managed objects): 6 OIDC providers
(Vault, Forgejo, Immich, Headscale, linkwarden, Cloudflare Access)
switched from explicit to implicit consent — all first-party, the
4-weekly consent screen only slowed first-time signin.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Bound connection establishment via session ClientTimeout(total=None,
connect=15) instead — works on 3.9 through current; total must stay None
or the session timeout would kill the long-lived probe WS. Verified by a
local 14s smoke run: cloudflare + internal legs both connect.
Closes the loop on Viktor's ask to find the t3 disconnect root cause and
definitively rule infra in or out. Server logs alone cannot separate
'client network broke' from 'Cloudflare/tunnel broke' from 't3-serve
stalled' — every cause collapses into the same 20s-watchdog reconnect.
The t3-probe (stacks/t3code) holds three permanent legs that differ only
in path segment: 'cloudflare' (WS via DoH-resolved public DNS -> WAN ->
CF edge -> tunnel -> Traefik -> dispatch), 'internal' (same WS pinned to
the Traefik LB, no Cloudflare), 't3serve' (HTTP straight to the serve
process). Whichever leg drops convicts its segment; all legs clean while
a user drops exonerates infra with data. Dispatch gains an
unauthenticated /probe/ws echo + /probe/healthz (gorilla/websocket,
test-first) behind an auth=none path carve-out, guarded by the
authentik-walloff probe.
Also starts scraping devvm's node_exporter (job 'devvm') — it ran
unscraped, so the box whose memory/IO stalls cause the drops had zero
pressure history. Alerts T3ProbeLegDown + T3ProbeDropBurst; runbook
docs/runbooks/t3-drop-attribution.md.
Viktor asked to root-cause the frequent t3 code disconnects and rule
infra in or out. The tunnel pods ran bare 'cloudflared tunnel run':
every Cloudflare release made the binary self-update and exit (code 11),
restarting all 3 pods and severing every WebSocket riding the tunnel —
one of the confirmed infra-side drop causes (pods cycled 2026-06-09
20:55/21:00 and 2026-06-10 02:31). Updates belong to pod image rollouts,
not in-place binary swaps.
The Actual web app boots with ~70 near-parallel requests (55
/data/migrations/*.sql + statics, all served cache-control max-age=0 so
every page load re-validates them). The shared rate-limit middleware
(average 10, burst 50) 429s the tail of that storm, so every cold boot
shows 'Server returned an error while checking its status' and every
load stalls in retry backoff (stalls of up to 5min were measured when two
loads from one IP overlap). Viktor asked to relax the limit after the
anca slow-load investigation (beads code-7zv).
Same pattern as immich: dedicated actualbudget-rate-limit middleware in
the traefik stack, budget-* ingresses opt out of the default via
skip_default_rate_limit + extra_middlewares.
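Why the shared limit clips the cold-boot storm can be seen with a small simulation. It assumes a plain token bucket (which is how Traefik documents its RateLimit middleware) and treats the ~70 requests as arriving at the same instant; the function name is illustrative.

```python
def admitted(request_times, average=10.0, burst=50):
    """Simulate a token bucket with `burst` capacity refilled at `average` tokens/s.

    Returns one boolean per request: True if it passes, False if it
    would be answered with a 429.
    """
    tokens = float(burst)
    last = request_times[0] if request_times else 0.0
    results = []
    for now in request_times:
        # Refill for the elapsed time, never above the bucket capacity.
        tokens = min(burst, tokens + (now - last) * average)
        last = now
        if tokens >= 1.0:
            tokens -= 1.0
            results.append(True)
        else:
            results.append(False)
    return results


storm = admitted([0.0] * 70)          # ~70 near-parallel requests
rejected = storm.count(False)         # the tail that gets 429s
```

With the default values the first 50 requests pass and the remaining 20 are rejected, which matches the "tail of the storm" behaviour described above.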
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Completes the internal port table of the mail front door (10.0.20.1):
443 was squatted by the pfSense webGUI (self-signed cert expired 2022),
so internal webmail and the kuma [External] mail probe hit the firewall
login instead of Roundcube — the last leg of the mail split-brain name.
Design (Viktor): route by what the client asked for. New HAProxy
frontend internal_https_443 (binds 10.0.20.1+10.0.10.1 :443, mode tcp):
SNI present -> Traefik .203 with send-proxy-v2 (trusted, IPv6-bridge
pattern, no health check per the PROXY-probe gotcha); SNI of
pfsense.viktorbarzin.{lan,me} or NO SNI (bare-IP admin access) -> webGUI,
which moved to :8443 (invisible to habits — https://10.0.20.1 still
lands on the login page; :8443 doubles as direct fallback). The
reverse-proxy pfsense ingress now targets :8443 directly.
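The routing rule reads naturally as a small decision function. The sketch below is Python for illustration only (the real rule lives in the HAProxy frontend), and the backend labels are made-up strings, not HAProxy backend names.

```python
WEBGUI_NAMES = {"pfsense.viktorbarzin.lan", "pfsense.viktorbarzin.me"}


def route(sni):
    """Pick a backend for a TLS connection on :443 from its SNI (None if absent)."""
    if sni is None or sni.lower() in WEBGUI_NAMES:
        # Bare-IP admin access or an explicit pfsense name: webGUI on :8443.
        return "webgui:8443"
    # Any other SNI goes to Traefik with PROXY protocol v2.
    return "traefik:send-proxy-v2"
```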
Declared idempotently in pfsense-haproxy-bootstrap.php; config.xml
backed up on-box (config.xml.bak-2026-06-10-pre-sni443). Verified:
bare IP -> GUI login; pfsense.viktorbarzin.lan -> GUI;
pfsense.viktorbarzin.me -> 302 via ingress; mail.viktorbarzin.me ->
Roundcube with STRICT cert validation; :993 IMAPS untouched.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked to verify the book-plotting push->build->deploy chain.
The chain itself is healthy, but the Terraform baseline image said
ancamilea/book-plotter:latest while CI (GHA on
PassionProjectsAnca/Plotting-Your-Dream-Book) builds and deploys
viktorbarzin/book-plotter:<sha8> + :latest — a from-scratch apply
would have resurrected a stale March image. Baseline now
viktorbarzin/book-plotter:latest. No live change: the running tag is
CI-owned via ignore_changes, plan confirms the image attr is ignored.
[ci skip] deliberately: plan shows UNRELATED pre-existing drift on
this stack (live ns labels managed-by=vault-user-onboarding +
resource-governance/custom-quota=true would be stripped; deployment
keel.sh/policy=patch annotations removed) — auto-applying that needs
its own reviewed pass.
The roundtrip probe kept failing after the SPF/MX fix: rspamd's actual
junk-score driver was R_DKIM_PERMFAIL(+4.5) on selector brevo2 — Brevo
signs with brevo1/brevo2._domainkey, which are CNAMEs to
b{1,2}.viktorbarzin-me.dkim.brevo.com in public DNS and were absent
from the internal zone (the earlier existence check used ANY queries,
which Cloudflare refuses per RFC 8482 — false negative). The DKIM
permfail also cascaded into DMARC_POLICY_SOFTFAIL(+1.5), pushing the
score to 6.09 against the 6.0 junk threshold; sieve filed probes into
\Junk, where the INBOX poll never finds them.
ingress-dns-sync now maintains both selector CNAMEs. Ops notes: rspamd
caches DNS (restart to flush after zone fixes); CoreDNS denial cache
holds NXDOMAINs up to 300s. Verified: roundtrip SUCCESS in 20.5s.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two fixes from the post-DNS-internalization health sweep:
1. The internal viktorbarzin.me zone served only ingress A/CNAME records.
Since the mailserver pods now resolve the domain through it (CoreDNS
viktorbarzin.me:53 -> Technitium, 59a531b8), rspamd's SPF checks on
inbound @viktorbarzin.me mail saw SPF=none and quarantined it — the
Brevo email-roundtrip probe failed from the 16:20 run onward
(EmailRoundtripFailing/Stale). The ingress-dns-sync CronJob now also
maintains the static mail-auth records (SPF, brevo-code TXT, MX;
DMARC + DKIM were already present), idempotently. Principle: the
internal zone must be a SUPERSET of the public zone for every record
type internal clients consume. Verified in-pod: all four types
resolve; roundtrip re-probe green.
2. cluster_healthcheck #30 queried instant `up`, which goes stale for
~5 of every 10 minutes on the deliberate 10m redfish-idrac remnant
job -> intermittent false "redfish-idrac=missing". Now uses
last_over_time(up[15m]) — same answers for fast jobs.
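The staleness effect in item 2 can be reproduced with a toy model. It assumes Prometheus's default 5-minute lookback for instant queries and scrapes exactly every 600s; the sample timestamps and function names are invented for the illustration.

```python
def instant_up(samples, t, lookback=300):
    """Latest sample value within the lookback window ending at t, else None."""
    recent = [(ts, v) for ts, v in samples if t - lookback < ts <= t]
    return max(recent)[1] if recent else None


def last_over_time_up(samples, t, window=900):
    """Most recent sample value inside the range window ending at t, else None."""
    recent = [(ts, v) for ts, v in samples if t - window < ts <= t]
    return max(recent)[1] if recent else None


# One scrape every 10 minutes, target always up.
samples = [(0, 1), (600, 1), (1200, 1), (1800, 1)]
minutes = range(1200, 1800, 60)  # one evaluation per minute across a scrape cycle
missing_instant = sum(instant_up(samples, t) is None for t in minutes)
missing_ranged = sum(last_over_time_up(samples, t) is None for t in minutes)
```

In this model the instant query has no value for half of the cycle, while the 15-minute range always covers at least one scrape.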
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Forward the viktorbarzin.me:53 pod block to the Technitium ClusterIP
(10.96.0.53, same as the .lan block) instead of 8.8.8.8/1.1.1.1. Pods
become ordinary internal clients (CNAME -> apex -> live Traefik LB;
mail -> 10.0.20.1), fixing the 27 non-proxied [External] uptime-kuma
monitors that rode the TP-Link NAT loopback (hard-down since 06-09;
loopback refuses flows whose source equals the reflection target, which
is true of all pfSense-SNAT'd cluster traffic).
Enabled by re-testing a stale premise: on k8s 1.34 pods DO reach the
ETP=Local Traefik LB IP (kube-proxy short-circuits in-cluster traffic
to LB IPs; verified from pods on three non-Traefik nodes) — re-verify
after major k8s upgrades; canary = [External] fleet going red. The
NAT-layer alternatives (pfSense rdr, SNAT-drop) were rejected: both
fight return-path asymmetry and deepen TP-Link dependency.
Verified in-pod: immich -> .203 + HTTPS 200, mail -> 10.0.20.1,
forgejo -> Traefik ClusterIP (pin kept for Technitium-outage
resilience). Proxied [External] monitors now test the internal path —
true edge fidelity moves to the external vantage (ha-london, next fix).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The keep-set (newest 10 versions + latest + *cache* tags) treats
multi-arch/attestation index CHILDREN — separate untagged sha256
versions — as deletable: for images not rebuilt recently they sort
outside the newest-10 window and were pruned while their kept parent
index survived. kms-website :latest and :dfc83fb children 404'd
(RegistryManifestIntegrityFailure, caught by forgejo-integrity-probe
within hours; deployed tag a794d1a unaffected).
Healed: :latest re-pointed at the intact a794d1a index (also the
newest commit), corrupt :dfc83fb version deleted, probe re-run clean
(0 failures / 22 repos / 63 tags / 59 indexes). DRY_RUN=true applied
live. Re-enable only with a container-aware keep-set — options in the
post-mortem.
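One possible shape for a container-aware keep-set is sketched below. It is not necessarily any of the options in the post-mortem, and the input format (dicts with `digest`, `created`, `tags`, `children`) is an assumption for the example, not the Forgejo API response shape.

```python
def keep_set(versions, newest=10):
    """Return the set of digests to keep.

    Each version is a dict with: digest, created (sortable), tags (list
    of str), children (digests referenced by an index manifest).
    """
    by_age = sorted(versions, key=lambda v: v["created"], reverse=True)
    keep = {v["digest"] for v in by_age[:newest]}
    for v in versions:
        if "latest" in v["tags"] or any("cache" in t for t in v["tags"]):
            keep.add(v["digest"])
    # Container-aware step: whatever a kept index references is kept too.
    by_digest = {v["digest"]: v for v in versions}
    pending = list(keep)
    while pending:
        for child in by_digest.get(pending.pop(), {}).get("children", []):
            if child not in keep:
                keep.add(child)
                pending.append(child)
    return keep
```

The extra pass walks from every kept index to its untagged children, so an old but still-tagged image keeps its per-arch and attestation manifests even when they fall outside the newest-N window.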
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
vzdump-vms pushes vzdump_last_{run,success}_timestamp + vzdump_last_status to
Pushgateway job vzdump-backup, but nothing alerted on them — a stopped/failing VM
backup would be silent (exactly how the nfs-mirror reaping went unnoticed until I
re-verified). Add the trio to the 3-2-1 group in prometheus_chart_values.tpl,
mirroring the LVM/pfSense/nfs-mirror alerts. Stale = >~50h since last success.
NOT [ci]-applied: this is a Terraform stack change — arms on the next
`scripts/tg apply` of the monitoring stack (metrics already flow, so it arms
immediately once applied). Admin-gated apply per org policy.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Round 3 of the forgejo-pull hairpin fix (per Viktor: no per-node
customization — split-brain lives in the DNS infra):
- pfSense Unbound domain override viktorbarzin.me -> Technitium
10.0.20.201 (applied via php write_config, backup on-box). Every
Unbound client on every VLAN now gets the internal split-horizon
answers (live Traefik IP via apex CNAME) with zero per-host config.
- CoreDNS carve-out (TF, applied): dedicated viktorbarzin.me:53 block —
forgejo pinned to Traefik ClusterIP via data source (pods cannot reach
the ETP=Local LB IP pfSense now returns), all other .me names kept on
public resolvers (pods' pre-existing behavior). Replaces the .:53
forgejo rewrite.
- Removed the same-day resolved routing-domain drop-ins from all 7 nodes;
node5/6 link DNS repointed Technitium -> pfSense (netplan + qm 205/206)
for fleet parity; cloud-init no longer writes any DNS drop-ins.
- Docs: dns.md, pfsense-unbound runbook (override + rollback), registry
bullet, post-mortem final-architecture addendum.
Verified: nodes resolve forgejo -> .203 via pfSense, crictl pull OK,
pods resolve forgejo -> ClusterIP / others -> public, mail record works,
.lan zone unaffected.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
New `infra/stacks/tts/` deploys devnen/Chatterbox-TTS-Server (OpenAI-compatible
/v1/audio/speech) as ClusterIP `chatterbox-tts.tts.svc:8000` (server listens on
8004; Service remaps), requesting ONE T4 time-slice. Mirrors stacks/llama-cpp/.
Option A off-peak control (no VRAM isolation on the time-sliced T4 — see
post-mortem 2026-06-02): Deployment sits at replicas=0; three Europe/London
CronJobs own the replica count — `chatterbox-window-up` scales to 1 at 02:00
ONLY IF a free-VRAM preflight passes (sum gpu_pod_memory_used_bytes from
gpu-pod-exporter; free = 16GiB - used >= floor), `chatterbox-vram-guard` yields
the card mid-window if a resident wakes, `chatterbox-window-down` scales to 0 at
06:00. tripit's bake is best-effort + cached-forever (ADR-0002/0004) so a
skipped/aborted window backfills next time. SA+Role+RoleBinding grant the
CronJobs deployments/scale (nextcloud-watchdog pattern).
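The preflight arithmetic is simple enough to state directly. A minimal sketch, assuming the per-pod usage figures have already been fetched from gpu-pod-exporter; the function name is hypothetical, and the 6GiB default floor is the var.vram_free_floor_bytes default noted further down.

```python
GIB = 1024 ** 3


def window_may_open(used_bytes_per_pod, floor_bytes=6 * GIB, total_bytes=16 * GIB):
    """True if free VRAM (total minus the sum of per-pod usage) meets the floor."""
    free = total_bytes - sum(used_bytes_per_pod)
    return free >= floor_bytes


# Two resident GPU pods using 4GiB and 5GiB leave 7GiB free.
quiet_night = window_may_open([4 * GIB, 5 * GIB])
# 6GiB + 5GiB leaves only 5GiB free, below the floor.
busy_night = window_may_open([6 * GIB, 5 * GIB])
```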
Polite-tenant hardening: kyverno `inject-gpu-workload-priority` now excludes the
`tts` namespace (new `gpu_priority_excluded_namespaces` local) so Chatterbox
keeps tier-2-gpu priority (600k) and is always evicted first under GPU pressure
— never immich-ml/frigate/llama-swap. The LimitRange-fallback policy still uses
the base exclude list (tts untouched there).
tripit: add TTS_MODE=openai_compatible, TTS_BASE_URL, TTS_MODEL=chatterbox to
local.app_env (no token — ClusterIP only). No tripit code change.
Image build is documented in stacks/tts/README.md (devnen cu128 target ->
forgejo.viktorbarzin.me/viktor/chatterbox-tts) — build is impractical inline
(large CUDA image + needs the upstream repo). NOT APPLIED — review branch only.
Free-VRAM floor (var.vram_free_floor_bytes, default 6GiB) must be set from the
measured chatterbox-multilingual T4 peak during the first bake.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Heavy in-cluster builds (e.g. tripit buildkit) were taking Forgejo down via
two vectors. Fixes both, without moving Forgejo off the sdc HDD (code-oflt
deferred):
- Memory 1Gi -> 3Gi (requests=limits). Forgejo was OOMKilled (exit 137) under
registry-push load; VPA upperBound ~1.5Gi was suppressed by the 1Gi cap it
kept OOMing against. Size for the push spike.
- Activate registry retention (DRY_RUN false). Verified the delete list
against all running viktor/* images first: 0 running images affected.
Pruned 478 -> 161 package versions; PVC was at its 50Gi autoresize ceiling.
- FIX broken retention auth: the cleanup PAT was ci-pusher's, but Forgejo
scopes container packages per-user, so DELETE on viktor/* returned 403 (the
dry-run only did GETs, hiding it). Repointed forgejo_cleanup_token to
viktor's write:package PAT. Retention had never actually worked.
- Protect buildkit *cache* tags from retention (cleanup.sh keep-set) so the
gentler-builds layer cache survives daily pruning.
[ci skip] — already applied via scripts/tg.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.
Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Boarding-pass images, no embedded DB. Drops LUKS-at-rest (low-sensitivity, accepted).
21.8M copied + verified on NFS; pod 2/2 on NFS; frees one proxmox-csi slot.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New authentik_group 'T3 Users' (members wizard/emo/ancamilea via data lookups — usernames ARE their emails in this Authentik instance) + a branch in the admin-services-restriction expression policy gating t3.viktorbarzin.me to that group, placed BEFORE the ADMIN_ONLY_HOSTS early-return. Surgical two-step targeted apply (group-with-members first, then the gate) → zero lock-out window. Verified: group has all 3 members, the live policy contains the t3 branch, t3 still 302s to Authentik. Membership is HCL for now (FUTURE: roster-reconciled via the Authentik API).
Note: the authentik stack had 3 unrelated pending drift changes (pgbouncer deployment + 2 tls_secrets) — deliberately NOT applied (targeted apply isolated this change; left for the stack owner).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New oidc-power-user-readonly ClusterRole (cluster-wide get/list/watch, NO secrets/exec/write); the power-user binding re-pointed to it (the existing read+write+secrets oidc-power-user role is retained but UNBOUND per ADR-0005). Applied to the rbac stack (2 add, 1 change, 0 destroy). emo added to Vault k8s_users (secret/platform) as power-user, email emil.barzin@gmail.com — the OIDC email IS the Authentik username (verified live). Verified via impersonation: emo gets cluster-wide read, NO secrets/write/exec/delete; anca unchanged.
Provisioner: install_user_kubeconfig writes a per-user OIDC kubeconfig (kubelogin/PKCE — the kubernetes Authentik client is public, no secret; server+CA copied from the admin kubeconfig) if-absent. Written for emo + ancamilea (0600). End-to-end login is interactive (browser OIDC); verified config validity + RBAC, not the live browser flow.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
User-chosen fully-open registration on tuwunel (no CAPTCHA support; browser
challenges break native clients). Bot defense is layered instead:
- Traefik rate-limit Middleware on a path-scoped /register ingress carve-out,
keyed on request Host (GLOBAL /register cap) not source IP — the host is
reachable via both Cloudflare-IPv4 (CF-Connecting-IP) and IPv6-direct (HE
tunnel, no CF header), so a per-source key let IPv6 bots bypass. 10/min,
burst 20, per replica; CrowdSec is the hard backstop on both paths.
- Loki ruler rule MatrixNewUserRegistered -> lane=security -> existing
#security Slack receiver (matches "registered on this server", never the
rejection line). tuwunel's admin bot also posts signups to the admin room.
Dropped the REGISTRATION_TOKEN env (secret/matrix + ESO kept for revert).
Applied via scripts/tg (matrix tier-1 + targeted monitoring configmap), so
[ci skip] to avoid CI full-applying monitoring (unrelated grafana-acl drift).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the cramped Synapse deployment with tuwunel v1.7.1: embedded RocksDB
drops the CNPG dependency (both init-containers, the db ESO, the Reloader
annotation all gone), env-var config, fsGroup-owned encrypted PVC, federation
on, tuwunel-served well-known delegation to :443. server_name unchanged
(matrix.viktorbarzin.me); fresh start (no Synapse->RocksDB migration path).
Registered @viktor admin then disabled registration (403).
Cleanup: removed the orphaned pg-matrix Vault static role and dropped the
matrix Postgres DB/role; updated service-catalog, upgrade-config, CLAUDE.md
PG-rotation list, and the Matrix OIDC->orphaned auth notes. Design+plan in
docs/plans/2026-06-08-matrix-synapse-to-tuwunel-*.
Already applied via scripts/tg (matrix tier-1 + targeted vault tier-0), so
[ci skip] to avoid CI reconciling an unrelated pre-existing vault OIDC
tune-TTL drift.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- ExternalSecret gains SLACK_SIGNING_SECRET / TREK_USER / TREK_PASSWORD /
CLAUDE_AGENT_TOKEN (SLACK_BOT_TOKEN reused from nudges).
- New auth=none ingress carve-out /api/planner/slack (Slack v0 signature-gated,
same pattern as the calendar + emails-confirm carve-outs).
- Remove the superseded standalone stacks/trip-planner (merged into tripit per
the "future travel logic goes in tripit" policy).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Prod ran FLIGHT_PROVIDER=fake, so every flight gate/terminal/time/position was
fabricated from a hash and never matched reality. Switch to real providers:
- FLIGHT_PROVIDER=aerodatabox (RapidAPI free BASIC; AERODATABOX_API_KEY via the
tripit-secrets ExternalSecret)
- RAIL_PROVIDER=realtimetrains (RTT_API_TOKEN, already in Vault)
- poll-flights cron */30 -> hourly to respect the free 600 req/month cap
(provider also self-throttles to <=1 req/sec)
Verified live: /api/segments/<LS1468>/status returns source=aerodatabox with
real schedule/terminal/aircraft.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New public static site at stem95su.viktorbarzin.me serving the school's
Bulgarian STEM platform (dashboard + lessons/games, externally authored
HTML/media exported from Gemini).
- Stock nginx:1.28-alpine serving /srv/nfs/stem-site read-only (nfs_volume),
NOT image-baked — content updated out-of-band (Nextcloud "PVE NFS Pool"
or rsync), no rebuild; auto-backed-up offsite by nfs-mirror.
- ingress_factory auth="none" (open; CrowdSec + ai-bot-block at the edge),
dns_type="proxied" (Cloudflare CNAME auto-created).
- nginx ConfigMap sets index stem_board.html (the dashboard) for "/".
- Docs: service-catalog entry + new "Static Site Hosting" pattern
(NFS-backed vs image-baked) in patterns.md.
Applied via scripts/tg apply; verified live end-to-end (dashboard, 20MB
page, video byte-range, no Authentik redirect) through the public
Cloudflare path.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Root cause of "barely serving films": Real-Debrid's May-2026
infringing_file/HTTP-451 filter blocks WEB-DL releases (which dominate
new content), while degraded sources starved the candidate pool. RD account +
popular-title availability were healthy throughout (library 32/36
unrestrict OK; Matrix 897 / Dune2 694 / Oppenheimer 672 streams).
Runtime config (AIOStreams PG, applied via API — not in this diff):
- Comet timeout 5s -> 10s. Comet is the workhorse (~450+ streams/title)
and was silently dropping the bulk of its results at the 5s cutoff;
Interstellar 430 -> 987 streams after the bump.
- Removed MediaFusion preset: broken upstream ("Invalid configuration"
-> 500 Internal Server Error), contributed 0 usable streams, only a
dead [X] entry in every list.
This diff (Terraform):
- Harden aiostreams-stream-probe: test series AND movie paths, per-source
breakdown (comet/torrentio/stremthru_torz/knaben), error-stream count,
success gated on Comet being alive. The old probe counted only Breaking
Bad streams and stayed green while new-content playback was broken.
- service-catalog: reflect source set + probe behaviour.
[ci skip] — probe already applied via targeted `tg apply` + verified
(series=378 movie=898 comet=206 errors=0 success=1); skipping the full
servarr reconcile to avoid touching unrelated pre-existing drift
(qbittorrent MetalLB annotation, tls_secret cert revert).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Gheorghe fixed his tag format 2026-06-06 (v.1.1.1 -> valid v1.1.1 / v1.1.3), so
drop the :latest+force+match-tag digest workaround and track semver properly:
policy=major (all upgrades, cumulative), match-tag removed (so Keel is free to
climb to higher semver tags), image floor pinned to v1.1.3. Pull policy ->
IfNotPresent (correct for a pinned Keel-managed tag; Always was only needed for
the mutable :latest). Running v1.1.3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resource changes/deletions are now attributable (the novelapp deletion this week
was untraceable because apiserver audit was off). Low-write policy: drops
reads/noise, Metadata level on mutations, omitStages RequestReceived. Wired into
the kube-apiserver static-pod manifest + kubeadm-config (v1beta4
extraArgs/extraVolumes -> survives kubeadm upgrade) on k8s-master; Alloy tails
/var/log/kubernetes/audit/audit.log -> Loki {job=kubernetes-audit}.
Root cause that had silently blocked this AND OIDC for weeks: a stray
kube-apiserver.yaml.bak inside /etc/kubernetes/manifests/ was a duplicate
static-pod manifest kubelet ran instead of the real one, dropping every flag
added to the real manifest. Removed it. Runbook added.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Keel was stuck on v1.0.3 because upstream mghee/novelapp tags newer releases as
`v.1.1.1` (dot after v), which isn't valid semver, so policy=all couldn't see
past the highest parseable tag. :latest correctly points at the newest release,
so switch to force + match-tag digest-tracking of :latest (Kyverno does not
manage match-tag, contrary to the stale code comment). Imports the live
Deployment (recreated out-of-band 2026-06-06) back into TF state; running image
flipped to :latest -> now on v.1.1.1.
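The tag problem can be illustrated with a deliberately simplified parser. This is not Keel's actual version logic; the regex (optional `v` prefix, three numeric parts) and the function name are assumptions for the sketch.

```python
import re

# Simplified pattern: optional "v", then MAJOR.MINOR.PATCH.
SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def highest_semver(tags):
    """Return the tag with the highest parseable version, or None."""
    parsed = []
    for tag in tags:
        m = SEMVER.match(tag)
        if m:
            parsed.append((tuple(int(x) for x in m.groups()), tag))
    return max(parsed)[1] if parsed else None


# "v.1.1.1" has a dot after the v, so it never parses and is skipped.
stuck_on = highest_semver(["v1.0.3", "v.1.1.1", "latest"])
```

Under this model the tracker can never see past `v1.0.3`, however many `v.`-style releases are published after it.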
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Namespace trip-planner (tier=4-aux, keel enrolled), ExternalSecret pulling
secret/trip-planner from vault-kv, DB-creds ExternalSecret from vault-database
(static-creds/pg-trip-planner → asyncpg DSN), Deployment with migrate init
container + main container (readiness+liveness /healthz, 256Mi req=limit, 100m
cpu request), ClusterIP service port 8080, and ingress_factory with auth=none
(Slack v0 HMAC signature verification in-app). Terraform fmt clean. NOT applied;
requires Vault secret/trip-planner + CNPG trip_planner DB + Slack app config.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>