The dedicated #security Slack channel was unreachable: the shared incoming
webhook (Vault secret/viktor -> alertmanager_slack_api_url) belongs to a
Slack app that isn't a member of #security, so any channel override on it
returns HTTP 404 channel_not_found. The goldmane-edges-digest was silently
failing for that reason.
Per request ("dump the security channel, post in an existing one"), route
everything to #alerts instead:
- alertmanager slack-security receiver -> #alerts (keeps its [SECURITY/<sev>]
title styling so security-lane alerts still stand out in the shared channel)
- goldmane-edges-digest CronJob SLACK_CHANNEL -> #alerts (comment only; value
was already switched and applied last change)
- AggregatorDown / DigestFailing alert summaries reworded to say #alerts
- docs swept (security.md, monitoring.md, ADR-0014, goldmane runbook,
.claude/CLAUDE.md, service-catalog, CONTEXT.md) to drop the
"invite the app / flip back to #security" caveats and state the
#security abandonment + #alerts consolidation as the current routing.
Monitoring stack applied (alertmanager rolled, live config verified:
slack-security channel is now #alerts).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a
subagent workflow (distinct stacks in parallel, docs last):
- #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me,
auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081
(the operator's whisker NP default-denies ingress). calico stack.
- #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts
(prometheus_chart_values.tpl) + cluster-health check #48.
- #59 service-identity labels on the multi-Service namespaces (monitoring's 5
TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they
update in-place.
- #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog,
security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md.
#62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the
edge table (feeds code-8ywc; enforce-flips out of scope).
Also fixes the digest's Slack target: #security override 404s channel_not_found
because the shared alertmanager_slack_api_url webhook's app isn't a member of
#security (this likely also breaks alertmanager's slack-security receiver — flagged
in the runbook). Routed to #alerts (the webhook's working channel) until the app
is invited; verified a real digest run posts cleanly (360 edges).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The council-complaints app (Islington civic-reporting pilot) has been
abandoned. It was already dead in the cluster (deployments scaled 0/0,
image only on the decommissioned registry.viktorbarzin.me which 404s),
and it was never in Terraform — only docs + a kyverno comment referenced
it. Its live cluster resources (namespace, both NFS-backed PVs, ingresses)
were torn down out-of-band via kubectl (nothing in TF to drift from); the
DB-dump PVC was backed up to NFS first.
This removes the remaining repo references to the live app:
- service-catalog.md: drop the council-complaints row
- ci-cd.md + .claude/CLAUDE.md: drop it from the GHA->ghcr app list
- kyverno require-trusted-registries: the registry.viktorbarzin.me/*
allowlist comment claimed council-complaints as the last referencer;
rewrite it (no live workload pulls from that registry now; only stale
completed Job records still carry the ref). The allowlist line itself
is kept (registry-scoped, not app-specific).
Historical point-in-time plan docs (docs/plans/2026-05-16-auto-upgrade-
apps-{design,plan}.md) still mention it inside a frozen "10 GHA-migrated
repos (memory id=388)" snapshot; left as-is so the dated record stays
accurate.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor's passkeys all vanished and he was suddenly being asked to log in
multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc
tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned
up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE
each device of users[0] — but the fuzzy search returned the REAL account, so it
wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and
the social-login stage (default-source-authentication-login) had the provider
default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h —
hence the constant re-logins. (Password + passkey logins were already weeks=4.)
Changes:
- authentik: adopt default-source-authentication-login into Terraform (import)
and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long
as password/passkey. Immediate relief without re-enrolling.
- authentik: document the provider-schema gotcha — authentik_stage_identification
exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in
ignore_changes (commit 4e882989 removed them for this reason; re-adding breaks
every apply). The passkey break was purely the missing device records, not drift.
- edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login —
carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule,
make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block),
and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting).
- docs: record the session-duration change, the edge enforcement + auth carve-out
(previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob
(CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert).
Passkey re-enrollment is a manual user action (devices are gone from the DB);
nothing auto-re-deletes them.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wanted people to be able to sign up with GitHub, not just the
native form or Authentik SSO.
- Added a GitHub OAuth2 login source via `forgejo admin auth add-oauth
--provider github` (name "github", matching the callback registered on
the GitHub OAuth App). Like the existing Authentik source, it lives in
Forgejo's DB rather than Terraform — there's no clean TF resource for
login sources. Client id/secret mirrored to Vault secret/viktor
(forgejo_github_oauth_client_id / _secret) for recovery.
- This commit's TF change: ENABLE_AUTO_REGISTRATION=true in
[oauth2_client], so a first GitHub sign-in creates the account directly
("sign up with GitHub") instead of a link-to-existing detour. The
GitHub identity is the trust gate for this path; Turnstile + email
confirmation still gate the native form.
Verified: GitHub recognises the client id, Forgejo's /user/oauth2/github
redirects to GitHub's authorize URL with the correct client id +
callback, and the login page renders the button. Final browser
click-through is the user's to do.
Runbook updated: docs/runbooks/forgejo-open-signups.md (GitHub section +
secret-rotation + DB-loss recreate steps).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wants Forgejo open for anyone to sign up, but without bot/spam
account floods. Flip the deployment from OAuth-only registration
(ALLOW_ONLY_EXTERNAL_REGISTRATION=true) to allowing native local
sign-up, and add two bot gates on the registration form:
- Cloudflare Turnstile captcha (CAPTCHA_TYPE=cfturnstile). The widget
is managed in Terraform (turnstile.tf) via the CF Global API key, so
the sitekey/secret are IaC, not a dashboard artifact.
- Mandatory email confirmation (REGISTER_EMAIL_CONFIRM=true). Wire the
Forgejo mailer to the cluster mailserver as noreply@viktorbarzin.me
(mail.viktorbarzin.me:587 STARTTLS), reusing the same Vault-sourced
credential Authentik uses (email-secret.tf ESO -> secret/authentik
smtp_password).
Existing Authentik OAuth2 login is unchanged (additive). Deployment env
appended (not inserted) so the diff stays purely additive; a reloader
annotation rolls the pod on secret rotation.
Verified live: signup page renders the Turnstile widget, mailer delivers
a test message end-to-end, Forgejo healthy, plan-to-zero after apply.
Runbook: docs/runbooks/forgejo-open-signups.md
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Document the new paperless-ai service and the two non-obvious operational
facts: runtime config lives in the PVC .env (not TF env, which would shadow
it), and Qwen3 needs /no_think for parseable tagging output.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 4 docs for the enforcer -> gated-tracker change:
- runbook t3-version-bump.md: rewritten around the tracker — how each bump is
gated, plus freeze/revert/pin/dry-run/manual-rollback ops.
- post-mortem 2026-06-09: append the deliberate 2026-06-16 reversal and how the
gates close each named root-cause/lesson (historical sections left intact).
- service-catalog t3 row: "PINNED 0.0.24 enforcer" -> gated nightly tracker;
replace the stale "auto-pair 401-broken on 0.0.26" note (re-verified healthy
2026-06-16, cookieless -> 302 + t3_session).
- t3-provision-users.sh step 5b comment: enforcer -> tracker; note Persistent dropped.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Deploy a small stateless anisette-data server so the TripIt iOS Shell can be
sideloaded with SideStore using a free Apple ID, without brokering the
Apple-ID auth dance through a public third-party anisette server (which would
see every login). SideStore points at a stable internal endpoint we control.
- Image: Dadoum/anisette-v3-server, the de-facto standard anisette-v3 server
for SideStore/AltStore. Upstream ships only a mutable :latest (no GitHub
releases / semver / sha tags), so pinned by manifest digest instead of a tag
per the "never :latest" rule. Pulled from DockerHub via the registry-VM
pull-through cache like echo/cyberchef. Diun watches :latest (notify-only) so
a new upstream build prompts a digest re-pin.
- Stateless: emptyDir backs the provisioning-library cache dir (regenerable
download; upstream issue #23 means it doesn't preserve client auth across
restarts anyway) — no PVC, no Vault secret.
- Internal-only endpoint http://anisette.viktorbarzin.lan (auth=none,
allow_local_access_only, ssl_redirect off) — SideStore is a native client
that can't do the Authentik cookie dance, same reasoning as android-emulator's
adb. The .lan CNAME is auto-created by technitium-ingress-dns-sync; never
publicly exposed.
Mirrors the echo/networking-toolbox/android-emulator stack pattern. Service
catalog updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ADR-0002 is fully landed (issues #11-#32 closed): every owned image now
builds on GitHub Actions and pushes to ghcr.io/viktorbarzin/<name>, with
Woodpecker reduced to deploy-only. The Forgejo container registry is frozen
and emptied; there are no in-cluster image builds or CI test runs anywhere.
The docs still described the old hybrid topology (DockerHub builds,
Woodpecker-native owned-app builds, the per-pattern migration lists, the
tripit-only pilot framing), which would mislead future sessions and
incident response.
This brings the docs to the completed reality (closes#33):
- docs/architecture/ci-cd.md: full rewrite as the canonical CI/CD reference —
the fleet GHA->ghcr->Woodpecker-deploy pattern, public/private ghcr package
split, infra-owned image workflows (incl. infra-ci on ghcr), the frozen
Forgejo registry, what Woodpecker still runs, and the #31 decommissions.
- .claude/CLAUDE.md: rewrite the "CI/CD Architecture" section to the
fleet-wide final state; FIX the stale claim that claude-memory-mcp builds
to DockerHub (it is GHA->ghcr); note owned images now live on ghcr and the
Forgejo registry is frozen/break-glass near the image-registry bullet.
- .claude/reference/service-catalog.md: f1-stream is GHA->ghcr + Woodpecker
deploy-only (was "Woodpecker-native build->deploy").
- stacks/{tuya-bridge,android-emulator}/variables.tf + stacks/terminal/main.tf:
cosmetic description/comment updates (forgejo -> ghcr; terminal-lobby has no
CI pipeline). Description/comment text only — no stack logic changed.
Historical records (docs/post-mortems/*, docs/plans/*) and ADR-0002 itself
are left untouched as point-in-time records.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor is setting up an Android app development pipeline (tripit is the
first app) and wants agents to natively test changes on Android before
shipping. This adds the testing environment: an API-36 Google emulator
under KVM as a privileged pod (namespace joins the Kyverno exclude list),
SDK/system-image/AVD on a proxmox-lvm PVC, adb on the shared MetalLB IP
10.0.20.200:5555 (LAN only), noVNC screen view at
android-emulator.viktorbarzin.lan. Image is built manually from the
stack's docker/ dir (rare rebuilds; off-infra-CI rule targets repeated
builds). First infra ADR records the trade-offs (devvm/VM/redroid/budtmo
rejected).
The raw string compare never matched qm config's canonical key order, so
the hourly timer re-issued 'qm set' against every running capped VM,
live-rewriting QEMU throttle state via QMP 24x/day. Implicated in today's
devvm freeze (15:21-16:48 UTC): the guest's disk I/O stalled inside QEMU
(blockstats frozen at 0 while QMP stayed responsive) on the legacy lsi
controller path with no iothread.
Viktor asked to root-cause the freeze before choosing fixes, then approved
mitigating via VM settings: this commit fixes the hourly trigger and
documents the incident; the controller swap (virtio-scsi-single +
iothread=1 + aio=threads) is staged on VM 102 separately, pending his
cold stop/start.
Adds docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md (evidence chain,
ruled-out causes, capture-before-kill autopsy steps) and syncs compute.md
+ proxmox-inventory.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked to go through the agent's stored infra facts and straighten out anything wrong about what-is-where. Cross-checking docs against the live cluster surfaced doc drift alongside the stale memories:
- compute.md: add k8s-node5/6 (joined 2026-05-26) to diagram + node table; totals 48 vCPU / ~176GB -> 64 vCPU / ~240GB; cluster version v1.34.2 -> v1.34.8 (live-verified)
- storage.md: the nfs-proxmox StorageClass no longer exists (removed 2026-04-25, commit 484b4c71) — nfs-truenas is the only NFS SC; fixed three spots that told readers to use nfs-proxmox
- proxmox-inventory.md: k8s VM RAM rows live-verified via kubectl (master 32G, node1 48G, node2-4 32G — the old 16/32/24G figures predated the 2026-04-02 resize), added node5/6 rows, devvm swap 8G -> 14G (grown 2026-06-10), recomputed total (~288GB nominal of 272GB physical, overcommitted)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sets FARE_PROVIDER=playwright + FARE_CDP_URL on the tripit deployment so the planning workspace's flight_fare cells auto-fetch live Google Flights quotes through the existing in-cluster headed browser (tripit issue #18, ADR-0007 — rate-limited, cached, degrades to manual entry). Viktor asked to complete the trip-planning tickets; this is the infra leg of the fare-scrape slice. Docs: chrome-service architecture + service catalog updated (tripit is now the second active CDP caller; catalog's legacy :3000 WS pool line corrected to CDP :9222). HOLD-ORDER NOTE: pushed only after the tripit image containing FareMode.playwright rolled out (older images crash-loop on the unknown enum).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked to add connection logs (Traefik/Cloudflare) to catch the
real-path t3 WS drops: a direct-to-t3-serve browser ran 40 min clean
while real tunnel sessions cycle every 15-35s, so the drop originates
above t3-serve and we need to see which layer cuts the socket.
Traefik (/ws duration) and cloudflared (WS close events) already ship to
Loki; the gap was the devvm side. This adds:
- t3-dispatch logs every /ws open/close with dur_ms + cause:
downstream_closed (client/CF/Traefik hung up = last-mile/network),
upstream_closed (t3-serve closed/reset), or graceful. Graceful closes
previously left no trace (default ReverseProxy only logs on error), so a
watchdog-driven reconnect was invisible. Helpers unit-tested.
- devvm-promtail.{yaml,service}: ships devvm journald (t3-dispatch +
t3-serve@<user>) to cluster Loki as job=devvm-journal, mirroring the
pve/rpi-sofia shippers. devvm was never in Loki (standalone VM).
Joined in Loki the three layers attribute any future drop to a segment
with no repro needed. Runbook + service-catalog updated.
The first apply of the signin-speedup change triggered a ~50min authentik
outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2)
silently DOWNGRADED the Keel-managed live image (2026.2.4) against an
already-migrated DB, default liveness probes kill-looped pods queuing on
authentik's migration advisory lock, and kills mid-migration left ghost
idle-in-transaction sessions holding that lock. Full analysis in
docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md.
Hardening (all root causes):
- values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4)
so helm applies can never downgrade under Keel again
- values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s)
- values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode
pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits)
- pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders;
pgbouncer.tf gets a config-checksum annotation so ini changes roll pods
- authentik_provider.tf: drop the completed import stanza (adoption rule)
- traefik: suppress pre-existing keel.sh annotation/tier-label drift on
auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1
pattern) so applies stop stripping live Keel state
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked to review Authentik and the web tier and make first-time
signin to apps faster. Review found the slowness is screens and round
trips, not server time. Changes:
- values.yaml: the authentik.* Helm values (gunicorn workers, cache
timeouts, conn_max_age) were silently INERT because existingSecret
skips chart env rendering — pods ran defaults (2 workers, 300s
caches, no persistent DB conns). Moved all tuning into
server.env/worker.env, which actually reaches the pods.
- authentik_provider.tf: adopt the identification stage and pin
password_stage so username+password render on ONE screen (the
separate order-20 password binding is deleted via API — authentik
requires that when embedding). Outpost log_level trace->info and
1->2 replicas (it is on the hot path of every forward-auth request;
PG-backed sessions make 2 replicas safe).
- authentik module: /static ingress carve-out with immutable
Cache-Control (assets are version-fingerprinted but served with no
max-age — internal split-horizon users got zero caching).
- traefik auth-proxy nginx: upstream keepalive 32 + HTTP/1.1 (was
opening a fresh TCP connection to the outpost per subrequest) +
config-checksum annotation so config changes roll the pods.
- docs: authentication.md + authentik-state.md updated; fixed stale
'postgresql.dbaas has no endpoints' claim in CLAUDE.md/CONTEXT.md
(it is a live CNPG primary-selector compatibility service).
Done via API in the same change (UI-managed objects): 6 OIDC providers
(Vault, Forgejo, Immich, Headscale, linkwarden, Cloudflare Access)
switched from explicit to implicit consent — all first-party, the
4-weekly consent screen only slowed first-time signin.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Closes the loop on Viktor's ask to find the t3 disconnect root cause and
definitively rule infra in or out. Server logs alone cannot separate
'client network broke' from 'Cloudflare/tunnel broke' from 't3-serve
stalled' — every cause collapses into the same 20s-watchdog reconnect.
The t3-probe (stacks/t3code) holds three permanent legs that differ only
in path segment: 'cloudflare' (WS via DoH-resolved public DNS -> WAN ->
CF edge -> tunnel -> Traefik -> dispatch), 'internal' (same WS pinned to
the Traefik LB, no Cloudflare), 't3serve' (HTTP straight to the serve
process). Whichever leg drops convicts its segment; all legs clean while
a user drops exonerates infra with data. Dispatch gains an
unauthenticated /probe/ws echo + /probe/healthz (gorilla/websocket,
test-first) behind an auth=none path carve-out, guarded by the
authentik-walloff probe.
Also starts scraping devvm's node_exporter (job 'devvm') — it ran
unscraped, so the box whose memory/IO stalls cause the drops had zero
pressure history. Alerts T3ProbeLegDown + T3ProbeDropBurst; runbook
docs/runbooks/t3-drop-attribution.md.
Investigated the 0.0.25 break: it is ONLY an endpoint rename
(/api/auth/bootstrap -> /api/auth/browser-session). The rest of the pairing
contract (credential payload, t3_session cookie, /api/auth/session) is
byte-identical, verified in isolated 0.0.24-vs-0.0.25 sandbox serves. So a
future pin bump is now safe + reversible (pin STAYS 0.0.24 — this is prep):
- t3-dispatch: autoPair tries /api/auth/browser-session, falls back to
/api/auth/bootstrap on 404 — one binary pairs across both versions and any
rolling-restart skew. TDD via TestAutoPairAcrossVersions (red on 0.0.25
before, green after). Built, deployed, verified live on 0.0.24 (all three
users still 302 + t3_session via the fallback).
- t3-autoupdate.sh: health-check now exercises the REAL mint->credential->cookie
handshake (was GET / -> 200, which passed the pairing-broken nightly). A bad
build now auto-rolls-back. Validated against both versions.
- t3-backup-state.{sh,service,timer}: daily online VACUUM INTO of each ~/.t3
state.sqlite (was the only copy, unbacked) -> the one-way forward schema
migration becomes a restore, not sqlite surgery. timeout-guarded.
- runbooks/t3-version-bump.md: the reversible cutover checklist.
- post-mortem #5 (health-check) DONE + #6 added; service-catalog updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The t3-autoupdate timer (re-enabled by the provisioner's step 5b with
`--now`, which fires the missed daily job immediately on a Persistent
timer) pulled t3@nightly 0.0.25 mid-day. That build ran forward schema
migrations on every ~/.t3 state.sqlite (auth_pairing_links/auth_sessions
role->scopes, +proof_key_thumbprint) AND changed the bootstrap API,
breaking t3-mint/pairing for ALL devvm users (pair prompt, no session).
- t3-autoupdate.sh: now a pinned-version ENFORCER (T3_PIN=0.0.24), not a
nightly tracker -- re-asserts the pin (a no-op when correct).
- t3-provision-users.sh step 5b: drop `--now` (it triggered the
immediate missed-job run that pulled the bad build).
- setup-devvm.sh: install pinned t3@0.0.24 at machine setup.
- unit Descriptions + service-catalog reflect the pin.
- post-mortem: 2026-06-09-t3-nightly-autoupdate-auth-outage.md.
Host already reconciled out-of-band: rolled back to 0.0.24, re-enabled
the (now-pinned) enforcer, reset the 2 new users' disposable DBs,
surgically reverted wizard's auth tables to level-30 (96 threads + live
session preserved). All users verified 302 + t3_session.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.
Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
multi-tenancy.md: new DevVM Workstation section (roster SSoT, tiers, config inheritance, locked clone, built-vs-gated status). service-catalog.md t3code row: corrected the stale 'source of truth = /etc/ttyd-user-map' (now roster.yaml; the map/dispatch are GENERATED). offboard-user.md: written (was a referenced-but-missing dead link) — staged reversible-cut-then-gated-destructive for both cluster + workstation surfaces.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
devvm is the t3code Workstation host. Added an 8 GiB swapfile (swappiness=10, fstab-persisted) to turn multi-user OOM-kills into graceful paging (was 0 swap, ~1.2 GiB free of 23). Capacity budget: ~4-5G RAM per active user, max ~3-4 concurrent active sessions.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The migration left a UI-managed (not TF) Authentik OIDC app orphaned — tuwunel
uses native password auth, so nothing consumed it. Deleted application `matrix`
+ OAuth2 provider pk=6 via the Authentik API (user-confirmed). Drop the stale
Matrix rows from the SSO reference tables and update the plan's residual list.
Doc-only [ci skip].
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the cramped Synapse deployment with tuwunel v1.7.1: embedded RocksDB
drops the CNPG dependency (both init-containers, the db ESO, the Reloader
annotation all gone), env-var config, fsGroup-owned encrypted PVC, federation
on, tuwunel-served well-known delegation to :443. server_name unchanged
(matrix.viktorbarzin.me); fresh start (no Synapse->RocksDB migration path).
Registered @viktor admin then disabled registration (403).
Cleanup: removed the orphaned pg-matrix Vault static role and dropped the
matrix Postgres DB/role; updated service-catalog, upgrade-config, CLAUDE.md
PG-rotation list, and the Matrix OIDC->orphaned auth notes. Design+plan in
docs/plans/2026-06-08-matrix-synapse-to-tuwunel-*.
Already applied via scripts/tg (matrix tier-1 + targeted vault tier-0), so
[ci skip] to avoid CI reconciling an unrelated pre-existing vault OIDC
tune-TTL drift.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CI/CD for the stem95su site is intentionally ON-DEMAND, not a CronJob:
the content is short-term and a scheduled job + Vault secret + ESO +
GCP "publish to Production" would be rotting artifacts. Instead, mirror
the source Google Drive folder "claude" → /srv/nfs/stem-site via a
throwaway rclone container using the existing google_workspace OAuth
creds (secret/viktor), rsync to NFS with an empty-source guard, then
shred the temp config. Verified end-to-end. Recipe in claude-memory.
Doc-only: corrects the service-catalog update-mechanism note.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New public static site at stem95su.viktorbarzin.me serving the school's
Bulgarian STEM platform (dashboard + lessons/games, externally authored
HTML/media exported from Gemini).
- Stock nginx:1.28-alpine serving /srv/nfs/stem-site read-only (nfs_volume),
NOT image-baked — content updated out-of-band (Nextcloud "PVE NFS Pool"
or rsync), no rebuild; auto-backed-up offsite by nfs-mirror.
- ingress_factory auth="none" (open; CrowdSec + ai-bot-block at the edge),
dns_type="proxied" (Cloudflare CNAME auto-created).
- nginx ConfigMap sets index stem_board.html (the dashboard) for "/".
- Docs: service-catalog entry + new "Static Site Hosting" pattern
(NFS-backed vs image-baked) in patterns.md.
Applied via scripts/tg apply; verified live end-to-end (dashboard, 20MB
page, video byte-range, no Authentik redirect) through the public
Cloudflare path.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Root cause of "barely serving films": Real-Debrid's May-2026
infringing_file/HTTP-451 filter blocks WEB-DL releases (which dominate
new content), while degraded sources starved candidates. RD account +
popular-title availability were healthy throughout (library 32/36
unrestrict OK; Matrix 897 / Dune2 694 / Oppenheimer 672 streams).
Runtime config (AIOStreams PG, applied via API — not in this diff):
- Comet timeout 5s -> 10s. Comet is the workhorse (~450+ streams/title)
and was silently dropping the bulk of its results at the 5s cutoff;
Interstellar 430 -> 987 streams after the bump.
- Removed MediaFusion preset: broken upstream ("Invalid configuration"
-> 500 Internal Server Error), contributed 0 usable streams, only a
dead [X] entry in every list.
This diff (Terraform):
- Harden aiostreams-stream-probe: test series AND movie paths, per-source
breakdown (comet/torrentio/stremthru_torz/knaben), error-stream count,
success gated on Comet being alive. The old probe counted only Breaking
Bad streams and stayed green while new-content playback was broken.
- service-catalog: reflect source set + probe behaviour.
[ci skip] — probe already applied via targeted `tg apply` + verified
(series=378 movie=898 comet=206 errors=0 success=1); skipping the full
servarr reconcile to avoid touching unrelated pre-existing drift
(qbittorrent MetalLB annotation, tls_secret cert revert).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Stand up upstream TREK (mauriceboe/trek:3.0.22, AGPL) as a low-commitment
trial to evaluate the self-hosted group-trip use case before building a
custom app. Solo, single shared instance, Authentik forward-auth.
- stacks/trek: namespace, deployment (pinned, TF-managed, no CI/Keel),
service 80->3000, ingress_factory auth=required + proxied DNS at
trek.viktorbarzin.me, TLS. Two proxmox-lvm-encrypted PVCs (SQLite data +
uploads) -- encrypted per the sensitive-data rule and to avoid the
SQLite-over-NFS locking hazard.
- Trial secrets posture: ENCRYPTION_KEY auto-generated on the data PVC,
bootstrap admin in pod logs -- no Vault/ESO. Graduation TODOs documented
in main.tf + service-catalog (Vault key, app-level SQLite backup, OIDC SSO).
- kyverno: add mauriceboe/* to require-trusted-registries allowlist (the
policy is Enforce since 2026-05-19 -- also fixed the stale "stays in
Audit" header comment that said otherwise and misled the deploy).
- Runs free on OpenStreetMap (no paid maps key). Rallly availability-poll
companion deferred per solo-trial scope.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Gmail All-Mail scrape (tripit-ingest-mail) is retired — Viktor only wants
mail ingested when forwarded to plans@viktorbarzin.me, and only from actual
users. Dropped the ingest-mail CronJob and removed MAIL_DEFAULT_OWNER_EMAIL
from ingest-plans (the app now ignores mail from non-users instead of filing it
under the default owner). ingest-plans already carries EMAIL_PROVIDER/SMTP_* for
the new sender notifications. Service-catalog updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
f1-stream was extracted to its own Forgejo repo + deployed from the Forgejo
registry (2026-06-05). Correct the stale "Migrated to GHA / repo id 10" claims:
- CLAUDE.md + ci-cd.md: move f1-stream from the GHA list to the Woodpecker-native
owned-app group; note old github source archived + GHA Woodpecker repo 10
deactivated; f1-stream is now Woodpecker repo 166.
- service-catalog: note the source repo + deploy model.
Add a proxmox-lvm-encrypted RWO PVC (tripit-personal-documents) mounted at
/data/personal-documents on the app container, PERSONAL_STORAGE_DIR env, and the
DOCUMENT_ENCRYPTION_KEY ExternalSecret entry (seeded in Vault secret/tripit). A
root chown init-container makes the block volume writable by the non-root app
without touching the NFS doc vault. Backs the new owner-only encrypted personal
document vault in the tripit app.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Commit 63fe7d2b (fan-control) was made with a bare `git commit` in the
shared infra working tree and inadvertently swept in a parallel session's
staged f1-stream-extraction work (main.tf repoint, ~48 files/ removals,
ci-cd.md + .claude docs, two extraction plan docs).
This returns every f1-stream-related path to its pre-63fe7d2b state
(3493c347) so that extraction can be committed cleanly by its own
session. The fan-control files added in 63fe7d2b are untouched.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The iDRAC stock curve runs the CPU at ~72°C on the 7080 RPM floor even
under load (optimises for quiet, not cool). Add a bash daemon + systemd
unit that drives the chassis fans from CPU temp on two curves, picked by
garage occupancy (the server is in the garage): COOL when empty
(measured ~58-65°C under load), QUIET near the silent floor when the
ha-sofia garage door shows someone is there (open, or <15min since last
activity).
Manual fan mode is backstopped: bash EXIT trap + systemd ExecStopPost
hand fans back to Dell auto on stop/crash; CPU>=83°C or repeated IPMI
failures do the same. Pushgateway metrics (job=fan_control). 36 unit
tests cover the pure curve/hysteresis/presence/parse logic; DRY_RUN +
RUN_ONCE for integration checks. Deployed and verified on 192.168.1.127
(CPU 70->58°C in cool mode, hysteresis stepping confirmed).
Design: docs/plans/2026-06-04-pve-fan-control-design.md
Runbook: docs/runbooks/fan-control.md
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reflect the dashboard-nav-readonly ClusterRole: namespace-owners can list
namespaces/nodes (for dashboard nav) but not read other tenants' resources.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Update authentication.md, multi-tenancy.md, service-catalog, add-user skill to
reflect the token-injector (X-authentik-username -> SA token -> Bearer). Note the
extra k8s-dashboard apply needed when onboarding a namespace-owner (injector map
regen). [ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Correct the docs I'd written for the (reverted) oauth2-proxy SSO. Reality:
apiserver OIDC rejects all Authentik tokens (design §12), so the dashboard
uses forward-auth (admits kubernetes-* groups) + per-namespace SA token-paste.
Updates authentication.md, multi-tenancy.md, service-catalog, authentik-state,
and add-user skill (onboarding now documents the dashboard token). oauth2-proxy
+ k8s-dashboard OIDC app noted as idle. [ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Update authentication.md (structured multi-issuer AuthenticationConfiguration
+ dashboard SSO flow), multi-tenancy.md (web dashboard access), authentik-state
(new k8s-dashboard app + gheorghe groups), service-catalog (dashboard auth),
and the k8s-version-upgrade runbook (kubeadm wipes --authentication-config →
re-apply rbac post-upgrade). Design/plan addenda record the issuer-constraint
pivot from the original dual-aud approach. [ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Move t3 from pinned stable (0.0.24, catalog capped at opus-4-7) to the nightly
channel so new models (Opus 4.8) land as t3 ships them. t3-autoupdate (daily
systemd timer) pulls t3@nightly, but applies the Keel-incident lesson: it
health-checks the new binary on a throwaway serve and AUTO-ROLLS-BACK on
failure, and restarts only IDLE per-user instances (defers any with an active
agent child) so an in-flight session is never killed by an update.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
t3 is single-owner (no in-app multi-user), so each person runs their own
`t3 serve` on the DevVM as their own OS user: wizard→:3773 (t3-serve.service),
emo→:3774 (t3-serve-emo.service). An in-cluster nginx `t3-dispatch` maps the
Authentik-injected X-authentik-username to the right instance; unmapped
identities get 403 (no shared fallback). Flipped the ingress auth app→required
(Authentik forward-auth) — the same-origin self-served UI works behind it (WS
carries the Authentik cookie) and t3's own pairing/bearer stays the inner gate.
Mirrors the terminal stack's per-user model.
Verified: dispatcher routes vbarzin→:3773, emil.barzin→:3774, unmapped→403;
t3.viktorbarzin.me now 302s to Authentik. Cross-origin native app / app.t3.codes
intentionally unsupported here — deferred until the native app is published.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New stacks/t3code mirrors stacks/terminal: K8s Service + Endpoints →
10.0.10.10:3773 plus an ingress_factory route (dns_type=proxied,
auth="app"). t3 ships its own owner-pairing + bearer-session auth, so
Authentik forward-auth is intentionally omitted — it would break the
cross-origin native mobile app and app.t3.codes (bearer-only, no
Authentik cookie). CrowdSec + anti-AI (both default-on for app-tier)
rate-limit the public surface; t3's pairing is the gate. TLS is
auto-synced into the namespace by Kyverno's sync-tls-secret policy.
Verified end-to-end: t3.viktorbarzin.me → CF → Traefik → devvm:3773 = 200.
Trade-off (public RCE surface behind app-native auth, no Authentik SSO)
accepted 2026-06-01 to keep the native app + app.t3.codes working.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Internal split-horizon resolves kms.viktorbarzin.me to Traefik (10.0.20.203),
which has no :1688 listener — so LAN clients pointed at kms.viktorbarzin.me:1688
failed with 0xC004F074 "no KMS could be contacted". Add a dedicated A-only
vlmcs.viktorbarzin.me (cloudflare_record.vlmcs -> 176.12.22.76 for the public
WAN NAT; Technitium -> 10.0.20.202 internal, set via API) so it resolves to
vlmcsd both ways. Also carve /scripts/* out of Anubis (module.ingress_scripts
-> bare kms-web-page service) so `iwr | iex` downloads the real script instead
of the PoW challenge HTML.
Verified end-to-end on Win VM 300: reproduced 0xC004F074 on the old host, then
slmgr + ospp + both PowerShell one-liners all -> Licensed via vlmcs (10.0.20.202).
Docs: kms-public-exposure runbook + service-catalog entry.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add tripit (self-hosted TripIt-clone travel-itinerary PWA) to the
service catalog Optional tier and Non-Proxied DNS list, and to the
CNPG consumer + PostgreSQL rotation lists in the databases doc.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- New stack `paperless-mcp` running barryw/PaperlessMCP v0.1.19 (.NET,
HTTP+SSE on :5000) wraps paperless-ngx's built-in FTS. 43 tools exposed.
- In-cluster only egress to paperless-ngx svc; no Cloudflare hop on
MCP-internal traffic.
- Read-only at paperless layer: dedicated `claude-mcp` user (non-superuser)
in new `claude-mcp-readers` group with view-only Django perms; existing
279 docs bulk-granted view perm via /api/documents/bulk_edit/;
workflow #2 auto-grants the group on new docs (Consumption Added).
- Gateway-level bearer auth via new Traefik plugin
Aetherinox/traefik-api-token-middleware@v0.1.4 (loaded in traefik stack
alongside crowdsec-bouncer); per-stack Middleware CRD `bearer-auth`
pulls token list from Vault `secret/paperless-mcp/bearer_tokens`.
- Vault `secret/paperless-mcp` holds: paperless_api_token (synced to
K8s Secret via ESO; pod env via secret_key_ref), bearer_tokens
(JSON array, read at plan time), bearer_token_viktor_laptop (mirror
for laptop wiring), paperless_user_password (paperless UI fallback).
- Image auto-update via Keel (semver minor policy, hourly poll).
- Ingress dns_type=proxied → Uptime Kuma external monitor auto-created
by external-monitor-sync CronJob.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>