Commit graph

3926 commits

Author SHA1 Message Date
Viktor Barzin
0c7ec3d470 tripit: Gmail ingest (12-month) + vbarzin owner + plans@ forward-to-parse
Reconciles the tripit stack source with live state and adds the forward
flow. Ingest now polls vbarzin@gmail.com [Gmail]/All Mail read-only over a
rolling 12-month X-GM-RAW travel-sender window (Croatia Jet2 refs excluded),
filing trips under MAIL_DEFAULT_OWNER_EMAIL=vbarzin@gmail.com (Viktor's
Authentik login identity). Adds an ingest-plans CronJob that polls spam@
filtered to To:plans@viktorbarzin.me (the @viktorbarzin.me catch-all target)
so forwarded bookings are extracted and attached to the matching trip;
IMAP_PASSWORD is overridden per-job to spam@'s creds (PLANS_IMAP_PASSWORD).
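
A minimal sketch of how a rolling 12-month X-GM-RAW search string could be assembled. The sender addresses, the exclusion term, and both helper names are hypothetical; the query the ingest job actually sends is not shown in this commit.

```python
from datetime import date


def rolling_window_start(today: date, months: int = 12) -> date:
    """Return the date `months` months before `today` (day clamped to 28)."""
    total = today.year * 12 + (today.month - 1) - months
    year, month_index = divmod(total, 12)
    # Clamping the day avoids invalid dates such as 31 February.
    return date(year, month_index + 1, min(today.day, 28))


def build_gm_raw(senders, today, months=12, exclude_terms=()):
    """Build a Gmail search string: date floor, sender OR-list, exclusions."""
    start = rolling_window_start(today, months)
    from_clause = " OR ".join(f"from:{s}" for s in senders)
    query = f"after:{start:%Y/%m/%d} ({from_clause})"
    for term in exclude_terms:
        query += f' -"{term}"'  # assumed way of expressing an exclusion
    return query


print(build_gm_raw(["a@example.com", "b@example.com"], date(2026, 6, 3),
                   exclude_terms=["Jet2"]))
```

Recomputing the date floor on every poll is what makes the window roll; a fixed date would slowly widen the search.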

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 10:24:25 +00:00
Viktor Barzin
fd35c4f303 pfSense: LAN-side NAT redirect for mail ports landing on Traefik LB IP
Technitium's split-horizon rewrites *.viktorbarzin.me to 10.0.20.203
(Traefik LB) for the 192.168.1.0/24 Barzini WiFi (TP-Link router has
no hairpin NAT). The rule is name-agnostic so mail.viktorbarzin.me
(and imap./smtp.) get sent to .203 too — where Traefik does not
listen on 25/465/587/993. iOS Mail on Barzini WiFi silently hangs
while Roundcube (port 443 via Traefik) keeps working.

Adds pfSense NAT rdr rules so traffic to 10.0.20.203:{25,465,587,993}
gets redirected to 10.0.20.1 (the mail HAProxy listener already
serving the public path). Loaded on every incoming interface by
pfSense rule generation, so any LAN/VPN client falling into the
split-horizon answer lands on the right service unchanged.
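
The effect of the redirect can be modelled in a few lines. This is only an illustration of the mapping described above, not the pfSense rule syntax; the function name is hypothetical.

```python
MAIL_PORTS = frozenset({25, 465, 587, 993})
TRAEFIK_LB = "10.0.20.203"     # split-horizon answer for *.viktorbarzin.me
MAIL_LISTENER = "10.0.20.1"    # mail HAProxy listener


def redirect_target(dst_ip: str, dst_port: int) -> tuple[str, int]:
    """Return where a TCP connection lands once the rdr rules are applied."""
    if dst_ip == TRAEFIK_LB and dst_port in MAIL_PORTS:
        return (MAIL_LISTENER, dst_port)
    # Everything else (e.g. 443 for Roundcube) is left alone.
    return (dst_ip, dst_port)


print(redirect_target("10.0.20.203", 993))
print(redirect_target("10.0.20.203", 443))
```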

Includes idempotent reproducer script (mirrors the existing
pfsense-haproxy-bootstrap.php pattern) and the networking.md
mail carve-out paragraph plus the stale .200 → .203 reference.
2026-06-03 10:24:25 +00:00
Viktor Barzin
ff26d1c957 openclaw: give recruiter-api plugin the Telegram bot token so it can announce
The recruiter-api plugin's announceEvent() sends recruiter cards to Telegram
via OPENLOBSTER_CHANNELS_TELEGRAM_TOKEN (its fallback path, since OpenClaw
doesn't pass api.bot to "kind: tools" plugins). That env was never set in the
container, so every hourly poll threw on the send, events were never marked
consumed, and no Telegram notification ever went out — the rest of the
"recruiter pipeline has no responses" problem (the GPU/triage half was fixed
separately). Wire it from openclaw-secrets.telegram_bot_token (same token as
channels.telegram.botToken). Verified: the 3 backlogged events were announced
+ consumed on the openclaw restart.

Drafting (the /api/draft 500 that also degraded the cards) was fixed in
parallel by swapping Vault secret/recruiter-responder gpt_mini_model from the
slow/timing-out qwen3-coder-480b to meta/llama-3.3-70b-instruct (~1.6s).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 10:24:25 +00:00
root
c85533d2d9 Woodpecker CI deploy [CI SKIP]
2026-06-03 10:24:25 +00:00
Viktor Barzin
982dc9e63a openclaw: task-webhook ingress auth required->none (inbound Forgejo webhook)
The task-webhook host is an inbound webhook receiver: Forgejo (a machine
with no Authentik SSO cookie) POSTs issue/comment events, so forward-auth
302-bounced every delivery and silently dropped all webhooks. Flip only
this ingress to auth=none; the do_POST handler gates on payload action +
bot-user filtering. Gateway (openclaw) and openlobster stay auth=required.
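
A rough sketch of the kind of gate the handler applies once forward-auth is out of the way. The allowed actions, the bot account name, and the `sender.login` field lookup are assumptions made for illustration; the real do_POST handler is not part of this commit.

```python
ALLOWED_ACTIONS = {"opened", "created"}   # hypothetical action allow-list
BOT_USERS = {"openclaw-bot"}              # hypothetical bot account


def should_process(payload: dict) -> bool:
    """Accept a webhook delivery only for wanted actions from non-bot users."""
    action = payload.get("action")
    sender = (payload.get("sender") or {}).get("login")
    if action not in ALLOWED_ACTIONS:
        return False
    if sender is None or sender in BOT_USERS:
        # Skipping the bot's own events avoids reacting to itself.
        return False
    return True


print(should_process({"action": "opened", "sender": {"login": "viktor"}}))
```

A payload filter like this is not authentication; it only limits what an unauthenticated POST can trigger.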

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-03 10:24:25 +00:00
root
91d110acf5 Woodpecker CI deploy [CI SKIP]
2026-06-03 10:24:24 +00:00
Viktor Barzin
fde2d19bf7 trading-bot: ingress auth required->app (app has own WebAuthn/JWT)
The app ships complete auth — WebAuthn/passkey (RP_ID=trading.viktorbarzin.me)
+ JWT bearer on every /api/* route + a /ws?token=<JWT> WebSocket. Authentik
forward-auth on / was 302-bouncing the WebAuthn XHR flow and the WS upgrade,
making the app unusable. Flip to auth = "app" so the backend's own auth is the
gate (same-origin SPA + bearer-token API, same pattern as immich). Verified all
11 route modules enforce Depends(get_current_user) and dev_mode defaults False
before flipping.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-03 10:24:24 +00:00
Viktor Barzin
e18e0d51a0 uptime-kuma: public status pages + push monitors bypass Authentik
The single uptime ingress gated the ENTIRE site (path "/") behind
Authentik forward-auth, so public-by-design endpoints 302-bounced to
SSO: status pages (/status/<slug>), push-monitor ingest
(/api/push/<key>), status-page API + heartbeat (/api/status-page),
badges (/api/badge), and static assets. Status pages are for
logged-out viewers and push monitors POST from machines — neither can
follow the Authentik OAuth cookie dance, so all were broken.

Fix mirrors the meshcentral agent carve-out (9a15f3f2): add a second
path-scoped ingress_factory (auth="none") pointing at the same
uptime-kuma Service. Traefik routes longest-rule-first, so these
out-prioritise the "/" catch-all; the dashboard (/, /dashboard,
/manage-*, /settings, etc.) stays Authentik-gated via the original
ingress. WebSocket status UI keeps working — the default middleware
chain passes Upgrade/Connection through.
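
The routing outcome can be simulated with a toy model in which a router's priority is the length of its rule text and the highest-priority match wins. The host name, router names, and rule strings below are illustrative and were not copied from the stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Router:
    name: str
    rule: str                    # rule text; its length is the priority
    prefixes: tuple[str, ...]    # path prefixes the rule matches

    @property
    def priority(self) -> int:
        return len(self.rule)

    def matches(self, path: str) -> bool:
        return any(path.startswith(p) for p in self.prefixes)


def route(routers, path):
    """Return the name of the highest-priority matching router, or None."""
    candidates = [r for r in routers if r.matches(path)]
    return max(candidates, key=lambda r: r.priority).name if candidates else None


HOST = "Host(`uptime.example`)"
routers = [
    Router("gated", f"{HOST} && PathPrefix(`/`)", ("/",)),
    Router(
        "public",
        f"{HOST} && (PathPrefix(`/status`) || PathPrefix(`/api/push`) || PathPrefix(`/api/badge`))",
        ("/status", "/api/push", "/api/badge"),
    ),
]

print(route(routers, "/status/infra"), route(routers, "/dashboard"))
```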

Verified: /status/infra, /api/status-page/{,heartbeat/}infra, and
/api/badge now return 200 instead of 302; / still 302s to Authentik.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-03 10:24:24 +00:00
root
17f91f6167 Woodpecker CI deploy [CI SKIP]
2026-06-03 10:24:24 +00:00
Viktor Barzin
bc5aba34b6 meshcentral: fix agent connectivity behind Authentik + TLS-offload Traefik
Two root causes kept all 8 mesh agents (incl. family laptops) offline:

1. The single ingress gated the ENTIRE site (path "/") behind Authentik
   forward-auth, so the agent/relay endpoints (/agent.ashx, /meshrelay.ashx,
   /control.ashx, etc.) got 302-bounced to SSO. Native mesh clients can't do
   the OAuth cookie dance. Fix: add a second ingress_factory (auth="none")
   path-scoped to the agent endpoints, pointing at the same meshcentral
   service. Traefik routes by rule length so these out-prioritise the "/"
   catch-all; the human web UI stays Authentik-gated.

2. After the auth fix, agents reached /agent.ashx but were rejected with
   "Agent bad web cert hash" — MeshCentral pins the OUTER TLS cert, but with
   TLS offload the agent sees Traefik's Let's Encrypt cert (which differs
   between the internal .203 LB and the external Cloudflare path, and rotates
   monthly), not MeshCentral's own webserver cert. Fix: set
   ignoreAgentHashCheck=true in the init-container config so MeshCentral
   echoes back the agent-reported hash. The separate mesh-certificate
   (ServerID) handshake still authenticates the server.

Verified: agent paths no longer 302->authentik; web UI root still does;
laptop "Valia_Laptop" enrolled in group "laptops" and ONLINE.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-03 10:24:24 +00:00
Viktor Barzin
01ea7d6fa1 immich: clip-keepalive CronJob to pin smart-search model warm
MACHINE_LEARNING_MODEL_TTL=600 is a single global knob, so it unloads the
CLIP textual (smart-search) encoder after idle exactly like OCR/face —
immich has no per-model pin. This CronJob pings the textual encoder every
5 min (< the 600s TTL) via immich-ml /predict, so a search query never
pays the ~1.5s cold-load, while idle OCR/face still free their VRAM on the
shared T4. Textual-only (search = text->embedding->pgvector); the visual
encoder is import-time and left to unload. curl baked into the image (no
runtime install).
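
A small simulation of why a 5-minute ping keeps the encoder resident under a 600s idle TTL. It assumes the model unloads once the gap since the last request exceeds the TTL; the function name and timings are illustrative.

```python
TTL_SECONDS = 600     # MACHINE_LEARNING_MODEL_TTL
PING_INTERVAL = 300   # keepalive every 5 minutes


def cold_loads(request_times, ttl=TTL_SECONDS):
    """Count requests that find the model unloaded (the first one included)."""
    loads, last = 0, None
    for t in sorted(request_times):
        if last is None or t - last > ttl:
            loads += 1
        last = t
    return loads


pings = list(range(0, 3601, PING_INTERVAL))
# A search at t=3590s: cold without pings, warm with them.
print(cold_loads([0, 3590]), cold_loads(pings + [3590]))
```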

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 10:24:24 +00:00
Viktor Barzin
f0948493b3 claude-agent-service: wire parallel execution (git-crypt mount, memory, MAX_CONCURRENCY)
The service now runs agent calls concurrently (bounded semaphore, per-job
isolated clones) instead of single-flight. Infra side:
- mount git-crypt-key into the main container (each job re-unlocks its own clone)
- MAX_CONCURRENCY=10 env (excess calls queue FIFO)
- bump pod memory 2Gi req / 12Gi limit, cpu req 1 (Burstable, tier-aux) — sized
  for ~10 concurrent claude+terraform runs; fits node2/3/5 headroom
- docs: beads-auto-dispatch + automated-upgrades no longer describe single-slot

Service code: viktor/claude-agent-service @ 66104a3.
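
A minimal sketch of bounding concurrency with a semaphore, assuming an asyncio-based service. The real service code lives in the referenced repository and may be structured differently; `run_all` and the sleep stand-in are hypothetical.

```python
import asyncio

MAX_CONCURRENCY = 10


async def run_all(n_jobs: int, limit: int = MAX_CONCURRENCY) -> int:
    """Run n_jobs fake agent calls and return the peak number in flight."""
    sem = asyncio.BoundedSemaphore(limit)
    running = peak = 0

    async def job() -> None:
        nonlocal running, peak
        async with sem:                 # excess calls wait here
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)   # stand-in for an agent call
            running -= 1

    await asyncio.gather(*(job() for _ in range(n_jobs)))
    return peak


peak = asyncio.run(run_all(25))
print(peak)
```

The peak never exceeds the limit, which is the property the memory sizing above relies on.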

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 10:24:24 +00:00
Viktor Barzin
16763464cd job-hunter dashboard: role panels now respect the $location filter
The role panels (Top roles, Top companies by role volume, New roles/day,
Roles by source, Salary distribution) had no location filter, so they showed
all locations regardless of the $location dropdown. Add
'primary_location IN (${location:sqlstring})' to each (matching the comp
panels' pattern). Also switch the 'Your comp vs the market' panel from
hardcoded 'london' to the same $location filter for consistency. Data was
fine (all london-tagged roles genuinely contain 'london').

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 23:35:25 +00:00
Viktor Barzin
7a7abe4cbe uk-payslip dashboard: count gross comp on taxable_pay (P60) basis
The 'Yearly receipt' + 'YTD gross salary' panels summed salary+bonus+rsu_vest
(rsu_vest = net/partial RSU), understating gross by ~£73k/yr. Switch to
COALESCE(taxable_pay, gross_pay) + pension_sacrifice = true P60 gross (verified:
23/24 -> £286,288, 25/26 -> £416,646, matching the P60 + job-hunter realized
bar). 'Yearly receipt' rsu_gross is now the real gross RSU (£150k/£271k, not
£70k/£128k). Relabel the Sankey RSU inflow 'RSU (net vested)' for honesty;
leave cash-flow/net_pay + the (taxable_pay-based) reconciliation/rate panels.
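
The new basis, restated per payslip row as a sketch. The column names come from the message above; the numbers are made up, and treating a missing pension_sacrifice as 0 is an assumption (plain SQL addition would yield NULL).

```python
def p60_gross(row: dict) -> float:
    """COALESCE(taxable_pay, gross_pay) + pension_sacrifice for one payslip."""
    base = row["taxable_pay"] if row.get("taxable_pay") is not None else row["gross_pay"]
    return base + (row.get("pension_sacrifice") or 0)


rows = [
    {"taxable_pay": 1000, "gross_pay": 1200, "pension_sacrifice": 50},
    {"taxable_pay": None, "gross_pay": 1200, "pension_sacrifice": 0},
]
print(sum(p60_gross(r) for r in rows))
```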

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 23:23:15 +00:00
Viktor Barzin
aa0d6511b2 job-hunter runbook: document two self baselines + taxable_pay gotcha
Dashboard now shows two 'Me' bars: realized gross (~£409k, from
SUM(payslip taxable_pay) = P60 basis) and package/grant-value (~£267k,
levels.fyi-comparable). Document that gross MUST come from taxable_pay, NOT
salary+bonus+rsu_vest (rsu_vest is net/partial, understates RSU ~50%).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 23:13:35 +00:00
Viktor Barzin
50a4ad70f0 job-hunter runbook: self-comp re-seed stores full TC breakdown
total_value (what the comparison bar uses) must be full TC; document storing
base+bonus+RSU components too so it's verifiable that RSU+bonus are included.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 22:23:42 +00:00
Viktor Barzin
deb0dd4778 monitoring: "Your comp vs the market" panel on Job Hunter dashboard
Add a barchart (panel 10) ranking every company's London p50 total comp
(COALESCE total/base) with the user's current comp shown inline, so it's a
direct "how do I compare" view. The user's figure is NOT hardcoded in the
dashboard JSON — it's a labeled comp_point in the DB (company_slug
'self-current', source 'self', "Me (Meta IC5)"), keeping the sensitive number
out of git. It's below the £500k alert bar (no Slack ping) and ranks too low
to appear in analyze leaders. Runbook documents the panel + how to update the
baseline.

[ci skip] — dashboard ConfigMap applied locally (targeted).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 21:27:26 +00:00
Viktor Barzin
74313149dd job-hunter: weekly above-target Slack alert CronJob
Add job-hunter-alert CronJob (Sundays 05:00 UTC, an hour after the refresh):
`python -m job_hunter alert --threshold 500000 --location london --slack`
posts to Slack the companies whose London p50 total comp >= £500k, flagging
any that newly crossed since last week's snapshot. SLACK_WEBHOOK_URL wired via
the job-hunter-secrets ExternalSecret from Vault secret/job-hunter
slack_webhook_url (seeded from the shared workspace webhook; repointable to a
dedicated channel). Runbook gains an "above-target Slack alert" section.
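
A sketch of the threshold comparison, assuming each snapshot reduces to a company-to-p50 mapping. The function name and data are hypothetical, and counting a company absent from last week's snapshot as newly crossed is an assumption.

```python
def above_target(current: dict, previous: dict, threshold: float = 500_000):
    """Return [(company, p50, newly_crossed)] sorted by p50, highest first."""
    hits = []
    for company, p50 in current.items():
        if p50 >= threshold:
            newly_crossed = previous.get(company, 0) < threshold
            hits.append((company, p50, newly_crossed))
    return sorted(hits, key=lambda h: h[1], reverse=True)


current = {"a": 600_000, "b": 520_000, "c": 400_000}
previous = {"a": 610_000, "b": 480_000}
print(above_target(current, previous))
```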

[ci skip] — applied locally (stack-scoped).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 20:49:42 +00:00
Viktor Barzin
5dc5cd53c0 url/shlink: ingress url.viktorbarzin.me auth required -> none
Authentik forward-auth on the shlink REST API + short-link domain
(url.viktorbarzin.me) 302s shlink-web's cross-origin API XHR (CORS
preflight) and SSO-bounces every public short link. Result: the admin
UI showed "Something went wrong while loading short URLs" and short
links never resolved for logged-out clients.

The shlink REST API is self-gated by its X-Api-Key and short links are
public by design, so Authentik must not front this domain. CrowdSec +
rate-limit + anti-AI bot-block still apply. The admin web UI
(shlink.viktorbarzin.me) stays auth=required via module.ingress-web.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 20:37:33 +00:00
Viktor Barzin
fe8db19aaf job-hunter: build-triggers-deploy model; CronJob :latest + docs
CI now drives the Deployment rollout (kubectl set image to the build SHA in
.woodpecker.yml), so the stack moves to image_tag = "latest": the Deployment
runs whatever CI last set (image ignore_changes keeps TF from fighting it),
and the CronJob uses :latest + imagePullPolicy=Always (fresh pod each weekly
run). Keel stays enrolled in parallel as a redundant net.

Docs: rewrite the runbook "Deploying" section for build-triggers-deploy;
record the reversal of decision #12 in the auto-upgrade design doc (owned
apps drive their own rollout, Keel parallel — upstream stays Keel-only); add
the owned-app deploy model to infra/.claude/CLAUDE.md CI/CD section.

[ci skip] — applied locally (stack-scoped); avoids a broad CI auto-apply.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 20:24:50 +00:00
Viktor Barzin
052c776eba immich: set MACHINE_LEARNING_MODEL_TTL 0->600 to stop GPU VRAM hog
immich-ml at TTL=0 never unloaded models; a heavy OCR library job
inflated onnxruntime's CUDA arena to ~10.7GB and held it on the shared
time-sliced T4, starving llama-swap (qwen3-8b) so recruiter-responder
triage 502'd silently for hours (emails preserved unseen, no loss).
TTL=600 lets idle ad-hoc models (OCR, face) free VRAM while preloaded
CLIP/smart-search stays warm.

Docs: correct stale llama-cpp GPU notes (T4 is time-sliced, no VRAM
isolation; add qwen3-8b to model table), immich MODEL_TTL gotcha in
.claude/CLAUDE.md, and a post-mortem.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 20:16:11 +00:00
Viktor Barzin
cda858d560 job-hunter: weekly refresh CronJob + ops/analyst runbook
Add kubernetes_cron_job_v1.job_hunter_refresh — Sundays 04:00 UTC, runs
`refresh --source ats --source hn --source levels_fyi`, which upserts roles/
comp AND appends the dated comp_snapshots/roles_snapshots series consumed by
`job-hunter analyze`. Mirrors the Deployment's alembic-migrate init container
so a refresh never runs against an un-migrated DB; concurrency Forbid,
backoff 1, 30m activeDeadline, KYVERNO_LIFECYCLE_V1 dns_config ignore.

Add docs/runbooks/job-hunter.md: ops (health checks, manual refresh, add an
ATS company / CDIO watch, secret bag + rotation, failure table, TF apply) and
analyst (the analyze report, query recipes, SQL trend queries against the
snapshot tables, interpretation caveats) sections.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 19:37:57 +00:00
Viktor Barzin
87f1dcb72d wealth: consolidation chunk 2 — net-pay $grain merge, Trend projection, row reorg
Completes the 36->17 consolidation:
- 3 net-pay panels -> 1 "Net pay vs market gain (${grain})" with a cumulative/
  yearly/monthly dropdown (Mixed datasource: payslips-pg + wealth-pg).
- Projection rebuilt as a Trend panel (numeric "Years from today" x-axis) so it
  renders regardless of the dashboard time range — fixes empty-by-default. Drops
  the duplicate projection-row stat cards + the how-to-view text panel.
- Full reorg into 7 collapsed rows: Overview / Net worth over time / Returns &
  contributions / Income vs market / Holdings / RSUs (META) / Projections.

All wealth-pg SQL validated live; net_pay target reuses the existing payslips-pg
source. Visual review pending.

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
848cc7211f t3code: track t3 nightly via health-checked auto-updater
Move t3 from pinned stable (0.0.24, catalog capped at opus-4-7) to the nightly
channel so new models (Opus 4.8) land as t3 ships them. t3-autoupdate (daily
systemd timer) pulls t3@nightly, but applies the Keel-incident lesson: it
health-checks the new binary on a throwaway serve and AUTO-ROLLS-BACK on
failure, and restarts only IDLE per-user instances (defers any with an active
agent child) so an in-flight session is never killed by an update.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
de09e8f294 immich runbook: note force=false re-kick gotcha after row deletion [ci skip]
The videoConversion enqueue is an async scan; deleting encoded_video rows while a
prior scan is in-flight misses them (observed 2026-06-02: 11/3296 picked up on the
first pass). Re-trigger force=false once the queue first drains to waiting:0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
d27df1f321 t3code: dispatch — strip @domain from X-authentik-username (Authentik injects email)
Authentik injects the full email (e.g. vbarzin@gmail.com), but /etc/ttyd-user-map
and dispatch.json key on the local part (vbarzin), so every real login hit
403 'no instance provisioned'. Strip @domain before lookup, matching the
terminal stack's tmux-attach.sh. Verified: vbarzin@gmail.com / emil.barzin@gmail.com
-> 302 (own instance); unmapped/no-header -> 403.
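
The lookup fix in miniature. The dispatch service itself is written in Go; this Python sketch only illustrates the logic, and the map contents and function name are illustrative.

```python
USER_MAP = {"vbarzin": 3773, "emil.barzin": 3774}   # illustrative map contents


def dispatch(header_value):
    """Map an X-authentik-username value to (status, port)."""
    if not header_value:
        return (403, None)
    local_part = header_value.split("@", 1)[0]   # strip @domain if present
    port = USER_MAP.get(local_part)
    return (302, port) if port is not None else (403, None)


print(dispatch("vbarzin@gmail.com"), dispatch("stranger@gmail.com"))
```

Splitting on the first `@` also leaves bare usernames untouched, so both header styles resolve to the same key.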

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
b651f137b9 docs(kms): SXSMSI/1603 is client-machine-specific (VM 300 pilot) + deep-repair/escalation
Pilot on PVE VM 300 established strong counterfactuals: identical kms-bootstrap +
the user's exact journey both reach office/ok on healthy Win10 (CF1 clean install,
CF2 retail O365HomePremRetail->targeted-remove->reboot->VL install). So a persistent
[Failing PreReq=SXSMSI]/1603 is the client's corrupted Windows servicing/Installer
subsystem (below DISM/SFC), not the script/ODT/KMS. Documents the consent-gated deep
repair, the DeepRepairDone marker + in-place-repair escalation, and the
low-disk/guest-agent-drop gotchas hit during the pilot.
2026-06-02 19:24:30 +00:00
Viktor Barzin
481585f6e6 immich: cap streaming transcode bitrate to fix 4K video stutter [ci skip]
Transcodes were uncapped (ffmpeg maxBitrate=0 + preset=ultrafast +
targetResolution=original) -> 77-264 Mbps 4K H.264 files. Mobile playback
streams that copy off the shared 7200rpm sdc pool over inter-VLAN NFS; a single
stream needs ~10-13.5 MB/s and stuttered for every client, local and remote.

Fix (DB system-config, applied via API): maxBitrate=20000k, preset=medium,
transcode=bitrate. 4K resolution preserved; originals never modified. Existing
oversized transcodes regenerated by deleting their asset_file encoded_video rows
+ videoConversion force=false (concurrency 1).

Document config + add runbook docs/runbooks/immich-transcode-bitrate.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
deec540fad t3code: docs — auto-provisioning service-catalog entry + design status implemented
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
a587f0ee55 t3code: ingress -> devvm dispatch+autopair (retire in-cluster nginx)
stacks/t3code now points the Authentik-gated ingress at the DevVM t3-dispatch
service (Service+Endpoints -> 10.0.10.10:3780) instead of the in-cluster nginx,
which is removed. Per-user routing + session auto-injection now live on DevVM.
Verified: external 302->Authentik; in-cluster vbarzin/emil.barzin->302 (auto-pair
to own instance), unmapped->403.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
9f551e3c13 t3code: harden dispatch — dedicated user + validated t3-mint + scoped sudoers
Run t3-dispatch as an unprivileged dedicated user instead of wizard (who has
full sudo). Privileged minting goes through /usr/local/bin/t3-mint, which
validates the target against /etc/ttyd-user-map before minting as that user;
sudoers permits t3-dispatch to run only that wrapper. Compromise of the
network-facing service can mint pairing tokens for mapped users at most.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
0472f67d49 t3code: devvm dispatch + auto-pair service (Go)
Routes X-authentik-username -> per-user t3 instance; on no t3_session
cookie, mints a pairing token (as the OS user) and exchanges it at
/api/auth/bootstrap, injecting the session cookie. Listens :3780, reads
/etc/t3-serve/dispatch.json. Constants from the Task-1 auth-contract spike.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
72aba7da32 t3code: reconcile per-user t3 instances from /etc/ttyd-user-map
Sticky port allocation (3773+), enables t3-serve@<user>, emits
/etc/t3-serve/dispatch.json for the dispatch service. systemd timer
(OnBootSec+hourly) mirrors the apply-mbps-caps pattern.
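
One way sticky allocation could work, as a sketch: users that already hold a port keep it, and new users take the lowest free port from 3773 upward. Releasing ports of removed users for reuse is an assumption, and `reconcile` is a hypothetical name.

```python
BASE_PORT = 3773


def reconcile(users, existing):
    """Return a user -> port map that preserves existing assignments."""
    allocation = {u: p for u, p in existing.items() if u in users}
    used = set(allocation.values())
    next_port = BASE_PORT
    for user in users:
        if user in allocation:
            continue                  # sticky: never move an allocated user
        while next_port in used:
            next_port += 1
        allocation[user] = next_port
        used.add(next_port)
    return allocation


print(reconcile(["wizard", "emo", "new"], {"wizard": 3773, "emo": 3774}))
```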

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
f8a63fdacd t3code: per-user t3-serve@ systemd template (User=%i file isolation)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 19:24:30 +00:00
Viktor Barzin
2152430b70 docs(t3code): record discovered t3 web-auth contract
2026-06-02 19:24:30 +00:00
Viktor Barzin
5e4f83d4e7 wealth: consolidation chunk 1 — merge NW/contribution/growth, returns table, yearly combo
36 -> 19 panels (chunk 1 of 2), zero metric loss:
- 3 NW/contribution/growth timeseries -> 1 "contribution vs market value (+growth)"
- 11 returns/Δ stat cards (12mo x3 + Δ 1d/7d/30d/90d all&mkt) -> 1 "Returns over
  time windows" table (window × Δall/Δmkt/return%)
- 2 yearly barcharts -> 1 combo (contributions/market-gain bars + return-% line,
  timeFrom=10y so full history always shows)

All SQL validated live. Chunk 2 (net-pay $grain merge, projection->Trend panel,
row reorg) to follow.

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 22:27:09 +00:00
Viktor Barzin
a09b0b3612 docs(t3code): implementation plan for per-user auto-provisioning
Task-by-task plan pairing with the design doc: Task 1 discovers the t3
web-auth contract (cookie name + bootstrap body), then systemd template,
reconcile, devvm dispatch+auto-pair Go service, scoped sudoers, TF repoint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 22:19:22 +00:00
Viktor Barzin
1a0647c7ed docs(t3code): design for per-user auto-provisioning (Authentik login → instance + session)
Approach 1: /etc/ttyd-user-map as source of truth; per-user t3-serve@.service
template (User=%i enforces file permissions); devvm reconcile; devvm
dispatch+auto-pair service (mints + injects the t3 session cookie on first
authenticated visit, replacing the in-cluster nginx). Spec for review before
writing the implementation plan.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 22:10:05 +00:00
Viktor Barzin
55ed50b932 docs(plans): wealth dashboard consolidation design
Consolidate the wealth Grafana dashboard 36 -> ~17 panels with zero metric
loss: merge the 3 NW/contribution/growth timeseries into 1, the 11 returns/Δ
stat cards into 1 returns table, the 2 yearly barcharts into 1 combo, and the
3 net-pay-vs-market-gain panels into 1 (grain dropdown); reorganize into
collapsed rows. Also rebuild the projection as a Trend panel (numeric
years-from-today x-axis) so it renders regardless of the dashboard time range
(fixes empty-by-default). Philosophy: merge duplicates, keep every metric.

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 21:52:59 +00:00
Viktor Barzin
73cb0aab8b t3code: per-user isolation via Authentik + nginx username dispatcher
t3 is single-owner (no in-app multi-user), so each person runs their own
`t3 serve` on the DevVM as their own OS user: wizard→:3773 (t3-serve.service),
emo→:3774 (t3-serve-emo.service). An in-cluster nginx `t3-dispatch` maps the
Authentik-injected X-authentik-username to the right instance; unmapped
identities get 403 (no shared fallback). Flipped the ingress auth app→required
(Authentik forward-auth) — the same-origin self-served UI works behind it (WS
carries the Authentik cookie) and t3's own pairing/bearer stays the inner gate.
Mirrors the terminal stack's per-user model.

Verified: dispatcher routes vbarzin→:3773, emil.barzin→:3774, unmapped→403;
t3.viktorbarzin.me now 302s to Authentik. Cross-origin native app / app.t3.codes
intentionally unsupported here — deferred until the native app is published.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 21:38:06 +00:00
Viktor Barzin
9fb3e6e851 docs: correct cloudflared-502 post-mortem + fix stale .200 Traefik ref [ci skip]
Real root cause of the 2026-06-01 full-site 502 was not a missed
reference but an out-of-band fix that Terraform reverted: the 2026-05-30
Traefik .200->.203 migration repointed the Cloudflare tunnel to the
Traefik service DNS via the CF Global API Key, but never landed that
change in cloudflare.tf (left at .200). A terragrunt apply on 2026-06-01
reconciled live back to the stale .200, breaking all external ingress.
Rewrite the post-mortem around the "codify out-of-band fixes or TF
reverts them" lesson (a Terraform-Only-rule violation).

Also fix docs/runbooks/kms-public-exposure.md, which still claimed
Traefik served on 10.0.20.200:443 (now .203) — same migration fallout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 21:25:33 +00:00
Viktor Barzin
f807050eb5 cloudflared: fix tunnel origin .200 -> Traefik svc DNS (full-site 502 outage) [ci skip]
The Cloudflare tunnel routed *.viktorbarzin.me and the apex to
https://10.0.20.200:443, but Traefik moved off the shared MetalLB .200
onto its dedicated 10.0.20.203 on 2026-05-30 (commit 0c01adac). Nothing
serves HTTPS on .200:443 anymore, so cloudflared could not reach its
origin (no route to host / i/o timeout) and Cloudflare returned 502 for
every externally-proxied service. Internal/LAN access (split-horizon ->
.203) was unaffected, which masked the outage.

Repoint both ingress rules at the in-cluster Traefik Service DNS
(https://traefik.traefik.svc.cluster.local:443) -- the design the docs
already described but the code never implemented -- so the tunnel is
decoupled from the Traefik LB IP and this cannot recur on a future move.

Applied live via targeted apply on the tunnel config resource only;
[ci skip] because live already matches and a full stack apply would
churn unrelated pre-existing drift (Keel annotations, DKIM re-chunk).

Post-mortem: docs/post-mortems/2026-06-01-cloudflared-stale-traefik-origin.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 21:22:05 +00:00
Viktor Barzin
30a644d3cd docs(kms): document reboot-after-uninstall / 1603 handling + real-hardware status
The bundled consumer Office removal leaves a pending reboot; a same-run VL
install (or re-run before rebooting) fails with setup.exe 1603. Document the two
guards (hard-reboot gate + reboot-aware 1603 message), the C2R-log capture, and
the on-disk completion poll. Record that the uninstall path is now verified on a
real M365 box (O365HomePremRetail removed) and the install needs a reboot first.
2026-06-01 21:22:05 +00:00
Viktor Barzin
a382683c0e infra: fix containerd forgejo-registry redirect .200->.203 (+skip_verify)
Traefik moved off shared .200 to its dedicated .203 on 2026-05-30, but the
containerd hosts.toml redirect for forgejo.viktorbarzin.me still pointed at the
now-dead .200:443 -> every FRESH forgejo pull failed (cached images kept
running, so it stayed hidden until a new image tag was pulled). Retarget to
.203 and add skip_verify (node dials Traefik by IP; cert is for
forgejo.viktorbarzin.me) in both the new-node cloud-init and existing-node
deploy scripts. Already rolled to all 7 nodes (rewrite + restart containerd,
no drain). Doc fix in .claude/CLAUDE.md.
2026-06-01 21:22:05 +00:00
Viktor Barzin
82855848d1 plans: TopoLVM migration evaluation (Path 3 for LUN-cap relief)
Decision-support doc, NOT a commitment. Evaluates whether replacing
proxmox-csi with TopoLVM would lift the per-VM 29-PVC ceiling
permanently and at what cost.

Key trade-off documented: TopoLVM PVCs are pinned to the node where
the LV lives (topology.topolvm.cybozu.com/node). proxmox-csi PVCs
migrate between VMs when pods reschedule. The data-locality penalty
matters most for single-replica stateful services (MySQL standalone,
Nextcloud, Vaultwarden, mailserver, claude-memory, ~30 SQLite-backed
apps); replicated services (CNPG PG cluster, Redis-v2, Vault Raft)
absorb it.

Three disk-layout options:
  A. Carve per-VM data disks from sdc — simple, no hardware,
     IO contention unchanged
  B. Hybrid SSD/HDD — SSD-constrained at 675 GiB free
  C. Add a dedicated NVMe — also closes beads code-oflt (IO
     contention), ~£200 hardware investment

Effort estimate: 2.5-3 weeks of focused work for the full migration;
covers TopoLVM install, lvmd config, per-VM disk provisioning,
LUKS plumbing, 5 migration waves (regenerable → huge PVCs),
backup-pipeline rewrite, deprecation.

Recommended next step before committing: small pilot on
k8s-node5/6 with one non-critical PVC to validate the operational
pattern end-to-end.

Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap,
docs/runbooks/scale-k8s-cluster.md (Path 1+2 alternative),
beads code-oflt (IO isolation).
2026-06-01 21:22:05 +00:00
Viktor Barzin
599d67db51 docs(kms): self-hosted ODT bootstrapper + anonymous client telemetry (kms-diag/Loki)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 21:22:05 +00:00
Viktor Barzin
f364399ede wealth: add 30y net-worth projection row + align net-pay panel
Implements the committed projections design (docs/plans/2026-05-28-wealth-
projections-{design,plan}.md): a collapsed "Projections" row on the wealth
dashboard with 5 template vars (rate_low/base/high, monthly_contribution=auto,
horizon_years=30), a multi-scenario projection panel (Low/Base/High + trailing-
3y historical line + a base-rate compounding-only line), 3 stat cards, and a
text panel with one-click future time-range links.

Projection is pure SQL over dav_corrected: compound + ordinary-annuity FV from
today's net worth; auto contribution = trailing-12mo run-rate (COALESCE/NULLIF
so $monthly_contribution=auto doesn't constant-fold 'auto'::numeric). Historical
rate = trailing-3-full-year geometric mean of per-year Modified-Dietz returns
(~10.4%) — all-time was a nonsense 83% because the all-accounts-complete window
is only ~4 months, and the true all-time geomean is skewed by 2021's +86%.
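
The two formulas, sketched in Python rather than the dashboard's SQL. Monthly compounding at annual_rate/12 is an assumption, as are the function names; the panel's exact compounding convention is not stated here.

```python
import math


def future_value(pv, annual_rate, monthly_contribution, years):
    """Lump-sum growth plus an ordinary annuity (end-of-month contributions)."""
    n = years * 12
    i = annual_rate / 12
    if i == 0:
        return pv + monthly_contribution * n
    growth = (1 + i) ** n
    return pv * growth + monthly_contribution * (growth - 1) / i


def geometric_mean_return(yearly_returns):
    """Annualised return equivalent to compounding the per-year returns."""
    product = math.prod(1 + r for r in yearly_returns)
    return product ** (1 / len(yearly_returns)) - 1


print(round(future_value(100_000, 0.07, 1_000, 30)))
print(round(geometric_mean_return([0.05, 0.12, 0.15]), 4))
```

Because the geometric mean compounds, a single outlier year pulls it far less than it would pull a short-window annualised figure.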

Also aligns "Net pay vs market gain — per month" to consecutive month-end
deltas (same fix as the other monthly panels). Verified all SQL live.

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 19:50:41 +00:00
Viktor Barzin
32e1042ca8 t3code: expose t3 serve (DevVM) publicly at t3.viktorbarzin.me (app-tier)
New stacks/t3code mirrors stacks/terminal: K8s Service + Endpoints →
10.0.10.10:3773 plus an ingress_factory route (dns_type=proxied,
auth="app"). t3 ships its own owner-pairing + bearer-session auth, so
Authentik forward-auth is intentionally omitted — it would break the
cross-origin native mobile app and app.t3.codes (bearer-only, no
Authentik cookie). CrowdSec + anti-AI (both default-on for app-tier)
rate-limit the public surface; t3's pairing is the gate. TLS is
auto-synced into the namespace by Kyverno's sync-tls-secret policy.

Verified end-to-end: t3.viktorbarzin.me → CF → Traefik → devvm:3773 = 200.
Trade-off (public RCE surface behind app-native auth, no Authentik SSO)
accepted 2026-06-01 to keep the native app + app.t3.codes working.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 19:50:41 +00:00
Viktor Barzin
c5e4b1ea71 kms: add /diag anonymous telemetry collector behind Anubis carve-out
The PowerShell activation scripts POST small JSON diagnostics to
/diag so script execution errors are captured. The collector
(python:3.12-alpine, ConfigMap-mounted) prints each event to stdout
as a KMSDIAG line; the cluster's Loki scrapes pod stdout, making
events searchable in Grafana (Loki only — no Slack, no Prometheus).

Like /scripts, /diag needs a second ingress_factory carve-out with
full_host="kms.viktorbarzin.me" so it bypasses the Anubis PoW
challenge that PowerShell/curl can't solve. Without full_host the
factory would derive kms-diag.viktorbarzin.me and the carve-out
would never match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 19:50:41 +00:00
Viktor Barzin
3fa9e2409c runbook: K8s worker scaling for PVC capacity headroom
Documents the 6-worker cluster shape (post 2026-05-26 scale-up after
the proxmox-csi LUN-cap incident), the six binding constraints (plugin
LUN cap at 29/VM, memory commit, sdc IO contention, GPU concentration
on node1, PVE host memory, no Terraform management for K8s VMs), and
the playbooks for adding/removing workers.

Scale-up triggers:
  - max-node VA count ≥ 25 (~86% of 29 cap) for ≥7 days
  - cluster memory requests > 90%
  - LUN-cap incident
  - planned ≥3 net-new block PVCs when max VA already ≥ 22
Scale-down conditions:
  - max-node PVC count ≤ 20, memory < 70%/95% for ≥30 days
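
The scale-up triggers above can be read as a simple predicate. This is a sketch with hypothetical argument names, not the runbook's tooling.

```python
LUN_CAP = 29   # per-VM plugin LUN cap


def scale_up_reasons(max_va, days_at_or_above_25, mem_request_pct,
                     lun_cap_incident, planned_new_pvcs):
    """Return the list of scale-up triggers that currently fire."""
    reasons = []
    if max_va >= 25 and days_at_or_above_25 >= 7:
        reasons.append("sustained VA count")
    if mem_request_pct > 90:
        reasons.append("memory requests")
    if lun_cap_incident:
        reasons.append("LUN-cap incident")
    if planned_new_pvcs >= 3 and max_va >= 22:
        reasons.append("planned PVCs")
    return reasons


print(scale_up_reasons(26, 8, 75, False, 0))
print(round(25 / LUN_CAP * 100))   # share of the cap at the VA trigger
```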

Playbooks lean on scripts/provision-k8s-worker (clones template 2000,
cloud-inits, auto-joins) for adds; kubectl cordon → drain → delete
node → qm shutdown for removes. Cold-spare option documented.

Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap,
docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md,
beads code-oflt (IO contention long-term fix).
2026-06-01 19:50:41 +00:00