Commit graph

4714 commits

Author SHA1 Message Date
Viktor Barzin
316cdb7441 docs: valia-sites runbook + dns.md CM mechanism + service-catalog entries
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Runbook covers add/update/retire (one map entry; internal DNS now
cleans up after itself), content rules for Valia's folders, and the
failure modes incl. both token re-mint paths. dns.md superset-rule
paragraph now describes the declarative ConfigMap reconcile instead of
hand-added static CNAMEs. Catalog: new valia-sites row; stem95su row
notes its Pages cutover is parked on the 42.9MB stem_video.mp4
exceeding the 25MB Pages per-file cap.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:46:24 +00:00
Viktor Barzin
4a3c8287c3 Merge remote-tracking branch 'forgejo/master' into wizard/valia-sites
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-07-03 12:43:28 +00:00
Viktor Barzin
e0991853e4 valia-sites: 25MB Pages-limit guard; cloudflared: drop removed{} (CI TF <1.7)
Two fixes from the first live runs. (1) The sync job now skips a whole
site when any file exceeds Cloudflare Pages' 25MB per-file cap, leaving
current serving untouched — stem95su's stem_board.html references a
42.9MB stem_video.mp4, which made every run fail; the guard turns that
into a loud skip so bridge keeps syncing. (2) The CI terraform is older
than 1.7 and rejects removed{} blocks anywhere (pipelines 461/464), so
the bridge record handoff was completed with a one-time manual
'tg state rm module.cloudflared.cloudflare_record.bridge_pages' from
the main checkout; the block is deleted and the module comment records
the manual step.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:43:13 +00:00
Viktor Barzin
348f64d34d ADR-0017: add physical-cabling diagram (wires only)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked for one diagram showing just the physical connections
between nodes, separate from the logical/VLAN topology: ISP->AX6000,
the in-wall apartment->garage run into P1, 4G router (cellular OOB),
UPS mgmt, the PoE cat6 to the camera, the LAN1 cable to eno1, dark
eno2 fallback + free eno3/4, iDRAC on shared-LOM, and the note that
everything else on the R730 is virtual. Referenced from the ADR next
to the logical SVG.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:40:29 +00:00
Viktor Barzin
126cf4c88e Merge origin/master into wizard/cctv-adr-trunk
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-07-03 12:32:00 +00:00
Viktor Barzin
695e020111 cloudflared: move bridge removed{} to stack root — removed blocks are root-module-only
Some checks failed
ci/woodpecker/push/default Pipeline failed
Pipeline 461 failed terraform init: the removed{} handoff block sat in
the stack-local module, but Terraform only allows removed blocks in the
root module. Same intent, correct position (from =
module.cloudflared.cloudflare_record.bridge_pages, destroy=false).
Without this the stale state entry would make the next cloudflared
apply destroy the record valia-sites now owns.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:31:53 +00:00
Viktor Barzin
5d16a18cf4 ADR-0017: document trunk traffic semantics + ASCII topology
While reviewing the single-switch design Viktor asked whether both the
home LAN and the camera VLAN 'go via pfSense which forwards upstream' -
a natural misreading a future reader would repeat. Added a section
spelling out the vmbr0 fork: untagged home LAN is L2-bridged past
pfSense (gateway stays the AX6000, rack outage does not affect it, OOB
via 4G survives), while tagged-30 can only land on the dCCTV interface,
making a pfSense bypass impossible by construction. Includes a compact
ASCII topology for terminal readers alongside the SVG.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:31:48 +00:00
Viktor Barzin
8b80b4cc41 valia-sites: registry stack for Valia's Pages sites + declarative internal DNS (ADR-0018)
Some checks failed
Build valia-sites-sync / build (push) Waiting to run
ci/woodpecker/push/default Pipeline failed
Valia keeps asking Viktor to host 1-page sites from her Drive folders;
this makes it one map entry. New stacks/valia-sites: per site a CF Pages
project + custom domain + proxied CNAME (bridge adopted via import{}),
a ConfigMap feed (valia-sites-dns) the technitium ingress-dns-sync
script now reconciles internal CNAMEs from (add/update/REMOVE — fixes
the add-only stale-record gotcha), and one shared 10-min CronJob that
mirrors each Content folder (rclone, drive.readonly, stem95su's guards)
and wrangler-deploys ONLY on manifest change (free-tier deploy cap).
Scoped CF Pages token + shared rclone conf in secret/valia-sites; the
Global API Key never enters a pod. cloudflared forgets bridge's record
via removed{} (no destroy). stem95su is in the map dns-parked
(manage_dns=false) until its cutover commit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:28:06 +00:00
Viktor Barzin
5c42155b81 docs: Valia-sites domain language + ADR-0018 (off-infra Pages, in-cluster sync)
Grill session with Viktor: his mother Valia will keep asking for 1-page
site hosting, so the pattern is being made repeatable. Decisions: all
Valia sites serve off-infra on Cloudflare Pages (survive homelab
outages); one shared in-cluster CronJob mirrors her Drive folders every
10 min and redeploys on change; English subdomain names picked by
Viktor; failed-Job-only visibility; stem95su migrates onto the pattern.
CONTEXT.md gains Valia site / Content folder / Entry file; full
rationale and rejected options in ADR-0018.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:17:45 +00:00
Viktor Barzin
e1bd111562 rename CF Pages site most.viktorbarzin.me -> bridge.viktorbarzin.me
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to rename the 'мост' school static site to 'bridge'.
New Cloudflare Pages project 'bridge' (bridge-cv2.pages.dev) already
deployed and the custom domain attached; this renames the public CNAME
(TF resource most_pages -> bridge_pages, destroy+create swaps the
record) and the internal split-horizon static CNAME in the
ingress-dns-sync CronJob. The old 'most' Pages project and the stale
internal 'most' record are removed out-of-band after this applies.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 10:52:30 +00:00
Viktor Barzin
7dd80b6c7c technitium: mirror most.viktorbarzin.me into the internal zone (CF Pages site)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The internal split-horizon zone is authoritative for viktorbarzin.me,
so the new Cloudflare Pages site (most.viktorbarzin.me, added for
Viktor's 'мост' school static site) NXDOMAINed for every internal
client — LAN, VLANs and pods — while resolving fine externally.
Per the superset rule, add it as a static CNAME (-> most-6if.pages.dev)
in the ingress-dns-sync CronJob next to the mail-auth records, and
document the off-infra-site case in dns.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 10:10:46 +00:00
Viktor Barzin
217a54be9d cloudflared: add most.viktorbarzin.me CNAME for Cloudflare Pages site
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to host a static HTML site (the 'мост' school project,
ОбУ „Отец Паисий", pulled from his Google Drive) on Cloudflare Pages
with a custom domain, as a try-out of Pages hosting. The site content
is deployed off-infra via wrangler to the Pages project 'most'
(most-6if.pages.dev); this CNAME points most.viktorbarzin.me at it.
The custom domain is already attached to the Pages project and is
waiting on this DNS record to validate.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 10:06:33 +00:00
Viktor Barzin
be80ef23bb ADR-0017 rev 3: single switch — PE replaces the SG105E, CCTV rides a VLAN-30 trunk on the LAN1 cable
Viktor prefers not running two switches, so the TL-SG105PE takes over
all rack duties (apartment uplink, 4G, UPS, camera PoE) and the CCTV
segment moves onto a managed tagged trunk over the existing LAN1 cable:
pfSense net3 re-pointed from vmbr2 to vmbr0 tag=30 (applied live; same
MAC so vtnet3/dCCTV survived untouched). This is safe where the original
802.1Q rejection was not, because the managed switch is the only device
on eno1 and polices VLAN-30 membership. eno2/vmbr2 kept dormant as the
documented fallback. Old SG105E retires to cold spare; PE inherits
192.168.1.6. Glossary Segment term updated (all three segments are now
bridge-tags feeding untagged pfSense vNICs).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 09:15:52 +00:00
Viktor Barzin
4082934bc1 Merge origin/master into wizard/cctv-two-switch
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-07-03 08:37:34 +00:00
Viktor Barzin
e11bd6e893 ADR-0017 rev 2: two switches — the PE is a dedicated CCTV island, no VLAN table anywhere
Viktor asked to verify free ports on the garage switch (192.168.1.6)
before finalizing. Logging into it showed it is NOT the TL-SG105PE from
the plan but a pre-existing non-PoE TL-SG105E with 4 of 5 ports in use
(apartment uplink, R730 LAN1, 4G router, UPS) - the single-shared-switch
port-VLAN design written earlier today was based on conflating the two
devices. Corrected: the new TL-SG105PE carries ONLY camera + eno2
uplink (mgmt 10.0.30.6 inside the segment), the old switch is untouched,
and no VLAN config exists anywhere. ADR, topology SVG and networking.md
updated to match.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 08:37:15 +00:00
Viktor Barzin
08fb65827c tripit: set PLACE_PHOTO_PROVIDER=wikipedia — real place preview photos
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked for place photos on the tripit Trip board. The app-side
work (add-time photo fetch, board place cards) shipped in tripit
v0.106.0, but prod never set PLACE_PHOTO_PROVIDER, so the fake provider
would store placeholder PNGs for every hand-added place. Same class of
fake-default gap as PLACE_RESOLVER_MODE (set explicitly for the same
reason); the ADR-0035 rollout had left both the env flip and its
backfill cron undone.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 21:57:21 +00:00
Viktor Barzin
b761701994 ADR-0017: add network topology diagram (SVG) next to the decision
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked for a reviewable network visualization committed alongside
the CCTV-segment ADR. Hand-drawn SVG (renders on Forgejo, validated
palette): physical path camera -> TL-SG105PE port-VLANs -> eno2/vmbr2 ->
pfSense dCCTV, the firewall flows (Frigate RTSP, ha-sofia ISAPI/RTSP,
NTP-only egress, default deny), and the dashed camera-day steps (patch
cable, cat6 run, AX6000 static route).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 20:25:28 +00:00
Viktor Barzin
248e186dce CCTV segment (dCCTV 10.0.30.0/24) on a dedicated pfSense leg for the garage camera
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor and emo are adding the first owned camera at the Sofia site (HiLook
IPC-T241H-C watching the garage / server rack). Viktor asked to finalize
emo's plan; the grilling session resolved emo's five open decisions and
replaced the doc's 802.1Q-trunk idea with the site idiom: a dedicated
physical leg (R730 eno2 -> vmbr2 -> pfSense net3 = dCCTV 10.0.30.1/24),
port-based VLAN split on the shared TL-SG105PE, camera default-deny with
NTP-only egress, Frigate + ha-sofia as the only consumers.

The PVE bridge, pfSense interface, Kea subnet and firewall rules were
applied live this session (hand-managed hosts, backed up). This commit
records the decision (ADR-0017), the glossary terms (Segment / CCTV
segment), the as-built architecture doc, and bumps Frigate's ADR-0016
VRAM budget 2000 -> 2300 MiB for the upcoming NVDEC stream.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 20:01:45 +00:00
3a5194c9d4 Merge pull request 'immich(frame-emo): show photos from the last 365 days (was 730)' (#18) from emo/frame-emo-1year into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Reviewed-on: #18
2026-07-02 19:05:31 +00:00
9e253d409a immich(frame-emo): show photos from the last 365 days (was 730)
Emil asked his Sofia Portal Mini photo-frame to show only the past
year of photos rolling from today, instead of the last two years.
Changes ImagesFromDays 730 -> 365 in the frame-emo Settings.yml.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 19:05:31 +00:00
Viktor Barzin
4c532dbf97 devvm containment: drop the MemoryHigh throttle band, straight to MemoryMax OOM
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
t3.viktorbarzin.me went down 2026-07-02 15:42-16:35 UTC: an agent-spawned
12.3G ugrep plateaued inside t3-serve@wizard's MemoryHigh(12G)..MemoryMax(16G)
band. With MemorySwapMax=0 its anon pages were unreclaimable, so the kernel
throttled every task in the cgroup indefinitely (memory.pressure full ~80%,
oom_kill never fired) - the t3 event loop starved, the accept queue rotted,
and the terminal was dead until the hog was SIGKILLed by hand.

The 2026-06-22 design assumed 'throttle to a crawl, then OOM locally'; a hog
that stabilises between high and max never OOMs, so the throttle band is a
livelock zone, not a safety layer. Viktor asked to close that gap: MemoryHigh
is now explicitly infinity on all three work cgroup definitions (t3-serve@
unit, user-<uid>.slice drop-in, docker.slice) so a runaway is cgroup-OOM-
killed at MemoryMax immediately - OOMPolicy=continue already keeps the t3
server alive when a child dies. MemoryMax/MemorySwapMax=0/earlyoom unchanged.
Applied live to the devvm the same day (daemon-reload + runtime set-property
on running cgroups, no session restarts). Post-mortem addendum + runbook
updated in the same commit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 16:59:38 +00:00
Viktor Barzin
684ca4527c docs(CLAUDE.md): T4 now has a VRAM budget + watchdog (ADR-0016, dry-run); note llama-swap budget miscalibration
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Session wrap-up doc sync: the Immich note still claimed the shared T4 had no
VRAM isolation. Record the gpumem budget/watchdog shipped earlier today, that
the watchdog is observe-only, and that budgets need a retune (llama-swap's
real 16k-ctx resident is ~7GB, not 4.35) before arming.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 15:20:06 +00:00
Viktor Barzin
21afae85c9 dawarich: dedicated 100/1000 Traefik rate limit (default 10/50 429'd page loads)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor saw dawarich throwing 429s through Traefik and asked to loosen
the burst for it. The access log confirms the burst pattern: one page
load fires the whole fingerprinted-asset tail (SVG store badges,
favicons, webmanifest) from a single client IP and trips the default
10 req/s / burst 50 limiter (repro: 80 parallel GETs -> 28x 429).
Same remedy as ha-sofia, ActualBudget, noVNC, tripit, health and
authentik: dedicated dawarich-rate-limit middleware (average 100 /
burst 1000) + skip_default_rate_limit on the dawarich ingress. Also
updates the networking.md middleware enumerations (adding the
previously undocumented tripit/health limiters alongside dawarich).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 15:03:08 +00:00
Viktor Barzin
91d0213d1a Merge remote-tracking branch 'forgejo/master' into wizard/excalidraw-export-rename
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build excalidraw-library / build (push) Has been cancelled
2026-07-02 14:29:34 +00:00
Viktor Barzin
8fc657f431 excalidraw: migrate image build to GHA -> private ghcr (ADR-0002)
The image was still built by hand and pushed to DockerHub (v1..v4),
predating the all-builds-off-infra doctrine; Viktor chose to move it
onto the standard pipeline while shipping the export/rename feature
rather than keep the manual flow.

Mirrors the k8s-portal pattern: .github/workflows/build-excalidraw.yml
(go test + buildx linux/amd64, pushes ghcr latest+sha), excalidraw ns
added to the Kyverno ghcr-credentials allowlist (package is PRIVATE),
deployment now pins ghcr :latest with pullPolicy Always + pull secret,
Keel force/match-tag/5m annotations seed the metadata (live values win
via ignore_changes). DockerHub viktorbarzin/excalidraw-library:v4 stays
frozen as the rollback image. Docs: ci-cd.md + .claude/CLAUDE.md image
lists updated (also backfilled the missing k8s-portal rows in ci-cd.md).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 14:29:23 +00:00
Viktor Barzin
1cbc1e962b excalidraw: native export menu + drawing rename
Users couldn't see Excalidraw's built-in Save as / Export image options:
the app's custom toolbar was drawn exactly on top of the native hamburger
menu button, hiding it. Removed the overlay and integrated Back to
Library / Save now / Rename into the native menu, so the native export
formats (.excalidraw file, PNG, SVG, clipboard) are now reachable.
Viktor asked for exports to work via the native Excalidraw feature and
for drawings to be renameable by clicking their name.

Rename: new PATCH /api/drawings/{id} endpoint (server-side name
sanitization, 409 on conflict) + click-to-rename title pill in the
editor (updates URL in place) + Rename button/modal in the dashboard.
Existing GET/PUT/DELETE semantics unchanged for API compatibility
(emo's upload pipeline). Added main_test.go (httptest) covering rename
+ existing handler behavior; dashboard rows now DOM-built (XSS-safe).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 14:29:10 +00:00
Viktor Barzin
d94f267c93 immich: upgrade v2.7.5 → v3.0.0 (postgres → vectorchord 0.4.3, frames → immich_v3 tag)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to upgrade Immich to the just-released v3.0.0 (release notes,
migration guide and release discussion #29439 reviewed — no config-breaking
changes for this stack: we already use the split MACHINE_LEARNING_PRELOAD
vars, don't set DB_VECTOR_EXTENSION, OAuth goes through Authentik over
HTTPS, and the GPU node's CPU meets the new x86-64-v2 requirement).

The Immich Postgres image moves to VectorChord 0.4.3 to match the upstream
v3 reference stack (0.3.0 is still within v3's supported range '>=0.3 <2';
Immich upgrades the extension itself at startup). Both photo frames switch
to ImmichFrame's immich_v3 compatibility tag because every versioned
ImmichFrame release (≤ v1.0.33.0) crashes deserializing Immich v3 API
responses; repin to a versioned tag once upstream ships stable v3 support.

Deployment images are Keel-managed (KEEL_IGNORE_IMAGE, policy=patch), so
this commit is the source-of-truth record; the live rollout happens via
kubectl set image in the same session. Pre-upgrade pg_dumpall taken
(job postgresql-backup-pre-v3).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 14:18:22 +00:00
Viktor Barzin
6f03ccd1aa excalidraw: grant emo-browser SA port-forward for drawing uploads
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to fix emo's permission so his Claude can upload to the
Excalidraw service. emo's recent sessions show the documented upload
recipe (kubectl port-forward svc/draw + X-Authentik-Username header,
from his ~/.claude/CLAUDE.md) failing with:

  pods/portforward forbidden for system:serviceaccount:chrome-service:emo-browser
  in namespace excalidraw

because his default kubeconfig is the read-only emo-browser SA (its
port-forward grant covers only chrome-service) and his old admin
kubeconfig at /home/emo/code/config expired and was removed.

Add a namespace-scoped Role (pods/portforward create) + RoleBinding for
that SA in the excalidraw namespace, mirroring the 2026-06-28
chrome-service grant. Trade-off (any-user drawings via the trusted
username header) documented in the file and accepted.

Also record the grant in docs/architecture/chrome-service.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 11:08:28 +00:00
Viktor Barzin
88c86e2109 ci: Slack-notify failed pipeline runs only
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor doesn't want a Slack message for every CI run — only failures.
The infra apply pipeline posted a status line to #general on every push,
and the renew-tls / postmortem-todos / registry-config-sync /
pve-nfs-exports-sync crons posted on every scheduled run (~30+ routine
messages a week). Now: the apply pipeline's success post is gone
(notify-failure already covers failures), all cron notifies are
status:[failure] with explicit FAILED texts, and drift-detection is
silent when all stacks are clean (still posts drift findings and errors,
and gains a hard-failure catch step it previously lacked). Kept:
notify-nonadmin-push (org audit feed) and the actionable provision-user
post. Per-app deploy template in ci-cd.md updated to match.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 07:27:43 +00:00
Viktor Barzin
a64d2ba2b9 upgrades: fix hourly gotenberg error + cap update notifications at weekly
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor was getting upgrade-error Slack messages every hour and wants
update notifications at most weekly. Root cause of the errors: Keel kept
trying to roll gotenberg 8.25->8.25.1 in paperless-ngx but kyverno's
require-trusted-registries denied it — gotenberg/* (and apache/*, which
tika will hit next) were never allowlisted, and Keel's Slack notifier at
info level re-posted the identical failure to #general on every hourly
poll since Jun 28.

Changes: allowlist gotenberg/* + apache/* so the patch applies cleanly;
disable Keel's direct Slack notifier and replace failure visibility with
a KeelUpdateFailing Loki-ruler alert (alert-on-change: one notification
plus the daily digest, never an hourly drip); remove diun's Slack
notifier whose default message @channel-pinged #image-updates for every
new upstream tag every 6h (the n8n upgrade-agent webhook feed is
untouched). The k8s upgrade report is already weekly (Mon 06:07 UTC).
Paperless-ngx itself stays paused (keel policy=never, user-managed) while
the ingest runs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 07:16:50 +00:00
Viktor Barzin
5d5d9752cb guard: ignore + git-crypt kubeconfig files so they can't leak to the public mirror
All checks were successful
ci/woodpecker/push/default Pipeline was successful
A GitGuardian audit of the infra repo showed the recent alerts were test
fixtures (false positives), but surfaced a real historical leak: a
cluster-admin kubeconfig was once committed as stacks/f1-stream/.../.config
(now expired, reachable only via a GitHub PR ref). The .gitignore already had
a `config` rule for kubeconfigs but missed the dotfile form `.config` — which
is exactly how that file slipped onto the public mirror.

Close the gap in two layers:
- .gitignore: also ignore `.config`, `kubeconfig`, `*.kubeconfig`,
  `admin.conf`, `.kube/` so they're never staged by accident.
- .gitattributes: route `.config`, `kubeconfig`, `*.kubeconfig`, `admin.conf`
  through git-crypt so a force-add or rename still lands as ciphertext (never
  plaintext) on the public GitHub mirror.

No tracked files match these names today, so there is zero retroactive impact
— purely forward-looking prevention.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 07:14:58 +00:00
Viktor Barzin
dab307f9f8 Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-07-02 05:39:15 +00:00
Viktor Barzin
f1e81772d5 broker-sync: repoint image to ghcr (was frozen on pre-migration DockerHub)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The nightly ibkr sync failed with 'No such command ibkr': every broker-sync
CronJob still pulled viktorbarzin/broker-sync:latest from DockerHub, which
nothing has pushed to since the ADR-0002 move to GHA->ghcr on 2026-06-13 —
the jobs were silently running a frozen pre-ibkr build. The migration had
allowlisted only the wealthfolio namespace for the private
ghcr.io/viktorbarzin/wealthfolio-sync image, so broker-sync also lacked
pull credentials. Repoint the image, add ghcr-credentials imagePullSecrets
to all eight CronJobs, and allowlist the broker-sync namespace (wealthfolio
stays — its own monthly sync pulls the same image). Related: code-9ko8.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 05:31:00 +00:00
Viktor Barzin
ac41e7c017 nvidia: run advertise-gpumem provisioner under bash (dash rejects pipefail)
First apply of ADR-0016 failed: terraform local-exec defaults to /bin/sh,
which on Ubuntu is dash — 'set -euo pipefail' exits 2 before running kubectl.
Pin the interpreter to bash. Everything else in the gpumem apply succeeded.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 05:21:47 +00:00
Viktor Barzin
968b2b9c64 Merge remote-tracking branch 'origin/master' into wizard/gpu-vram-budget 2026-07-02 05:18:34 +00:00
Viktor Barzin
a12b09af04 broker-sync: pin data-mounting CronJobs to k8s-node4 (stop nightly RWO wedge)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
All broker-sync CronJobs share one RWO proxmox-lvm volume. With free
scheduling the nightly 02:00-04:15 runs land on different nodes, forcing
a detach/attach cycle whose QMP hotplug intermittently ghost-attaches on
disk-heavy VMs — every job then sits in ContainerCreating for hours
(happened 2026-06-30, 07-01 and again 07-02; fires
PodsStuckContainerCreating and skips the day's trade syncs). Pinning all
seven volume-mounting jobs to k8s-node4 (fewest CSI disks, 11) makes the
volume attach once and stay put — no hotplug dance, no wedge.
version_probe mounts nothing and stays unpinned. Durable fix for the
recurrence tracked in beads code-9ko8.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 05:16:38 +00:00
Viktor Barzin
3c85af2dc2 fire-countdown dashboard: SQL guards + tax regime + honesty fixes
All checks were successful
ci/woodpecker/push/default Pipeline was successful
From the flaw-hunt workflow (all verified):
- Projected-FIRE-date panels (solo/household/family) now guard savings £/yr:
  0 / empty / negative all render "Set savings £/yr" instead of a blank tile,
  a SQL error, or a nonsensical past date ("Jan 1849"). Verified across cases.
- New "Tax regime" panel surfaces the per-country jurisdiction — 14/22 countries
  fall back to the neutral 'nomad' 1% assumption, which was previously invisible.
- Intro no longer hard-codes "£139k pension" (contradicted the £328k tranche
  panel); pension value is now only shown data-bound in the tranche panel.
- Intro adds caveats: Anca's spend is an estimate (pending live re-pull), and
  non-modelled countries use the nomad tax fallback.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 22:44:17 +00:00
Viktor Barzin
339f5d89b9 onlyoffice: decommission (stack destroyed, dir removed)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The document server had been deliberately scaled to 0/0 for 184 days, but
its ingress kept the uptime-kuma monitors alive, so 'onlyoffice down'
showed up in every daily alert digest. Viktor approved tearing it down.
terragrunt destroy ran clean (11 resources) before this commit; the kuma
monitors auto-prune with the ingress. Also drops the onlyoffice/* image
prefix from the kyverno trusted-registries allowlist, the service-catalog
rows, and updates the nextcloud collabora comment. Document data (if any)
remains on the PVE NFS share.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-01 22:35:22 +00:00
Viktor Barzin
3c476dab32 postiz+portal: remove broken alert sources (stale backup CronJob, bogus scrape annotations)
Viktor is getting daily Slack alert noise; these two were the recurring
generators. The postiz-postgres-backup CronJob still dumped from the old
in-namespace postiz-postgresql service that was removed in the CNPG
migration (2026-06-28) — it failed every night at 03:00 and re-fired
BackupCronJobFailed each day. The postiz DB now lives on the shared CNPG
cluster and is already covered by the dbaas per-db dumps, so the CronJob
(and its NFS backup volume) is redundant and removed rather than repaired.

portal-stt/portal-tts advertised prometheus.io scrape annotations that
never worked: the deployed Speaches build 404s /metrics, and openai-edge-tts
has no metrics at all (its annotation pointed at a JSON endpoint, which
fails exposition parsing regardless). Both produced a permanently firing
ScrapeTargetDown. Annotations removed until the apps actually serve metrics.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-01 22:35:21 +00:00
Viktor Barzin
5a312563c6 monitoring/wealth: dash the in-progress year on the hourly-rate panel
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The current, still-accruing calendar year read misleadingly high (e.g. 2026
at 5 months showed £149/h gross, above all of 2025) because the full-year
bonus - paid every March - plus front-loaded quarterly RSU vests get divided
by only the months worked so far. It settles lower as the year completes.

Split each line into a solid series (complete years) and a dashed series
(the latest, still-accruing year), so the provisional point is visually
flagged. The split auto-detects the in-progress year (latest year with
< 12 months of payslips), so it needs no per-year maintenance. Panel
description now explains the caveat.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:45:51 +00:00
Viktor Barzin
28984dda9a monitoring/wealth: add per-year effective hourly-rate panel (gross vs net)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted to see, on the wealth dashboard, the hourly wage he earned
each year - both gross and net - with year on the X axis.

New timeseries (line) panel "Effective hourly rate - gross vs net":
- hourly = annual pay / hours worked; hours = contractual 40h/week
  (2,080h per full year, confirmed from the Facebook/Meta UK offer letter:
  Mon-Fri 09:00-18:00 less a 1h lunch), prorated by the months actually
  worked so partial years (2019, 2020, 2026) read correctly.
- Gross = gross_pay incl. notional RSU vest; Net = take-home.
- timeFrom 10y so all years show under the dashboard's default 180d range.

Source data: a duplicate March-2023 payslip (Paperless doc 347, a re-upload
of doc 33) was removed separately, so 2023 is no longer double-counted; this
also corrects the existing net-pay panel.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:28:46 +00:00
Viktor Barzin
82371d1ef8 dbaas/mysql: innodb_doublewrite=DETECT_ONLY to halve page-flush writes
All checks were successful
ci/woodpecker/push/default Pipeline was successful
MySQL device-write investigation (code-oflt): after the nextcloud webcal
throttle settled (the earlier 3.4-8.8 MB/s were post-restart transients),
MySQL is ~1.74 MB/s at the InnoDB level — and HALF of that (~0.86 MB/s,
~55 pages/s) is the doublewrite buffer writing every flushed page twice.
Redo is negligible (0.01 MB/s), no temp-table spilling.

Set innodb_doublewrite=DETECT_ONLY (dynamic, no restart; persisted in the
cnf): InnoDB stops writing full page CONTENT to the doublewrite buffer
(~halves MySQL's page-flush writes on the IOPS-bound sdc) but keeps
torn-page DETECTION metadata — a crash-torn page is flagged on recovery
(restore from the daily mysqldump) rather than silently corrupt. Chosen
over full OFF: same write saving, keeps detection, and OFF requires a
shutdown ("cannot change to OFF if doublewrite is enabled"). Acceptable
risk given the PERC BBU cache + UPS (in-flight writes complete on power
loss) + daily per-db backups.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 08:47:09 +00:00
Viktor Barzin
fbae573664 state(dbaas): update encrypted state 2026-06-30 08:46:45 +00:00
Viktor Barzin
71501be408 nodes: journald -> volatile (RAM) to cut sdc write-IOPS
Some checks failed
ci/woodpecker/push/default Pipeline failed
Node "container churn" investigation (code-oflt): container logs (~30 KB/s)
and overlayfs (~17 KB/s) are negligible; the node OS-disk churn is ext4
journal (jbd2) metadata writes driven mostly by journald's continuous
appends. node4 + node5 had drifted to uncapped persistent journald (4 GB
each, ~100 KB/s); master/node1-3 were correctly capped at 500M.

Node + pod journals already ship to Loki (alloy loki.source.journal), so
on-disk journald is pure write-IOPS overhead on the IOPS-bound sdc. Switch
journald to Storage=volatile (RAM, RuntimeMaxUse=200M) fleet-wide:
- cloud_init.yaml: drop-in 90-oflt-volatile.conf for new nodes (replaces
  the old persistent seds).
- running nodes (master + node1-5): pushed the same drop-in via qm guest
  exec + journald restart + cleared /var/log/journal.

Verified node5: OS-disk writers jbd2/sda1-8 931->46 KB/s, systemd-journal
gone (~94% drop); ~4 GB freed each on node4/node5. Logs stay queryable in
Loki. Trade-off: a hard crash loses the last unshipped journal.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 08:15:38 +00:00
Viktor Barzin
74819d4061 feat(nvidia): GPU VRAM budget + watchdog to stop T4 overallocation
The single time-sliced Tesla T4 has no per-tenant memory isolation, so its
~9 GPU workloads can collectively overallocate VRAM. On 2026-06-02 immich-ml's
onnxruntime arena grew to 10.7 GB and silently starved llama-swap, breaking
recruiter-responder for ~5h. Viktor asked for memory protection so we don't
overallocate GPU memory, and chose to do it at the scheduling level (no
device-plugin swap) after weighing HAMi and MPS.

Make the scheduler VRAM-aware and add runtime teeth, all repo-native,
time-slicing untouched:
- Advertise a node extended resource viktorbarzin.me/gpumem (~14000 MiB) via a
  reconcile null_resource (immediate, apply-time) + hourly re-assert CronJob.
- Each always-on GPU tenant declares a gpumem budget (immich-ml 3000,
  llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500; sum 13300
  <= advertised) so the scheduler refuses to co-schedule past the card
  (overflow -> Pending).
- gpu-vram-watchdog Deployment recycles the biggest over-budget tenant ONLY when
  actual free VRAM < floor. Ships DRY_RUN=true (observe-then-enforce); flip to
  false after a few cycles look right.
- Prometheus alerts GPUVRAMLow / GPUVRAMTelemetryDown / GPUVRAMWatchdogDown --
  the 2026-06-02 post-mortem's never-built free-VRAM follow-up.
- Docs: ADR-0016 (records why HAMi/MPS were rejected), CONTEXT.md GPU-sharing
  glossary; fix the stale "whole T4 / scale immich-ml to 0" llama-cpp comment.

HITL GPU-node change: apply nvidia FIRST (advertise gpumem), verify the node
shows the capacity, THEN the consumer stacks -- the cutover bounces GPU pods.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:57:40 +00:00
Viktor Barzin
1afe41880e docs: MySQL buffer-pool/limit + nextcloud webcal throttle; VCT drift fixed
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Reflect the code-oflt MySQL write-reduction work (commit 82c9e69b + the
nextcloud webcal app-data throttle):
- MySQL row: buffer pool 1->2Gi, mem limit 4->6Gi, and the nextcloud
  webcal calendar churn that was ~60% of MySQL's writes (now throttled
  in oc_calendarsubscriptions.refreshrate — app-data, can regress).
- CNPG apply-gotcha note: the mysql_standalone VCT-annotation drift no
  longer needs -target dodging (now ignore_changes'd on the STS VCT).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:56:04 +00:00
Viktor Barzin
82c9e69b77 dbaas/mysql: 2Gi InnoDB buffer pool + 6Gi limit + ignore VCT drift
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Cut MySQL's write-IOPS footprint on the contended PVE sdc HDD (code-oflt).
Standalone MySQL was the #1 sdc bandwidth writer (~2.8-3.5 MB/s). Live
attribution found ~60% of its writes were nextcloud webcal calendar churn
(throttled separately at the app layer); this addresses write amplification
on the remainder:

- innodb_buffer_pool_size 1Gi -> 2Gi: the pool was too small for the ~5.6Gi
  hot set (Innodb_buffer_pool_wait_free=1.78M = threads stalling for a free
  page -> constant flush-to-make-room write IOPS).
- container memory limit 4Gi -> 6Gi (requests 3->4Gi): the pod was already
  at ~3.7Gi/4Gi (near OOM) with the 1Gi pool, so the 2Gi pool needs the
  headroom. One-time MySQL pod restart to apply.
- ignore_changes on the StatefulSet volume_claim_template: the VCT is
  immutable post-creation and pvc-autoresizer rewrites its annotations on
  the live object, so TF's desired VCT could never apply and errored every
  broad dbaas apply. Ignoring it (autoresizer owns PVC sizing) removes the
  long-standing need to -target around it.

Applied + verified live: buffer_pool=2.0GiB, limit=6Gi, pod healthy,
24 DBs reachable, restart clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:55:18 +00:00
Viktor Barzin
29bf275cef state(dbaas): update encrypted state 2026-06-30 07:53:48 +00:00
Viktor Barzin
308a174ad6 docs(networking): record MetalLB .204 (frigate-rtsp go2rtc) allocation
All checks were successful
ci/woodpecker/push/default Pipeline was successful
PR #17 moved frigate-rtsp to a dedicated MetalLB LoadBalancer IP
(10.0.20.204) exposing RTSP 8554 + WebRTC 8555, but the networking doc
still listed only four IPs in use / three dedicated. Add the .204 row to
the allocation table, bump the counts (five in use, four dedicated, 5-IP
layout), and add a LB-IP renumber-checklist entry for the out-of-band
consumers (the go2rtc WebRTC candidate on the frigate config PVC and the
HA-sofia rtsp_url_template). Note go2rtc cannot use a DNS name in ICE
candidates, so the Service annotation is the single source of truth.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:42:27 +00:00
469cdd7507 frigate: expose go2rtc on a dedicated MetalLB LB IP (RTSP 8554 + WebRTC 8555)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
HA live video from the cluster Frigate hangs/fails because the only path
to Frigate is the Traefik HTTP(S) ingress (frigate-lan -> 10.0.20.203),
which cannot carry RTSP or WebRTC. The container already listens on
8554+8555 but only RTSP had a Service (NodePort), and WebRTC (8555) was
never exposed. Convert frigate-rtsp to a LoadBalancer on a dedicated MetalLB
IP (.204, ETP=Local, pod pinned to the GPU node) carrying RTSP 8554 +
WebRTC 8555 (TCP+UDP), giving HA Sofia + LAN browsers a stable cross-VLAN
endpoint for native HLS/WebRTC live (parity with the Hikvision NVR).
Companion non-Terraform steps are in the PR body.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:15:22 +00:00