infra

Author	SHA1	Message	Date
Viktor Barzin	316cdb7441	docs: valia-sites runbook + dns.md CM mechanism + service-catalog entries All checks were successful ci/woodpecker/push/default Pipeline was successful Details Runbook covers add/update/retire (one map entry; internal DNS now cleans up after itself), content rules for Valia's folders, and the failure modes incl. both token re-mint paths. dns.md superset-rule paragraph now describes the declarative ConfigMap reconcile instead of hand-added static CNAMEs. Catalog: new valia-sites row; stem95su row notes its Pages cutover is parked on the 42.9MB stem_video.mp4 exceeding the 25MB Pages per-file cap. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 12:46:24 +00:00
Viktor Barzin	4a3c8287c3	Merge remote-tracking branch 'forgejo/master' into wizard/valia-sites All checks were successful ci/woodpecker/push/default Pipeline was successful Details	2026-07-03 12:43:28 +00:00
Viktor Barzin	e0991853e4	valia-sites: 25MB Pages-limit guard; cloudflared: drop removed{} (CI TF <1.7) Two fixes from the first live runs. (1) The sync job now skips a whole site when any file exceeds Cloudflare Pages' 25MB per-file cap, leaving current serving untouched — stem95su's stem_board.html references a 42.9MB stem_video.mp4, which made every run fail; the guard turns that into a loud skip so bridge keeps syncing. (2) The CI terraform is older than 1.7 and rejects removed{} blocks anywhere (pipelines 461/464), so the bridge record handoff was completed with a one-time manual 'tg state rm module.cloudflared.cloudflare_record.bridge_pages' from the main checkout; the block is deleted and the module comment records the manual step. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 12:43:13 +00:00
Viktor Barzin	348f64d34d	ADR-0017: add physical-cabling diagram (wires only) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked for one diagram showing just the physical connections between nodes, separate from the logical/VLAN topology: ISP->AX6000, the in-wall apartment->garage run into P1, 4G router (cellular OOB), UPS mgmt, the PoE cat6 to the camera, the LAN1 cable to eno1, dark eno2 fallback + free eno3/4, iDRAC on shared-LOM, and the note that everything else on the R730 is virtual. Referenced from the ADR next to the logical SVG. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 12:40:29 +00:00
Viktor Barzin	126cf4c88e	Merge origin/master into wizard/cctv-adr-trunk All checks were successful ci/woodpecker/push/default Pipeline was successful Details	2026-07-03 12:32:00 +00:00
Viktor Barzin	695e020111	cloudflared: move bridge removed{} to stack root — removed blocks are root-module-only Some checks failed ci/woodpecker/push/default Pipeline failed Details Pipeline 461 failed terraform init: the removed{} handoff block sat in the stack-local module, but Terraform only allows removed blocks in the root module. Same intent, correct position (from = module.cloudflared.cloudflare_record.bridge_pages, destroy=false). Without this the stale state entry would make the next cloudflared apply destroy the record valia-sites now owns. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 12:31:53 +00:00
Viktor Barzin	5d16a18cf4	ADR-0017: document trunk traffic semantics + ASCII topology While reviewing the single-switch design Viktor asked whether both the home LAN and the camera VLAN 'go via pfSense which forwards upstream' - a natural misreading a future reader would repeat. Added a section spelling out the vmbr0 fork: untagged home LAN is L2-bridged past pfSense (gateway stays the AX6000, rack outage does not affect it, OOB via 4G survives), while tagged-30 can only land on the dCCTV interface, making a pfSense bypass impossible by construction. Includes a compact ASCII topology for terminal readers alongside the SVG. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 12:31:48 +00:00
Viktor Barzin	8b80b4cc41	valia-sites: registry stack for Valia's Pages sites + declarative internal DNS (ADR-0018) Some checks failed Build valia-sites-sync / build (push) Waiting to run Details ci/woodpecker/push/default Pipeline failed Details Valia keeps asking Viktor to host 1-page sites from her Drive folders; this makes it one map entry. New stacks/valia-sites: per site a CF Pages project + custom domain + proxied CNAME (bridge adopted via import{}), a ConfigMap feed (valia-sites-dns) the technitium ingress-dns-sync script now reconciles internal CNAMEs from (add/update/REMOVE — fixes the add-only stale-record gotcha), and one shared 10-min CronJob that mirrors each Content folder (rclone, drive.readonly, stem95su's guards) and wrangler-deploys ONLY on manifest change (free-tier deploy cap). Scoped CF Pages token + shared rclone conf in secret/valia-sites; the Global API Key never enters a pod. cloudflared forgets bridge's record via removed{} (no destroy). stem95su is in the map dns-parked (manage_dns=false) until its cutover commit. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 12:28:06 +00:00
Viktor Barzin	5c42155b81	docs: Valia-sites domain language + ADR-0018 (off-infra Pages, in-cluster sync) Grill session with Viktor: his mother Valia will keep asking for 1-page site hosting, so the pattern is being made repeatable. Decisions: all Valia sites serve off-infra on Cloudflare Pages (survive homelab outages); one shared in-cluster CronJob mirrors her Drive folders every 10 min and redeploys on change; English subdomain names picked by Viktor; failed-Job-only visibility; stem95su migrates onto the pattern. CONTEXT.md gains Valia site / Content folder / Entry file; full rationale and rejected options in ADR-0018. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 12:17:45 +00:00
Viktor Barzin	e1bd111562	rename CF Pages site most.viktorbarzin.me -> bridge.viktorbarzin.me All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked to rename the 'мост' school static site to 'bridge'. New Cloudflare Pages project 'bridge' (bridge-cv2.pages.dev) already deployed and the custom domain attached; this renames the public CNAME (TF resource most_pages -> bridge_pages, destroy+create swaps the record) and the internal split-horizon static CNAME in the ingress-dns-sync CronJob. The old 'most' Pages project and the stale internal 'most' record are removed out-of-band after this applies. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 10:52:30 +00:00
Viktor Barzin	7dd80b6c7c	technitium: mirror most.viktorbarzin.me into the internal zone (CF Pages site) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The internal split-horizon zone is authoritative for viktorbarzin.me, so the new Cloudflare Pages site (most.viktorbarzin.me, added for Viktor's 'мост' school static site) NXDOMAINed for every internal client — LAN, VLANs and pods — while resolving fine externally. Per the superset rule, add it as a static CNAME (-> most-6if.pages.dev) in the ingress-dns-sync CronJob next to the mail-auth records, and document the off-infra-site case in dns.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 10:10:46 +00:00
Viktor Barzin	217a54be9d	cloudflared: add most.viktorbarzin.me CNAME for Cloudflare Pages site All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked to host a static HTML site (the 'мост' school project, ОбУ „Отец Паисий", pulled from his Google Drive) on Cloudflare Pages with a custom domain, as a try-out of Pages hosting. The site content is deployed off-infra via wrangler to the Pages project 'most' (most-6if.pages.dev); this CNAME points most.viktorbarzin.me at it. The custom domain is already attached to the Pages project and is waiting on this DNS record to validate. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 10:06:33 +00:00
Viktor Barzin	be80ef23bb	ADR-0017 rev 3: single switch — PE replaces the SG105E, CCTV rides a VLAN-30 trunk on the LAN1 cable Viktor prefers not running two switches, so the TL-SG105PE takes over all rack duties (apartment uplink, 4G, UPS, camera PoE) and the CCTV segment moves onto a managed tagged trunk over the existing LAN1 cable: pfSense net3 re-pointed from vmbr2 to vmbr0 tag=30 (applied live; same MAC so vtnet3/dCCTV survived untouched). This is safe where the original 802.1Q rejection was not, because the managed switch is the only device on eno1 and polices VLAN-30 membership. eno2/vmbr2 kept dormant as the documented fallback. Old SG105E retires to cold spare; PE inherits 192.168.1.6. Glossary Segment term updated (all three segments are now bridge-tags feeding untagged pfSense vNICs). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 09:15:52 +00:00
Viktor Barzin	4082934bc1	Merge origin/master into wizard/cctv-two-switch All checks were successful ci/woodpecker/push/default Pipeline was successful Details	2026-07-03 08:37:34 +00:00
Viktor Barzin	e11bd6e893	ADR-0017 rev 2: two switches — the PE is a dedicated CCTV island, no VLAN table anywhere Viktor asked to verify free ports on the garage switch (192.168.1.6) before finalizing. Logging into it showed it is NOT the TL-SG105PE from the plan but a pre-existing non-PoE TL-SG105E with 4 of 5 ports in use (apartment uplink, R730 LAN1, 4G router, UPS) - the single-shared-switch port-VLAN design written earlier today was based on conflating the two devices. Corrected: the new TL-SG105PE carries ONLY camera + eno2 uplink (mgmt 10.0.30.6 inside the segment), the old switch is untouched, and no VLAN config exists anywhere. ADR, topology SVG and networking.md updated to match. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 08:37:15 +00:00
Viktor Barzin	08fb65827c	tripit: set PLACE_PHOTO_PROVIDER=wikipedia — real place preview photos All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked for place photos on the tripit Trip board. The app-side work (add-time photo fetch, board place cards) shipped in tripit v0.106.0, but prod never set PLACE_PHOTO_PROVIDER, so the fake provider would store placeholder PNGs for every hand-added place. Same class of fake-default gap as PLACE_RESOLVER_MODE (set explicitly for the same reason); the ADR-0035 rollout had left both the env flip and its backfill cron undone. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 21:57:21 +00:00
Viktor Barzin	b761701994	ADR-0017: add network topology diagram (SVG) next to the decision All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked for a reviewable network visualization committed alongside the CCTV-segment ADR. Hand-drawn SVG (renders on Forgejo, validated palette): physical path camera -> TL-SG105PE port-VLANs -> eno2/vmbr2 -> pfSense dCCTV, the firewall flows (Frigate RTSP, ha-sofia ISAPI/RTSP, NTP-only egress, default deny), and the dashed camera-day steps (patch cable, cat6 run, AX6000 static route). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 20:25:28 +00:00
Viktor Barzin	248e186dce	CCTV segment (dCCTV 10.0.30.0/24) on a dedicated pfSense leg for the garage camera All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor and emo are adding the first owned camera at the Sofia site (HiLook IPC-T241H-C watching the garage / server rack). Viktor asked to finalize emo's plan; the grilling session resolved emo's five open decisions and replaced the doc's 802.1Q-trunk idea with the site idiom: a dedicated physical leg (R730 eno2 -> vmbr2 -> pfSense net3 = dCCTV 10.0.30.1/24), port-based VLAN split on the shared TL-SG105PE, camera default-deny with NTP-only egress, Frigate + ha-sofia as the only consumers. The PVE bridge, pfSense interface, Kea subnet and firewall rules were applied live this session (hand-managed hosts, backed up). This commit records the decision (ADR-0017), the glossary terms (Segment / CCTV segment), the as-built architecture doc, and bumps Frigate's ADR-0016 VRAM budget 2000 -> 2300 MiB for the upcoming NVDEC stream. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 20:01:45 +00:00
viktor	3a5194c9d4	Merge pull request 'immich(frame-emo): show photos from the last 365 days (was 730)' (#18) from emo/frame-emo-1year into master All checks were successful ci/woodpecker/push/default Pipeline was successful Details Reviewed-on: #18	2026-07-02 19:05:31 +00:00
ebarzin	9e253d409a	immich(frame-emo): show photos from the last 365 days (was 730) Emil asked his Sofia Portal Mini photo-frame to show only the past year of photos rolling from today, instead of the last two years. Changes ImagesFromDays 730 -> 365 in the frame-emo Settings.yml. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-02 19:05:31 +00:00
Viktor Barzin	4c532dbf97	devvm containment: drop the MemoryHigh throttle band, straight to MemoryMax OOM All checks were successful ci/woodpecker/push/postmortem-todos Pipeline was successful Details ci/woodpecker/push/default Pipeline was successful Details t3.viktorbarzin.me went down 2026-07-02 15:42-16:35 UTC: an agent-spawned 12.3G ugrep plateaued inside t3-serve@wizard's MemoryHigh(12G)..MemoryMax(16G) band. With MemorySwapMax=0 its anon pages were unreclaimable, so the kernel throttled every task in the cgroup indefinitely (memory.pressure full ~80%, oom_kill never fired) - the t3 event loop starved, the accept queue rotted, and the terminal was dead until the hog was SIGKILLed by hand. The 2026-06-22 design assumed 'throttle to a crawl, then OOM locally'; a hog that stabilises between high and max never OOMs, so the throttle band is a livelock zone, not a safety layer. Viktor asked to close that gap: MemoryHigh is now explicitly infinity on all three work cgroup definitions (t3-serve@ unit, user-<uid>.slice drop-in, docker.slice) so a runaway is cgroup-OOM-killed at MemoryMax immediately - OOMPolicy=continue already keeps the t3 server alive when a child dies. MemoryMax/MemorySwapMax=0/earlyoom unchanged. Applied live to the devvm the same day (daemon-reload + runtime set-property on running cgroups, no session restarts). Post-mortem addendum + runbook updated in the same commit. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 16:59:38 +00:00
Viktor Barzin	684ca4527c	docs(CLAUDE.md): T4 now has a VRAM budget + watchdog (ADR-0016, dry-run); note llama-swap budget miscalibration All checks were successful ci/woodpecker/push/default Pipeline was successful Details Session wrap-up doc sync: the Immich note still claimed the shared T4 had no VRAM isolation. Record the gpumem budget/watchdog shipped earlier today, that the watchdog is observe-only, and that budgets need a retune (llama-swap's real 16k-ctx resident is ~7GB, not 4.35) before arming. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-02 15:20:06 +00:00
Viktor Barzin	21afae85c9	dawarich: dedicated 100/1000 Traefik rate limit (default 10/50 429'd page loads) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor saw dawarich throwing 429s through Traefik and asked to loosen the burst for it. The access log confirms the burst pattern: one page load fires the whole fingerprinted-asset tail (SVG store badges, favicons, webmanifest) from a single client IP and trips the default 10 req/s / burst 50 limiter (repro: 80 parallel GETs -> 28x 429). Same remedy as ha-sofia, ActualBudget, noVNC, tripit, health and authentik: dedicated dawarich-rate-limit middleware (average 100 / burst 1000) + skip_default_rate_limit on the dawarich ingress. Also updates the networking.md middleware enumerations (adding the previously undocumented tripit/health limiters alongside dawarich). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 15:03:08 +00:00
Viktor Barzin	91d0213d1a	Merge remote-tracking branch 'forgejo/master' into wizard/excalidraw-export-rename Some checks failed ci/woodpecker/push/default Pipeline was successful Details Build excalidraw-library / build (push) Has been cancelled Details	2026-07-02 14:29:34 +00:00
Viktor Barzin	8fc657f431	excalidraw: migrate image build to GHA -> private ghcr (ADR-0002) The image was still built by hand and pushed to DockerHub (v1..v4), predating the all-builds-off-infra doctrine; Viktor chose to move it onto the standard pipeline while shipping the export/rename feature rather than keep the manual flow. Mirrors the k8s-portal pattern: .github/workflows/build-excalidraw.yml (go test + buildx linux/amd64, pushes ghcr latest+sha), excalidraw ns added to the Kyverno ghcr-credentials allowlist (package is PRIVATE), deployment now pins ghcr :latest with pullPolicy Always + pull secret, Keel force/match-tag/5m annotations seed the metadata (live values win via ignore_changes). DockerHub viktorbarzin/excalidraw-library:v4 stays frozen as the rollback image. Docs: ci-cd.md + .claude/CLAUDE.md image lists updated (also backfilled the missing k8s-portal rows in ci-cd.md). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 14:29:23 +00:00
Viktor Barzin	1cbc1e962b	excalidraw: native export menu + drawing rename Users couldn't see Excalidraw's built-in Save as / Export image options: the app's custom toolbar was drawn exactly on top of the native hamburger menu button, hiding it. Removed the overlay and integrated Back to Library / Save now / Rename into the native menu, so the native export formats (.excalidraw file, PNG, SVG, clipboard) are now reachable. Viktor asked for exports to work via the native Excalidraw feature and for drawings to be renameable by clicking their name. Rename: new PATCH /api/drawings/{id} endpoint (server-side name sanitization, 409 on conflict) + click-to-rename title pill in the editor (updates URL in place) + Rename button/modal in the dashboard. Existing GET/PUT/DELETE semantics unchanged for API compatibility (emo's upload pipeline). Added main_test.go (httptest) covering rename + existing handler behavior; dashboard rows now DOM-built (XSS-safe). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 14:29:10 +00:00
Viktor Barzin	d94f267c93	immich: upgrade v2.7.5 → v3.0.0 (postgres → vectorchord 0.4.3, frames → immich_v3 tag) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked to upgrade Immich to the just-released v3.0.0 (release notes, migration guide and release discussion #29439 reviewed — no config-breaking changes for this stack: we already use the split MACHINE_LEARNING_PRELOAD vars, don't set DB_VECTOR_EXTENSION, OAuth goes through Authentik over HTTPS, and the GPU node's CPU meets the new x86-64-v2 requirement). The Immich Postgres image moves to VectorChord 0.4.3 to match the upstream v3 reference stack (0.3.0 is still within v3's supported range '>=0.3 <2'; Immich upgrades the extension itself at startup). Both photo frames switch to ImmichFrame's immich_v3 compatibility tag because every versioned ImmichFrame release (≤ v1.0.33.0) crashes deserializing Immich v3 API responses; repin to a versioned tag once upstream ships stable v3 support. Deployment images are Keel-managed (KEEL_IGNORE_IMAGE, policy=patch), so this commit is the source-of-truth record; the live rollout happens via kubectl set image in the same session. Pre-upgrade pg_dumpall taken (job postgresql-backup-pre-v3). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 14:18:22 +00:00
Viktor Barzin	6f03ccd1aa	excalidraw: grant emo-browser SA port-forward for drawing uploads All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor asked to fix emo's permission so his Claude can upload to the Excalidraw service. emo's recent sessions show the documented upload recipe (kubectl port-forward svc/draw + X-Authentik-Username header, from his ~/.claude/CLAUDE.md) failing with: pods/portforward forbidden for system:serviceaccount:chrome-service:emo-browser in namespace excalidraw because his default kubeconfig is the read-only emo-browser SA (its port-forward grant covers only chrome-service) and his old admin kubeconfig at /home/emo/code/config expired and was removed. Add a namespace-scoped Role (pods/portforward create) + RoleBinding for that SA in the excalidraw namespace, mirroring the 2026-06-28 chrome-service grant. Trade-off (any-user drawings via the trusted username header) documented in the file and accepted. Also record the grant in docs/architecture/chrome-service.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 11:08:28 +00:00
Viktor Barzin	88c86e2109	ci: Slack-notify failed pipeline runs only All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor doesn't want a Slack message for every CI run — only failures. The infra apply pipeline posted a status line to #general on every push, and the renew-tls / postmortem-todos / registry-config-sync / pve-nfs-exports-sync crons posted on every scheduled run (~30+ routine messages a week). Now: the apply pipeline's success post is gone (notify-failure already covers failures), all cron notifies are status:[failure] with explicit FAILED texts, and drift-detection is silent when all stacks are clean (still posts drift findings and errors, and gains a hard-failure catch step it previously lacked). Kept: notify-nonadmin-push (org audit feed) and the actionable provision-user post. Per-app deploy template in ci-cd.md updated to match. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 07:27:43 +00:00
Viktor Barzin	a64d2ba2b9	upgrades: fix hourly gotenberg error + cap update notifications at weekly All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor was getting upgrade-error Slack messages every hour and wants update notifications at most weekly. Root cause of the errors: Keel kept trying to roll gotenberg 8.25->8.25.1 in paperless-ngx but kyverno's require-trusted-registries denied it — gotenberg/* (and apache/, which tika will hit next) were never allowlisted, and Keel's Slack notifier at info level re-posted the identical failure to #general on every hourly poll since Jun 28. Changes: allowlist gotenberg/ + apache/* so the patch applies cleanly; disable Keel's direct Slack notifier and replace failure visibility with a KeelUpdateFailing Loki-ruler alert (alert-on-change: one notification plus the daily digest, never an hourly drip); remove diun's Slack notifier whose default message @channel-pinged #image-updates for every new upstream tag every 6h (the n8n upgrade-agent webhook feed is untouched). The k8s upgrade report is already weekly (Mon 06:07 UTC). Paperless-ngx itself stays paused (keel policy=never, user-managed) while the ingest runs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 07:16:50 +00:00
Viktor Barzin	5d5d9752cb	guard: ignore + git-crypt kubeconfig files so they can't leak to the public mirror All checks were successful ci/woodpecker/push/default Pipeline was successful Details A GitGuardian audit of the infra repo showed the recent alerts were test fixtures (false positives), but surfaced a real historical leak: a cluster-admin kubeconfig was once committed as stacks/f1-stream/.../.config (now expired, reachable only via a GitHub PR ref). The .gitignore already had a `config` rule for kubeconfigs but missed the dotfile form `.config` — which is exactly how that file slipped onto the public mirror. Close the gap in two layers: - .gitignore: also ignore `.config`, `kubeconfig`, `.kubeconfig`, `admin.conf`, `.kube/` so they're never staged by accident. - .gitattributes: route `.config`, `kubeconfig`, `.kubeconfig`, `admin.conf` through git-crypt so a force-add or rename still lands as ciphertext (never plaintext) on the public GitHub mirror. No tracked files match these names today, so there is zero retroactive impact — purely forward-looking prevention. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 07:14:58 +00:00
Viktor Barzin	dab307f9f8	Merge remote-tracking branch 'origin/master' All checks were successful ci/woodpecker/push/default Pipeline was successful Details	2026-07-02 05:39:15 +00:00
Viktor Barzin	f1e81772d5	broker-sync: repoint image to ghcr (was frozen on pre-migration DockerHub) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The nightly ibkr sync failed with 'No such command ibkr': every broker-sync CronJob still pulled viktorbarzin/broker-sync:latest from DockerHub, which nothing has pushed to since the ADR-0002 move to GHA->ghcr on 2026-06-13 — the jobs were silently running a frozen pre-ibkr build. The migration had allowlisted only the wealthfolio namespace for the private ghcr.io/viktorbarzin/wealthfolio-sync image, so broker-sync also lacked pull credentials. Repoint the image, add ghcr-credentials imagePullSecrets to all eight CronJobs, and allowlist the broker-sync namespace (wealthfolio stays — its own monthly sync pulls the same image). Related: code-9ko8. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 05:31:00 +00:00
Viktor Barzin	ac41e7c017	nvidia: run advertise-gpumem provisioner under bash (dash rejects pipefail) First apply of ADR-0016 failed: terraform local-exec defaults to /bin/sh, which on Ubuntu is dash — 'set -euo pipefail' exits 2 before running kubectl. Pin the interpreter to bash. Everything else in the gpumem apply succeeded. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-02 05:21:47 +00:00
Viktor Barzin	968b2b9c64	Merge remote-tracking branch 'origin/master' into wizard/gpu-vram-budget	2026-07-02 05:18:34 +00:00
Viktor Barzin	a12b09af04	broker-sync: pin data-mounting CronJobs to k8s-node4 (stop nightly RWO wedge) All checks were successful ci/woodpecker/push/default Pipeline was successful Details All broker-sync CronJobs share one RWO proxmox-lvm volume. With free scheduling the nightly 02:00-04:15 runs land on different nodes, forcing a detach/attach cycle whose QMP hotplug intermittently ghost-attaches on disk-heavy VMs — every job then sits in ContainerCreating for hours (happened 2026-06-30, 07-01 and again 07-02; fires PodsStuckContainerCreating and skips the day's trade syncs). Pinning all seven volume-mounting jobs to k8s-node4 (fewest CSI disks, 11) makes the volume attach once and stay put — no hotplug dance, no wedge. version_probe mounts nothing and stays unpinned. Durable fix for the recurrence tracked in beads code-9ko8. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 05:16:38 +00:00
Viktor Barzin	3c85af2dc2	fire-countdown dashboard: SQL guards + tax regime + honesty fixes All checks were successful ci/woodpecker/push/default Pipeline was successful Details From the flaw-hunt workflow (all verified): - Projected-FIRE-date panels (solo/household/family) now guard savings £/yr: 0 / empty / negative all render "Set savings £/yr" instead of a blank tile, a SQL error, or a nonsensical past date ("Jan 1849"). Verified across cases. - New "Tax regime" panel surfaces the per-country jurisdiction — 14/22 countries fall back to the neutral 'nomad' 1% assumption, which was previously invisible. - Intro no longer hard-codes "£139k pension" (contradicted the £328k tranche panel); pension value is now only shown data-bound in the tranche panel. - Intro adds caveats: Anca's spend is an estimate (pending live re-pull), and non-modelled countries use the nomad tax fallback. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-01 22:44:17 +00:00
Viktor Barzin	339f5d89b9	onlyoffice: decommission (stack destroyed, dir removed) All checks were successful ci/woodpecker/push/default Pipeline was successful Details The document server had been deliberately scaled to 0/0 for 184 days, but its ingress kept the uptime-kuma monitors alive, so 'onlyoffice down' showed up in every daily alert digest. Viktor approved tearing it down. terragrunt destroy ran clean (11 resources) before this commit; the kuma monitors auto-prune with the ingress. Also drops the onlyoffice/* image prefix from the kyverno trusted-registries allowlist, the service-catalog rows, and updates the nextcloud collabora comment. Document data (if any) remains on the PVE NFS share. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-01 22:35:22 +00:00
Viktor Barzin	3c476dab32	postiz+portal: remove broken alert sources (stale backup CronJob, bogus scrape annotations) Viktor is getting daily Slack alert noise; these two were the recurring generators. The postiz-postgres-backup CronJob still dumped from the old in-namespace postiz-postgresql service that was removed in the CNPG migration (2026-06-28) — it failed every night at 03:00 and re-fired BackupCronJobFailed each day. The postiz DB now lives on the shared CNPG cluster and is already covered by the dbaas per-db dumps, so the CronJob (and its NFS backup volume) is redundant and removed rather than repaired. portal-stt/portal-tts advertised prometheus.io scrape annotations that never worked: the deployed Speaches build 404s /metrics, and openai-edge-tts has no metrics at all (its annotation pointed at a JSON endpoint, which fails exposition parsing regardless). Both produced a permanently firing ScrapeTargetDown. Annotations removed until the apps actually serve metrics. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-01 22:35:21 +00:00
Viktor Barzin	5a312563c6	monitoring/wealth: dash the in-progress year on the hourly-rate panel All checks were successful ci/woodpecker/push/default Pipeline was successful Details The current, still-accruing calendar year read misleadingly high (e.g. 2026 at 5 months showed £149/h gross, above all of 2025) because the full-year bonus - paid every March - plus front-loaded quarterly RSU vests get divided by only the months worked so far. It settles lower as the year completes. Split each line into a solid series (complete years) and a dashed series (the latest, still-accruing year), so the provisional point is visually flagged. The split auto-detects the in-progress year (latest year with < 12 months of payslips), so it needs no per-year maintenance. Panel description now explains the caveat. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 12:45:51 +00:00
Viktor Barzin	28984dda9a	monitoring/wealth: add per-year effective hourly-rate panel (gross vs net) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor wanted to see, on the wealth dashboard, the hourly wage he earned each year - both gross and net - with year on the X axis. New timeseries (line) panel "Effective hourly rate - gross vs net": - hourly = annual pay / hours worked; hours = contractual 40h/week (2,080h per full year, confirmed from the Facebook/Meta UK offer letter: Mon-Fri 09:00-18:00 less a 1h lunch), prorated by the months actually worked so partial years (2019, 2020, 2026) read correctly. - Gross = gross_pay incl. notional RSU vest; Net = take-home. - timeFrom 10y so all years show under the dashboard's default 180d range. Source data: a duplicate March-2023 payslip (Paperless doc 347, a re-upload of doc 33) was removed separately, so 2023 is no longer double-counted; this also corrects the existing net-pay panel. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 12:28:46 +00:00
Viktor Barzin	82371d1ef8	dbaas/mysql: innodb_doublewrite=DETECT_ONLY to halve page-flush writes All checks were successful ci/woodpecker/push/default Pipeline was successful Details MySQL device-write investigation (code-oflt): after the nextcloud webcal throttle settled (the earlier 3.4-8.8 MB/s were post-restart transients), MySQL is ~1.74 MB/s at the InnoDB level — and HALF of that (~0.86 MB/s, ~55 pages/s) is the doublewrite buffer writing every flushed page twice. Redo is negligible (0.01 MB/s), no temp-table spilling. Set innodb_doublewrite=DETECT_ONLY (dynamic, no restart; persisted in the cnf): InnoDB stops writing full page CONTENT to the doublewrite buffer (~halves MySQL's page-flush writes on the IOPS-bound sdc) but keeps torn-page DETECTION metadata — a crash-torn page is flagged on recovery (restore from the daily mysqldump) rather than silently corrupt. Chosen over full OFF: same write saving, keeps detection, and OFF requires a shutdown ("cannot change to OFF if doublewrite is enabled"). Acceptable risk given the PERC BBU cache + UPS (in-flight writes complete on power loss) + daily per-db backups. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 08:47:09 +00:00
Viktor Barzin	fbae573664	state(dbaas): update encrypted state	2026-06-30 08:46:45 +00:00
Viktor Barzin	71501be408	nodes: journald -> volatile (RAM) to cut sdc write-IOPS Node "container churn" investigation (code-oflt): container logs (~30 KB/s) and overlayfs (~17 KB/s) are negligible; the node OS-disk churn is ext4 journal (jbd2) metadata writes driven mostly by journald's continuous appends. node4 + node5 had drifted to uncapped persistent journald (4 GB each, ~100 KB/s); master/node1-3 were correctly capped at 500M. Node + pod journals already ship to Loki (alloy loki.source.journal), so on-disk journald is pure write-IOPS overhead on the IOPS-bound sdc. Switch journald to Storage=volatile (RAM, RuntimeMaxUse=200M) fleet-wide: - cloud_init.yaml: drop-in 90-oflt-volatile.conf for new nodes (replaces the old persistent seds). - running nodes (master + node1-5): pushed the same drop-in via qm guest exec + journald restart + cleared /var/log/journal. Verified node5: OS-disk writers jbd2/sda1-8 931->46 KB/s, systemd-journal gone (~94% drop); ~4 GB freed each on node4/node5. Logs stay queryable in Loki. Trade-off: a hard crash loses the last unshipped journal. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 08:15:38 +00:00
Viktor Barzin	74819d4061	feat(nvidia): GPU VRAM budget + watchdog to stop T4 overallocation The single time-sliced Tesla T4 has no per-tenant memory isolation, so its ~9 GPU workloads can collectively overallocate VRAM. On 2026-06-02 immich-ml's onnxruntime arena grew to 10.7 GB and silently starved llama-swap, breaking recruiter-responder for ~5h. Viktor asked for memory protection so we don't overallocate GPU memory, and chose to do it at the scheduling level (no device-plugin swap) after weighing HAMi and MPS. Make the scheduler VRAM-aware and add runtime teeth, all repo-native, time-slicing untouched: - Advertise a node extended resource viktorbarzin.me/gpumem (~14000 MiB) via a reconcile null_resource (immediate, apply-time) + hourly re-assert CronJob. - Each always-on GPU tenant declares a gpumem budget (immich-ml 3000, llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500; sum 13300 <= advertised) so the scheduler refuses to co-schedule past the card (overflow -> Pending). - gpu-vram-watchdog Deployment recycles the biggest over-budget tenant ONLY when actual free VRAM < floor. Ships DRY_RUN=true (observe-then-enforce); flip to false after a few cycles look right. - Prometheus alerts GPUVRAMLow / GPUVRAMTelemetryDown / GPUVRAMWatchdogDown -- the 2026-06-02 post-mortem's never-built free-VRAM follow-up. - Docs: ADR-0016 (records why HAMi/MPS were rejected), CONTEXT.md GPU-sharing glossary; fix the stale "whole T4 / scale immich-ml to 0" llama-cpp comment. HITL GPU-node change: apply nvidia FIRST (advertise gpumem), verify the node shows the capacity, THEN the consumer stacks -- the cutover bounces GPU pods. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:57:40 +00:00
Viktor Barzin	1afe41880e	docs: MySQL buffer-pool/limit + nextcloud webcal throttle; VCT drift fixed Reflect the code-oflt MySQL write-reduction work (commit `82c9e69b` + the nextcloud webcal app-data throttle): - MySQL row: buffer pool 1->2Gi, mem limit 4->6Gi, and the nextcloud webcal calendar churn that was ~60% of MySQL's writes (now throttled in oc_calendarsubscriptions.refreshrate - app-data, can regress). - CNPG apply-gotcha note: the mysql_standalone VCT-annotation drift no longer needs -target dodging (now ignore_changes'd on the STS VCT). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:56:04 +00:00
Viktor Barzin	82c9e69b77	dbaas/mysql: 2Gi InnoDB buffer pool + 6Gi limit + ignore VCT drift Cut MySQL's write-IOPS footprint on the contended PVE sdc HDD (code-oflt). Standalone MySQL was the #1 sdc bandwidth writer (~2.8-3.5 MB/s). Live attribution found ~60% of its writes were nextcloud webcal calendar churn (throttled separately at the app layer); this addresses write amplification on the remainder: - innodb_buffer_pool_size 1Gi -> 2Gi: the pool was too small for the ~5.6Gi hot set (Innodb_buffer_pool_wait_free=1.78M = threads stalling for a free page -> constant flush-to-make-room write IOPS). - container memory limit 4Gi -> 6Gi (requests 3->4Gi): the pod was already at ~3.7Gi/4Gi (near OOM) with the 1Gi pool, so the 2Gi pool needs the headroom. One-time MySQL pod restart to apply. - ignore_changes on the StatefulSet volume_claim_template: the VCT is immutable post-creation and pvc-autoresizer rewrites its annotations on the live object, so TF's desired VCT could never apply and errored every broad dbaas apply. Ignoring it (autoresizer owns PVC sizing) removes the long-standing need to -target around it. Applied + verified live: buffer_pool=2.0GiB, limit=6Gi, pod healthy, 24 DBs reachable, restart clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:55:18 +00:00
Viktor Barzin	29bf275cef	state(dbaas): update encrypted state	2026-06-30 07:53:48 +00:00
Viktor Barzin	308a174ad6	docs(networking): record MetalLB .204 (frigate-rtsp go2rtc) allocation PR #17 moved frigate-rtsp to a dedicated MetalLB LoadBalancer IP (10.0.20.204) exposing RTSP 8554 + WebRTC 8555, but the networking doc still listed only four IPs in use / three dedicated. Add the .204 row to the allocation table, bump the counts (five in use, four dedicated, 5-IP layout), and add a LB-IP renumber-checklist entry for the out-of-band consumers (the go2rtc WebRTC candidate on the frigate config PVC and the HA-sofia rtsp_url_template). Note go2rtc cannot use a DNS name in ICE candidates, so the Service annotation is the single source of truth. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:42:27 +00:00
ebarzin	469cdd7507	frigate: expose go2rtc on a dedicated MetalLB LB IP (RTSP 8554 + WebRTC 8555) HA live video from the cluster Frigate hangs/fails because the only path to Frigate is the Traefik HTTP(S) ingress (frigate-lan -> 10.0.20.203), which cannot carry RTSP or WebRTC. The container already listens on 8554+8555 but only RTSP had a Service (NodePort), and WebRTC (8555) was never exposed. Convert frigate-rtsp to a LoadBalancer on a dedicated MetalLB IP (.204, ETP=Local, pod pinned to the GPU node) carrying RTSP 8554 + WebRTC 8555 (TCP+UDP), giving HA Sofia + LAN browsers a stable cross-VLAN endpoint for native HLS/WebRTC live (parity with the Hikvision NVR). Companion non-Terraform steps are in the PR body. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:15:22 +00:00
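
Note on commit 74819d4061 (GPU VRAM budget): the per-tenant gpumem budgets it lists are meant to sum to no more than the advertised node capacity. The arithmetic can be checked with a minimal sketch like the one below. The budget figures are copied from the commit message; the function names are hypothetical, and the advertised capacity, which the commit gives only as "~14000 MiB", is treated as exactly 14000 here as an assumption.

```python
# Per-tenant gpumem budgets in MiB, as listed in commit 74819d4061.
GPUMEM_BUDGETS_MIB = {
    "immich-ml": 3000,
    "llama-swap": 5000,
    "frigate": 2000,
    "immich-server": 1800,
    "portal-stt": 1500,
}

# The commit says "~14000 MiB"; an exact 14000 is an assumption for this sketch.
ADVERTISED_GPUMEM_MIB = 14000


def total_budget(budgets: dict[str, int]) -> int:
    """Sum of all declared tenant budgets, in MiB."""
    return sum(budgets.values())


def fits(budgets: dict[str, int], advertised: int) -> bool:
    """True when the declared budgets do not exceed the advertised capacity."""
    return total_budget(budgets) <= advertised


def headroom(budgets: dict[str, int], advertised: int) -> int:
    """Capacity left unbudgeted, in MiB (negative means overcommitted)."""
    return advertised - total_budget(budgets)


if __name__ == "__main__":
    print(total_budget(GPUMEM_BUDGETS_MIB))                    # 13300
    print(fits(GPUMEM_BUDGETS_MIB, ADVERTISED_GPUMEM_MIB))     # True
    print(headroom(GPUMEM_BUDGETS_MIB, ADVERTISED_GPUMEM_MIB)) # 700
```

Under that assumption, about 700 MiB stays unbudgeted, so an additional always-on tenant requesting more than that would be left Pending by the scheduler, which matches the "overflow -> Pending" behaviour the commit describes.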

4714 commits