infra

Author	SHA1	Message	Date
Viktor Barzin	e0991853e4	valia-sites: 25MB Pages-limit guard; cloudflared: drop removed{} (CI TF <1.7) Two fixes from the first live runs. (1) The sync job now skips a whole site when any file exceeds Cloudflare Pages' 25MB per-file cap, leaving current serving untouched — stem95su's stem_board.html references a 42.9MB stem_video.mp4, which made every run fail; the guard turns that into a loud skip so bridge keeps syncing. (2) The CI terraform is older than 1.7 and rejects removed{} blocks anywhere (pipelines 461/464), so the bridge record handoff was completed with a one-time manual 'tg state rm module.cloudflared.cloudflare_record.bridge_pages' from the main checkout; the block is deleted and the module comment records the manual step. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 12:43:13 +00:00
Viktor Barzin	695e020111	cloudflared: move bridge removed{} to stack root — removed blocks are root-module-only Pipeline 461 failed terraform init: the removed{} handoff block sat in the stack-local module, but Terraform only allows removed blocks in the root module. Same intent, correct position (from = module.cloudflared.cloudflare_record.bridge_pages, destroy=false). Without this the stale state entry would make the next cloudflared apply destroy the record valia-sites now owns. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 12:31:53 +00:00
Viktor Barzin	8b80b4cc41	valia-sites: registry stack for Valia's Pages sites + declarative internal DNS (ADR-0018) Valia keeps asking Viktor to host 1-page sites from her Drive folders; this makes it one map entry. New stacks/valia-sites: per site a CF Pages project + custom domain + proxied CNAME (bridge adopted via import{}), a ConfigMap feed (valia-sites-dns) the technitium ingress-dns-sync script now reconciles internal CNAMEs from (add/update/REMOVE — fixes the add-only stale-record gotcha), and one shared 10-min CronJob that mirrors each Content folder (rclone, drive.readonly, stem95su's guards) and wrangler-deploys ONLY on manifest change (free-tier deploy cap). Scoped CF Pages token + shared rclone conf in secret/valia-sites; the Global API Key never enters a pod. cloudflared forgets bridge's record via removed{} (no destroy). stem95su is in the map dns-parked (manage_dns=false) until its cutover commit. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 12:28:06 +00:00
Viktor Barzin	e1bd111562	rename CF Pages site most.viktorbarzin.me -> bridge.viktorbarzin.me Viktor asked to rename the 'мост' school static site to 'bridge'. New Cloudflare Pages project 'bridge' (bridge-cv2.pages.dev) already deployed and the custom domain attached; this renames the public CNAME (TF resource most_pages -> bridge_pages, destroy+create swaps the record) and the internal split-horizon static CNAME in the ingress-dns-sync CronJob. The old 'most' Pages project and the stale internal 'most' record are removed out-of-band after this applies. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 10:52:30 +00:00
Viktor Barzin	7dd80b6c7c	technitium: mirror most.viktorbarzin.me into the internal zone (CF Pages site) The internal split-horizon zone is authoritative for viktorbarzin.me, so the new Cloudflare Pages site (most.viktorbarzin.me, added for Viktor's 'мост' school static site) NXDOMAINed for every internal client — LAN, VLANs and pods — while resolving fine externally. Per the superset rule, add it as a static CNAME (-> most-6if.pages.dev) in the ingress-dns-sync CronJob next to the mail-auth records, and document the off-infra-site case in dns.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 10:10:46 +00:00
Viktor Barzin	217a54be9d	cloudflared: add most.viktorbarzin.me CNAME for Cloudflare Pages site Viktor asked to host a static HTML site (the 'мост' school project, ОбУ „Отец Паисий", pulled from his Google Drive) on Cloudflare Pages with a custom domain, as a try-out of Pages hosting. The site content is deployed off-infra via wrangler to the Pages project 'most' (most-6if.pages.dev); this CNAME points most.viktorbarzin.me at it. The custom domain is already attached to the Pages project and is waiting on this DNS record to validate. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-03 10:06:33 +00:00
Viktor Barzin	08fb65827c	tripit: set PLACE_PHOTO_PROVIDER=wikipedia — real place preview photos Viktor asked for place photos on the tripit Trip board. The app-side work (add-time photo fetch, board place cards) shipped in tripit v0.106.0, but prod never set PLACE_PHOTO_PROVIDER, so the fake provider would store placeholder PNGs for every hand-added place. Same class of fake-default gap as PLACE_RESOLVER_MODE (set explicitly for the same reason); the ADR-0035 rollout had left both the env flip and its backfill cron undone. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 21:57:21 +00:00
Viktor Barzin	248e186dce	CCTV segment (dCCTV 10.0.30.0/24) on a dedicated pfSense leg for the garage camera Viktor and emo are adding the first owned camera at the Sofia site (HiLook IPC-T241H-C watching the garage / server rack). Viktor asked to finalize emo's plan; the grilling session resolved emo's five open decisions and replaced the doc's 802.1Q-trunk idea with the site idiom: a dedicated physical leg (R730 eno2 -> vmbr2 -> pfSense net3 = dCCTV 10.0.30.1/24), port-based VLAN split on the shared TL-SG105PE, camera default-deny with NTP-only egress, Frigate + ha-sofia as the only consumers. The PVE bridge, pfSense interface, Kea subnet and firewall rules were applied live this session (hand-managed hosts, backed up). This commit records the decision (ADR-0017), the glossary terms (Segment / CCTV segment), the as-built architecture doc, and bumps Frigate's ADR-0016 VRAM budget 2000 -> 2300 MiB for the upcoming NVDEC stream. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 20:01:45 +00:00
ebarzin	9e253d409a	immich(frame-emo): show photos from the last 365 days (was 730) Emil asked his Sofia Portal Mini photo-frame to show only the past year of photos rolling from today, instead of the last two years. Changes ImagesFromDays 730 -> 365 in the frame-emo Settings.yml. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-02 19:05:31 +00:00
Viktor Barzin	21afae85c9	dawarich: dedicated 100/1000 Traefik rate limit (default 10/50 429'd page loads) Viktor saw dawarich throwing 429s through Traefik and asked to loosen the burst for it. The access log confirms the burst pattern: one page load fires the whole fingerprinted-asset tail (SVG store badges, favicons, webmanifest) from a single client IP and trips the default 10 req/s / burst 50 limiter (repro: 80 parallel GETs -> 28x 429). Same remedy as ha-sofia, ActualBudget, noVNC, tripit, health and authentik: dedicated dawarich-rate-limit middleware (average 100 / burst 1000) + skip_default_rate_limit on the dawarich ingress. Also updates the networking.md middleware enumerations (adding the previously undocumented tripit/health limiters alongside dawarich). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 15:03:08 +00:00
Viktor Barzin	91d0213d1a	Merge remote-tracking branch 'forgejo/master' into wizard/excalidraw-export-rename	2026-07-02 14:29:34 +00:00
Viktor Barzin	8fc657f431	excalidraw: migrate image build to GHA -> private ghcr (ADR-0002) The image was still built by hand and pushed to DockerHub (v1..v4), predating the all-builds-off-infra doctrine; Viktor chose to move it onto the standard pipeline while shipping the export/rename feature rather than keep the manual flow. Mirrors the k8s-portal pattern: .github/workflows/build-excalidraw.yml (go test + buildx linux/amd64, pushes ghcr latest+sha), excalidraw ns added to the Kyverno ghcr-credentials allowlist (package is PRIVATE), deployment now pins ghcr :latest with pullPolicy Always + pull secret, Keel force/match-tag/5m annotations seed the metadata (live values win via ignore_changes). DockerHub viktorbarzin/excalidraw-library:v4 stays frozen as the rollback image. Docs: ci-cd.md + .claude/CLAUDE.md image lists updated (also backfilled the missing k8s-portal rows in ci-cd.md). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 14:29:23 +00:00
Viktor Barzin	1cbc1e962b	excalidraw: native export menu + drawing rename Users couldn't see Excalidraw's built-in Save as / Export image options: the app's custom toolbar was drawn exactly on top of the native hamburger menu button, hiding it. Removed the overlay and integrated Back to Library / Save now / Rename into the native menu, so the native export formats (.excalidraw file, PNG, SVG, clipboard) are now reachable. Viktor asked for exports to work via the native Excalidraw feature and for drawings to be renameable by clicking their name. Rename: new PATCH /api/drawings/{id} endpoint (server-side name sanitization, 409 on conflict) + click-to-rename title pill in the editor (updates URL in place) + Rename button/modal in the dashboard. Existing GET/PUT/DELETE semantics unchanged for API compatibility (emo's upload pipeline). Added main_test.go (httptest) covering rename + existing handler behavior; dashboard rows now DOM-built (XSS-safe). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 14:29:10 +00:00
Viktor Barzin	d94f267c93	immich: upgrade v2.7.5 → v3.0.0 (postgres → vectorchord 0.4.3, frames → immich_v3 tag) Viktor asked to upgrade Immich to the just-released v3.0.0 (release notes, migration guide and release discussion #29439 reviewed — no config-breaking changes for this stack: we already use the split MACHINE_LEARNING_PRELOAD vars, don't set DB_VECTOR_EXTENSION, OAuth goes through Authentik over HTTPS, and the GPU node's CPU meets the new x86-64-v2 requirement). The Immich Postgres image moves to VectorChord 0.4.3 to match the upstream v3 reference stack (0.3.0 is still within v3's supported range '>=0.3 <2'; Immich upgrades the extension itself at startup). Both photo frames switch to ImmichFrame's immich_v3 compatibility tag because every versioned ImmichFrame release (≤ v1.0.33.0) crashes deserializing Immich v3 API responses; repin to a versioned tag once upstream ships stable v3 support. Deployment images are Keel-managed (KEEL_IGNORE_IMAGE, policy=patch), so this commit is the source-of-truth record; the live rollout happens via kubectl set image in the same session. Pre-upgrade pg_dumpall taken (job postgresql-backup-pre-v3). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 14:18:22 +00:00
Viktor Barzin	6f03ccd1aa	excalidraw: grant emo-browser SA port-forward for drawing uploads Viktor asked to fix emo's permission so his Claude can upload to the Excalidraw service. emo's recent sessions show the documented upload recipe (kubectl port-forward svc/draw + X-Authentik-Username header, from his ~/.claude/CLAUDE.md) failing with: pods/portforward forbidden for system:serviceaccount:chrome-service:emo-browser in namespace excalidraw because his default kubeconfig is the read-only emo-browser SA (its port-forward grant covers only chrome-service) and his old admin kubeconfig at /home/emo/code/config expired and was removed. Add a namespace-scoped Role (pods/portforward create) + RoleBinding for that SA in the excalidraw namespace, mirroring the 2026-06-28 chrome-service grant. Trade-off (any-user drawings via the trusted username header) documented in the file and accepted. Also record the grant in docs/architecture/chrome-service.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 11:08:28 +00:00
Viktor Barzin	a64d2ba2b9	upgrades: fix hourly gotenberg error + cap update notifications at weekly Viktor was getting upgrade-error Slack messages every hour and wants update notifications at most weekly. Root cause of the errors: Keel kept trying to roll gotenberg 8.25->8.25.1 in paperless-ngx but kyverno's require-trusted-registries denied it — gotenberg/* (and apache/*, which tika will hit next) were never allowlisted, and Keel's Slack notifier at info level re-posted the identical failure to #general on every hourly poll since Jun 28. Changes: allowlist gotenberg/* + apache/* so the patch applies cleanly; disable Keel's direct Slack notifier and replace failure visibility with a KeelUpdateFailing Loki-ruler alert (alert-on-change: one notification plus the daily digest, never an hourly drip); remove diun's Slack notifier whose default message @channel-pinged #image-updates for every new upstream tag every 6h (the n8n upgrade-agent webhook feed is untouched). The k8s upgrade report is already weekly (Mon 06:07 UTC). Paperless-ngx itself stays paused (keel policy=never, user-managed) while the ingest runs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 07:16:50 +00:00
Viktor Barzin	dab307f9f8	Merge remote-tracking branch 'origin/master'	2026-07-02 05:39:15 +00:00
Viktor Barzin	f1e81772d5	broker-sync: repoint image to ghcr (was frozen on pre-migration DockerHub) The nightly ibkr sync failed with 'No such command ibkr': every broker-sync CronJob still pulled viktorbarzin/broker-sync:latest from DockerHub, which nothing has pushed to since the ADR-0002 move to GHA->ghcr on 2026-06-13 — the jobs were silently running a frozen pre-ibkr build. The migration had allowlisted only the wealthfolio namespace for the private ghcr.io/viktorbarzin/wealthfolio-sync image, so broker-sync also lacked pull credentials. Repoint the image, add ghcr-credentials imagePullSecrets to all eight CronJobs, and allowlist the broker-sync namespace (wealthfolio stays — its own monthly sync pulls the same image). Related: code-9ko8. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 05:31:00 +00:00
Viktor Barzin	ac41e7c017	nvidia: run advertise-gpumem provisioner under bash (dash rejects pipefail) First apply of ADR-0016 failed: terraform local-exec defaults to /bin/sh, which on Ubuntu is dash — 'set -euo pipefail' exits 2 before running kubectl. Pin the interpreter to bash. Everything else in the gpumem apply succeeded. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-02 05:21:47 +00:00
Viktor Barzin	968b2b9c64	Merge remote-tracking branch 'origin/master' into wizard/gpu-vram-budget	2026-07-02 05:18:34 +00:00
Viktor Barzin	a12b09af04	broker-sync: pin data-mounting CronJobs to k8s-node4 (stop nightly RWO wedge) All broker-sync CronJobs share one RWO proxmox-lvm volume. With free scheduling the nightly 02:00-04:15 runs land on different nodes, forcing a detach/attach cycle whose QMP hotplug intermittently ghost-attaches on disk-heavy VMs — every job then sits in ContainerCreating for hours (happened 2026-06-30, 07-01 and again 07-02; fires PodsStuckContainerCreating and skips the day's trade syncs). Pinning all seven volume-mounting jobs to k8s-node4 (fewest CSI disks, 11) makes the volume attach once and stay put — no hotplug dance, no wedge. version_probe mounts nothing and stays unpinned. Durable fix for the recurrence tracked in beads code-9ko8. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 05:16:38 +00:00
Viktor Barzin	3c85af2dc2	fire-countdown dashboard: SQL guards + tax regime + honesty fixes From the flaw-hunt workflow (all verified): - Projected-FIRE-date panels (solo/household/family) now guard savings £/yr: 0 / empty / negative all render "Set savings £/yr" instead of a blank tile, a SQL error, or a nonsensical past date ("Jan 1849"). Verified across cases. - New "Tax regime" panel surfaces the per-country jurisdiction — 14/22 countries fall back to the neutral 'nomad' 1% assumption, which was previously invisible. - Intro no longer hard-codes "£139k pension" (contradicted the £328k tranche panel); pension value is now only shown data-bound in the tranche panel. - Intro adds caveats: Anca's spend is an estimate (pending live re-pull), and non-modelled countries use the nomad tax fallback. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-01 22:44:17 +00:00
Viktor Barzin	339f5d89b9	onlyoffice: decommission (stack destroyed, dir removed) The document server had been deliberately scaled to 0/0 for 184 days, but its ingress kept the uptime-kuma monitors alive, so 'onlyoffice down' showed up in every daily alert digest. Viktor approved tearing it down. terragrunt destroy ran clean (11 resources) before this commit; the kuma monitors auto-prune with the ingress. Also drops the onlyoffice/* image prefix from the kyverno trusted-registries allowlist, the service-catalog rows, and updates the nextcloud collabora comment. Document data (if any) remains on the PVE NFS share. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-01 22:35:22 +00:00
Viktor Barzin	3c476dab32	postiz+portal: remove broken alert sources (stale backup CronJob, bogus scrape annotations) Viktor is getting daily Slack alert noise; these two were the recurring generators. The postiz-postgres-backup CronJob still dumped from the old in-namespace postiz-postgresql service that was removed in the CNPG migration (2026-06-28) — it failed every night at 03:00 and re-fired BackupCronJobFailed each day. The postiz DB now lives on the shared CNPG cluster and is already covered by the dbaas per-db dumps, so the CronJob (and its NFS backup volume) is redundant and removed rather than repaired. portal-stt/portal-tts advertised prometheus.io scrape annotations that never worked: the deployed Speaches build 404s /metrics, and openai-edge-tts has no metrics at all (its annotation pointed at a JSON endpoint, which fails exposition parsing regardless). Both produced a permanently firing ScrapeTargetDown. Annotations removed until the apps actually serve metrics. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-01 22:35:21 +00:00
Viktor Barzin	5a312563c6	monitoring/wealth: dash the in-progress year on the hourly-rate panel The current, still-accruing calendar year read misleadingly high (e.g. 2026 at 5 months showed £149/h gross, above all of 2025) because the full-year bonus - paid every March - plus front-loaded quarterly RSU vests get divided by only the months worked so far. It settles lower as the year completes. Split each line into a solid series (complete years) and a dashed series (the latest, still-accruing year), so the provisional point is visually flagged. The split auto-detects the in-progress year (latest year with < 12 months of payslips), so it needs no per-year maintenance. Panel description now explains the caveat. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 12:45:51 +00:00
Viktor Barzin	28984dda9a	monitoring/wealth: add per-year effective hourly-rate panel (gross vs net) Viktor wanted to see, on the wealth dashboard, the hourly wage he earned each year - both gross and net - with year on the X axis. New timeseries (line) panel "Effective hourly rate - gross vs net": - hourly = annual pay / hours worked; hours = contractual 40h/week (2,080h per full year, confirmed from the Facebook/Meta UK offer letter: Mon-Fri 09:00-18:00 less a 1h lunch), prorated by the months actually worked so partial years (2019, 2020, 2026) read correctly. - Gross = gross_pay incl. notional RSU vest; Net = take-home. - timeFrom 10y so all years show under the dashboard's default 180d range. Source data: a duplicate March-2023 payslip (Paperless doc 347, a re-upload of doc 33) was removed separately, so 2023 is no longer double-counted; this also corrects the existing net-pay panel. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 12:28:46 +00:00
Viktor Barzin	82371d1ef8	dbaas/mysql: innodb_doublewrite=DETECT_ONLY to halve page-flush writes MySQL device-write investigation (code-oflt): after the nextcloud webcal throttle settled (the earlier 3.4-8.8 MB/s were post-restart transients), MySQL is ~1.74 MB/s at the InnoDB level — and HALF of that (~0.86 MB/s, ~55 pages/s) is the doublewrite buffer writing every flushed page twice. Redo is negligible (0.01 MB/s), no temp-table spilling. Set innodb_doublewrite=DETECT_ONLY (dynamic, no restart; persisted in the cnf): InnoDB stops writing full page CONTENT to the doublewrite buffer (~halves MySQL's page-flush writes on the IOPS-bound sdc) but keeps torn-page DETECTION metadata — a crash-torn page is flagged on recovery (restore from the daily mysqldump) rather than silently corrupt. Chosen over full OFF: same write saving, keeps detection, and OFF requires a shutdown ("cannot change to OFF if doublewrite is enabled"). Acceptable risk given the PERC BBU cache + UPS (in-flight writes complete on power loss) + daily per-db backups. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 08:47:09 +00:00
Viktor Barzin	74819d4061	feat(nvidia): GPU VRAM budget + watchdog to stop T4 overallocation The single time-sliced Tesla T4 has no per-tenant memory isolation, so its ~9 GPU workloads can collectively overallocate VRAM. On 2026-06-02 immich-ml's onnxruntime arena grew to 10.7 GB and silently starved llama-swap, breaking recruiter-responder for ~5h. Viktor asked for memory protection so we don't overallocate GPU memory, and chose to do it at the scheduling level (no device-plugin swap) after weighing HAMi and MPS. Make the scheduler VRAM-aware and add runtime teeth, all repo-native, time-slicing untouched: - Advertise a node extended resource viktorbarzin.me/gpumem (~14000 MiB) via a reconcile null_resource (immediate, apply-time) + hourly re-assert CronJob. - Each always-on GPU tenant declares a gpumem budget (immich-ml 3000, llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500; sum 13300 <= advertised) so the scheduler refuses to co-schedule past the card (overflow -> Pending). - gpu-vram-watchdog Deployment recycles the biggest over-budget tenant ONLY when actual free VRAM < floor. Ships DRY_RUN=true (observe-then-enforce); flip to false after a few cycles look right. - Prometheus alerts GPUVRAMLow / GPUVRAMTelemetryDown / GPUVRAMWatchdogDown -- the 2026-06-02 post-mortem's never-built free-VRAM follow-up. - Docs: ADR-0016 (records why HAMi/MPS were rejected), CONTEXT.md GPU-sharing glossary; fix the stale "whole T4 / scale immich-ml to 0" llama-cpp comment. HITL GPU-node change: apply nvidia FIRST (advertise gpumem), verify the node shows the capacity, THEN the consumer stacks -- the cutover bounces GPU pods. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:57:40 +00:00
Viktor Barzin	82c9e69b77	dbaas/mysql: 2Gi InnoDB buffer pool + 6Gi limit + ignore VCT drift Cut MySQL's write-IOPS footprint on the contended PVE sdc HDD (code-oflt). Standalone MySQL was the #1 sdc bandwidth writer (~2.8-3.5 MB/s). Live attribution found ~60% of its writes were nextcloud webcal calendar churn (throttled separately at the app layer); this addresses write amplification on the remainder: - innodb_buffer_pool_size 1Gi -> 2Gi: the pool was too small for the ~5.6Gi hot set (Innodb_buffer_pool_wait_free=1.78M = threads stalling for a free page -> constant flush-to-make-room write IOPS). - container memory limit 4Gi -> 6Gi (requests 3->4Gi): the pod was already at ~3.7Gi/4Gi (near OOM) with the 1Gi pool, so the 2Gi pool needs the headroom. One-time MySQL pod restart to apply. - ignore_changes on the StatefulSet volume_claim_template: the VCT is immutable post-creation and pvc-autoresizer rewrites its annotations on the live object, so TF's desired VCT could never apply and errored every broad dbaas apply. Ignoring it (autoresizer owns PVC sizing) removes the long-standing need to -target around it. Applied + verified live: buffer_pool=2.0GiB, limit=6Gi, pod healthy, 24 DBs reachable, restart clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:55:18 +00:00
ebarzin	469cdd7507	frigate: expose go2rtc on a dedicated MetalLB LB IP (RTSP 8554 + WebRTC 8555) HA live video from the cluster Frigate hangs/fails because the only path to Frigate is the Traefik HTTP(S) ingress (frigate-lan -> 10.0.20.203), which cannot carry RTSP or WebRTC. The container already listens on 8554+8555 but only RTSP had a Service (NodePort), and WebRTC (8555) was never exposed. Convert frigate-rtsp to a LoadBalancer on a dedicated MetalLB IP (.204, ETP=Local, pod pinned to the GPU node) carrying RTSP 8554 + WebRTC 8555 (TCP+UDP), giving HA Sofia + LAN browsers a stable cross-VLAN endpoint for native HLS/WebRTC live (parity with the Hikvision NVR). Companion non-Terraform steps are in the PR body. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:15:22 +00:00
Viktor Barzin	9ea9cae073	rightsize: reconcile batch-2/3 stacks blocked by killed #427 (job-hunter, wealthfolio, f1-stream) Memory limits were committed (batch 2/3) but pipeline #427 was killed mid-apply and the local homelab tf apply hit a stale backend-init; this comment-only diff re-triggers a clean CI apply for the three stacks so live matches master (job-hunter 768Mi, wealthfolio 512Mi, f1-stream 384Mi). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 15:59:41 +00:00
Viktor Barzin	7cc9cde5b1	external-secrets: enable ESO Vault token cache to cut sdc write churn Add --enable-vault-token-cache to the ESO controller (a graduated, non-experimental flag in chart 2.6.0). Until now ESO authenticated to Vault with login -> lookup-self -> revoke-self on every secret fetch. Across 92 ExternalSecrets refreshing every 15m that measured ~0.22 logins/s + ~0.22 revoke-self/s on the active Vault member, and each cycle is a token create+revoke (plus its lease) written to the Raft log on all three members. Those fsync-heavy writes land on the contended PVE RAID1 7200rpm HDD (sdc) -- one of the write sources behind the recurring control-plane flaps (code-oflt write-reduction). The eso kubernetes-auth role already issues a 240h periodic, unlimited-use token, so the churn was pure waste: ESO discarded a perfectly good token after a single use. With token caching ESO mints one token and reuses/renews it, collapsing logins from ~13/min to a handful per token lifetime. Verified live: vault cache initialized, 112/113 ExternalSecrets Ready (the one failure, instagram-poster, is pre-existing data drift unrelated to auth), logins dropped to ~0 after warm-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 15:32:37 +00:00
Viktor Barzin	bc626a2d89	rightsize: raise OOM-tight memory limits (batch 3/N — spike protection) shlink 512->704Mi, linkwarden 1Gi->1280Mi, chrome-service 2Gi->2624Mi, forgejo 4Gi->5Gi, f1-stream 256->384Mi. All were request==limit with 30d peak at 91-100% of the ceiling — a spike would OOM-kill them. Raising the limit (now Burstable, request<limit) gives real burst headroom. This is the genuine 'don't OOM on occasional spike' fix. Small add (~2.2Gi limits) vs the ~20Gi of fat removed in batches 1-2, so net overcommit keeps dropping. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 15:28:11 +00:00
Viktor Barzin	418d1efb4b	rightsize: trim over-provisioned memory (batch 2/N) claude-agent-service 12Gi->3Gi (peak 585Mi — the single biggest fat, ~9Gi of limit-overcommit removed), job-hunter 1280->768Mi (kept chromium headroom; 30d peak 118Mi), fire-planner 1024->320Mi, wealthfolio 1Gi->512Mi (kept history-growth headroom). Burstable, limits kept >= generous peak headroom, never below peak. ~10.7Gi of limit overcommit removed. paperless-ai intentionally LEFT at 4Gi (documented in-process RAG model load). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 15:27:17 +00:00
Viktor Barzin	c3553731c7	dbaas: CNPG write-reduction — archive_timeout=0, commit_delay, wal_compression=zstd Part of code-oflt (cut sdc write IOPS before the SSD move; analysis #6922). - archive_timeout 300->0: CNPG forces archive_mode=on but .spec.backup is empty (no ObjectStore), so a 16MB WAL segment switch every 5min shipped NOWHERE = ~4.6 GB/day of pure-waste WAL on the contended sdc. archive_mode stays CNPG-on (reserved); 0 just stops the timed switch. Daily pg_dump cron unchanged. - commit_delay 0->2500us: group-commit coalesces concurrent fsyncs. SAFE for every DB incl financial -- data still fsynced before COMMIT acks, only <=2.5ms added latency under concurrency. - wal_compression pglz->zstd: ~30-50% smaller full-page images. All sighup-reloadable. Applied via targeted apply of module.dbaas.null_resource.pg_cluster (trigger bumped) to avoid the pre-existing mysql VCT drift that breaks broad dbaas applies. Refs: code-oflt. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 15:16:38 +00:00
Viktor Barzin	5d059786a1	rightsize: trim over-provisioned memory limits+requests (batch 1/N) claude-breakglass 4Gi->512Mi, stirling-pdf 1536->512Mi, insta2spotify 2Gi->256Mi, recruiter-responder 768->256Mi. These idle/utility services had memory LIMITS sitting 4-15x above their 30d peak, inflating cluster limit-overcommit to 142% across the 5 post-node6 nodes. Burstable (request<limit), limits capped at ~peak x1.5 (never below peak), so no OOM risk (verified zero OOMKills cluster-wide in 30d). Reduces phantom limit overcommit + frees scheduler requests. Follows the 3-reviewer adversarial review: raising limits on an already-overcommitted cluster worsens correlated node-OOM; the real fix is trimming the fat. Limits only lowered where peak is far below; tuned/DB/GPU limits untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 14:46:58 +00:00
Viktor Barzin	256122ff5b	monitoring: make ClusterCannotTolerateNonGpuNodeLoss topology-agnostic The N-1 capacity alert was hardcoded to k8s-node[234]/[1234], predating node5/node6 (added 2026-05-26) and the 2026-06-29 removal of node6 — so it no longer reflected the real cluster and gave no trustworthy N-1 signal. Generalize node selection via metrics: GPU node by nvidia_com_gpu capacity, drained/cordoned by kube_node_spec_unschedulable, down by the Ready condition. Control-plane excluded by name (node!~"k8s-master.*") because this cluster's kube-state-metrics exposes neither kube_node_role nor node taints/labels (verified live). Also fixes a latent bug (multiplying by kube_node_spec_unschedulable==0 zeroed the result) and refreshes the remediation text (krr, not the removed Goldilocks). With node6 gone the rule now correctly evaluates LHS 31.0Gi > RHS 27.9Gi (fires) — the honest signal that removing node6 tightened requests-based N-1; trimming the inflated requests clears it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 12:34:01 +00:00
Viktor Barzin	c0e0911afa	dbaas: bump pg_cluster trigger so the checkpoint/WAL params actually apply `a2c8f906` added checkpoint_timeout=15min + max/min_wal_size to the CNPG Cluster YAML, but the cluster is applied via null_resource.pg_cluster + local-exec kubectl apply, which only re-runs when its `triggers` change. The YAML edit didn't bump a trigger, so the change was inert and never applied (incl. via CI). Bump the pg_params trigger so the kubectl apply re-runs and CNPG hot-reloads the new params (reloadable, no restart). Landing it via a targeted apply (-target=null_resource.pg_cluster) to avoid 3 pre-existing unrelated drifts in this stack -- notably a mysql_standalone volumeClaimTemplate annotation diff the apiserver rejects as immutable, which is what fails broad dbaas applies (and silently blocked `a2c8f906`). Refs: code-oflt. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 12:25:37 +00:00
Viktor Barzin	a2c8f906ec	dbaas: stretch CNPG checkpoint timer 5->15min + raise WAL size (cut sdc write IOPS) Viktor asked to reduce CNPG checkpoint/WAL writes as part of the sdc IOPS-isolation work (code-oflt). The IOPS deep-dive found CNPG checkpoints fire 100% on the 5-min timer (checkpoints_timed >> checkpoints_req), each triggering a full-page-write burst + flush onto the contended 7200rpm sdc spindle -- a top write-IOPS source after etcd. Set checkpoint_timeout=15min + max_wal_size=4GB + min_wal_size=1GB so checkpoints fire ~1/3 as often (fewer FPW) and WAL segments are recycled rather than churned. All three are sighup-reloadable -> CNPG applies them without a restart or failover. checkpoint_completion_target stays 0.9 so each checkpoint's IO is still smeared across the interval. Bounded recovery-time tradeoff (more WAL to replay on crash), acceptable for the write relief. wal_compression left at pglz ('on') pending image zstd-support verification. Also refreshes the stale CNPG tuning note in .claude/CLAUDE.md (it listed shared_buffers=512MB / effective_cache_size=1536MB / 2Gi; live is 1024MB / 2560MB / 3Gi). Refs: code-oflt (etcd/sdc IO isolation). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 11:41:09 +00:00
Viktor Barzin	3398873a16	k8s-upgrade: move version-check cadence from daily to weekly (Sun check, Mon report) Viktor asked to move the upgrade checks to weekly. With the actionable-vs-held gate now quieting the routine 'held' churn (e.g. 1.36), a daily check + attempt buys little; weekly is enough. Accepted trade-off: k8s patch (incl. security) uptake now lags up to 7 days instead of <=1. - var.schedule: 0 23 * * * -> 0 23 * * 0 (detector: weekly Sunday 23:00 UTC) - var.report_schedule: 7 6 * * * -> 7 6 * * 1 (report: Monday 06:07 UTC, ~7h after the Sunday check, so nightly-report.py's ~25h staleness threshold stays valid AND still flags a missed weekly run; no STALE_SECONDS change needed) The report CronJob keeps its historical name k8s-upgrade-nightly-report (rename = churn). Cadence wording updated across main.tf comments, nightly-report.py docstring, and the runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 06:22:20 +00:00
Viktor Barzin	e43e64c666	kyverno: disable reports-controller to stop etcd ephemeralreport load Viktor flagged not wanting to wear the single non-RAID SSD with useless etcd writes if etcd moves there. Investigation found the avoidable load is kyverno reporting: the 2026-06-12 etcd-load-reduction disabled the report features but left the reports-controller running (default --enableReporting + --validatingAdmissionPolicyReports=true), so the 2026-06-21 kyverno upgrade left a one-time pile of ~10.5k cluster/namespaced ephemeralreports (~114MB in etcd) that nothing reaps (aggregation off). Listing that range starves etcd's fdatasync enough to flap the apiserver (observed live 2026-06-28). Disable the reports-controller outright (reportsController.enabled=false), completing the 2026-06-12 intent. Reports are not consumed (violations surface via Loki->Slack); admission enforcement (deny-* policies) and Keel mutation are independent of it. The ~10.5k stale reports already in etcd are cleared separately (throttled, out-of-band) since bulk-deleting them is itself etcd-heavy. Refs: code-oflt (etcd IO isolation), code-at4f (etcd starvation alerting). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 05:35:36 +00:00
Viktor Barzin	cf42042cba	monitoring: re-trigger apply to persist state after CI cancel-race No-op comment touch in loki.tf to force a clean `terragrunt apply monitoring`. The pfSense egress-monitoring apply (commit `7fe2d978`, CI pipeline #414) was cancelled by a newer push and SIGKILLed mid-helm-upgrade: the live resources applied (probes green, rules loaded) but the Terraform state write and the helm release finalize were lost, leaving the prometheus release stuck in pending-upgrade (manually unstuck). This commit re-applies the unchanged monitoring stack so state matches live, with zero resource changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 16:58:49 +00:00
Viktor Barzin	f92075b7c5	fire-planner: solve FIRE targets to age 100 (horizon 60→72) Viktor plans to live to 100, so the portfolio must last that long. The fire-targets CronJob was solving a 60-year horizon (≈ to age 88); set it to 72 (retire ~age 28 → age 100). Raises every case's FIRE number modestly (more years to fund). A one-off in-cluster job re-solves the existing rows at the new horizon. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 16:49:20 +00:00
Viktor Barzin	7fe2d9780e	monitoring: add pfSense WAN/egress alerting + probes Some checks failed ci/woodpecker/push/default Pipeline was canceled Details On 2026-06-27 pfSense (Proxmox VMID 101) stopped passing internet egress for ~20 min while internal routing + Unbound stayed up; recovery needed a manual reboot and NOTHING alerted — there was no egress probe and the cloudflared replica metric stayed green. Add first-class egress monitoring so the next occurrence pages in ~2 min instead of being noticed by a human. - blackbox-exporter: new icmp_egress + dns_external probe modules (+ NET_RAW so ICMP can use raw sockets). - Three in-cluster probe jobs exercising the pod->node->pfSense-NAT path that failed: wan-gateway-icmp (192.168.1.1), internet-egress-icmp (9.9.9.9 + 1.1.1.1), internet-egress-dns (cloudflare.com via both resolvers). - Prometheus alerts (group "Egress / pfSense"): WANGatewayUnreachable, InternetEgressDown (both providers dead), ExternalDNSResolutionDown, EgressOnlyDivergence (reuses the existing t3-probe legs — the incident's exact "external down while internal up" signature), PfSenseVMDown. - Loki ruler: CloudflaredTunnelConnLoss — the canary that fired first; the cloudflared replica metric is blind to tunnel-connection loss. Threshold calibrated against live Loki (steady-state ~2/6h vs 37-85/5m in-incident). - Alertmanager inhibit: WAN/egress-down suppresses the downstream egress symptom alerts so one root alert pages, not a storm. - Runbook docs/runbooks/pfsense-egress.md + .claude/CLAUDE.md. All metric names + the cloudflared threshold verified against live Prometheus/Loki. Pure GitOps, no pfSense change. Firewall-side hardening (dpinger retargeting, failover gateway, pfSense syslog -> Loki) is deferred and documented in the runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 16:46:30 +00:00
Viktor Barzin	6f042ee239	fix(fire-planner): grafana fire-planner-pg datasource survives pw rotation Some checks failed ci/woodpecker/push/default Pipeline failed Details The fire-planner-pg Grafana datasource baked the rotating fire_planner DB password into its provisioning ConfigMap at terraform plan-time, so on every 7-day static-role rotation the password went stale and ALL fire-planner-pg dashboards (fire-planner, cost-of-living, and the new wealth FIRE Countdown) silently failed with "password authentication failed for user fire_planner" until the next stack apply. Switch to the same live-env pattern wealth-pg / payslips-pg already use: - new ExternalSecret grafana-fire-planner-pg-creds (monitoring ns, Reloader match) mirrors the rotating Vault static-creds/pg-fire-planner password - datasource ConfigMap now references $__env{FIRE_PLANNER_PG_PASSWORD} - Grafana mounts it via envFromSecrets; reloader (auto) restarts Grafana on rotation so the provisioned datasource never goes stale Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 16:14:42 +00:00
Viktor Barzin	35c0057d83	chrome-service: raise noVNC sidecar memory limit 96Mi->256Mi (fix OOMKill) The noVNC sidecar (x11vnc + websockify) was OOMKilled (exit 137) repeatedly whenever someone actively opened chrome.viktorbarzin.me — the view connected then froze/hung. Idle usage is ~37Mi, but x11vnc + websockify framebuffer/encode buffers spike past the 96Mi cap when streaming the 1280x720 screen to a client. Raised request 32Mi->64Mi, limit 96Mi->256Mi (Burstable, aux tier). Already applied live via a transient kubectl patch (Recreate rollout, verified 0 restarts since); this lands the durable state so the next apply / daily drift-detection doesn't revert it to 96Mi. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 15:39:17 +00:00
Viktor Barzin	2e50c1235c	chrome-service: grant emo shared browser access (noVNC + homelab browser CLI) Viktor asked to give emo access to the cluster's headed Chrome so he can fill in forms and get past anti-bot / captcha pages. emo was deliberately locked out of chrome-service (noVNC Authentik allowlist was Viktor-only + his power-user RBAC has no pods/portforward). Viktor's explicit decision: SHARE his existing browser rather than stand up an isolated per-user instance, accepting that emo can therefore reach Viktor's warmed logged-in sessions (CDP has no per-context auth, so the single shared persistent profile is reachable by anyone who can drive the browser). emo's CLI use is hands-off (his agent can run it unattended). - authentik: add emo (emil.barzin / emil.barzin@gmail.com) to CHROME_ALLOWED so the admin-services-restriction policy admits him to chrome.viktorbarzin.me (noVNC). Reverses the prior Viktor-only lock; comment updated to record why. - chrome-service/rbac.tf (new): emo-browser ServiceAccount + long-lived token (dashboard-sa.tf pattern), a chrome-service-portforward Role granting pods/portforward, and a cluster read-only binding (oidc-power-user-readonly) so the SA can resolve the Service and emo's normal read access doesn't regress. - t3-provision-users.sh: install_browser_kubeconfig installs a dual-context kubeconfig for any user with a <user>-browser SA — SA token as the default context (non-interactive, works headless), personal OIDC retained as the oidc@homelab named context. emo's OIDC-only kubeconfig can't authenticate the headless agent session that homelab browser needs. - docs/architecture/chrome-service.md: document the shared-browser multi-user access model, the session-exposure trade-off, and how to grant/revoke a user. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 15:20:07 +00:00
Viktor Barzin	50077b43d4	paperless-ngx: drop TASK_WORKERS 6->4 (6 OOMKilled the pod mid-import) 6 OCR workers crept past the 8Gi per-container memory cap over ~6h and OOMKilled paperless at 15:00 during the Emo bulk import. The import auto-recovered (the consume dir lives on the PVC, so a restart re-scans and reprocesses — nothing lost), but it left the queue inflated with re-queued duplicates and spiked etcd on each restart. The 8Gi cap is the shared edge-tier `tier-defaults` LimitRange, not worth raising for one namespace. 4 workers fit with headroom (4 measured ~1.3Gi). Matches the value applied live via `kubectl set env` during incident response; this removes the drift so the next apply keeps it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 15:06:46 +00:00
Viktor Barzin	8236ae309d	postiz: reconcile HCL to live (adopt unmerged stack config), keep parked postiz's live deployment (Helm + Temporal + Elasticsearch + Authentik OIDC + static-DB password) came from the never-merged branch `wizard/postiz-cnpg-oidc`, so master's HCL was stale and a `terragrunt apply` would have DESTROYED the stack. This lands that postiz config to master so HCL == state == live (CI green; destroy-landmine gone). Kept PARKED (postiz + temporal replicas = 0): IG-via-postiz is Meta- blocked (it hardcodes retired Instagram scopes → OAuth "Invalid Scopes"), which is why it was parked; IG runs via the instagram-poster service. To revive later: flip postiz `replicaCount` + temporal `replicas` back to 1 and re-check image pins. Notes captured in this reconcile: - ES image pinned to 7.17.28 (the branch's 7.17.24 was a DOWNGRADE vs the live data → ES refused to start "cannot downgrade node 7.17.28→7.17.24"; caught + rolled back during this work). - The 4 Authentik resources (app/provider/group/binding) were re-imported into state (adopted, not recreated — no duplicate AK objects); the obsolete `external_secret_jwt` ExternalSecret was removed (Retain → its synced secret was kept). - Vault-side cleanup (removing the unused pg-postiz rotated role) is deliberately NOT included here — deferred, postiz uses a static secret/postiz database_url. State was already reconciled by a local `scripts/tg apply`; this commit is the HCL catch-up (CI re-apply is a no-op). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 12:54:59 +00:00
Viktor Barzin	e518ada3d4	authentik: repoint to overlay patch3 (all-iOS SFE + SFE social links) + docs global.image -> 2026.2.4-patch3. Old iPad Chrome (and any iOS browser) now gets the SFE too, and the SFE login shows social-login buttons (emo is Google-only with no password, so the password form alone was a dead end). Docs: .claude/CLAUDE.md + authentication.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:53:26 +00:00


1639 commits