infra

Author	SHA1	Message	Date
Viktor Barzin	684ca4527c	docs(CLAUDE.md): T4 now has a VRAM budget + watchdog (ADR-0016, dry-run); note llama-swap budget miscalibration Session wrap-up doc sync: the Immich note still claimed the shared T4 had no VRAM isolation. Record the gpumem budget/watchdog shipped earlier today, that the watchdog is observe-only, and that budgets need a retune (llama-swap's real 16k-ctx resident is ~7GB, not 4.35) before arming. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-02 15:20:06 +00:00
Viktor Barzin	21afae85c9	dawarich: dedicated 100/1000 Traefik rate limit (default 10/50 429'd page loads) Viktor saw dawarich throwing 429s through Traefik and asked to loosen the burst for it. The access log confirms the burst pattern: one page load fires the whole fingerprinted-asset tail (SVG store badges, favicons, webmanifest) from a single client IP and trips the default 10 req/s / burst 50 limiter (repro: 80 parallel GETs -> 28x 429). Same remedy as ha-sofia, ActualBudget, noVNC, tripit, health and authentik: dedicated dawarich-rate-limit middleware (average 100 / burst 1000) + skip_default_rate_limit on the dawarich ingress. Also updates the networking.md middleware enumerations (adding the previously undocumented tripit/health limiters alongside dawarich). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 15:03:08 +00:00
Viktor Barzin	91d0213d1a	Merge remote-tracking branch 'forgejo/master' into wizard/excalidraw-export-rename	2026-07-02 14:29:34 +00:00
Viktor Barzin	8fc657f431	excalidraw: migrate image build to GHA -> private ghcr (ADR-0002) The image was still built by hand and pushed to DockerHub (v1..v4), predating the all-builds-off-infra doctrine; Viktor chose to move it onto the standard pipeline while shipping the export/rename feature rather than keep the manual flow. Mirrors the k8s-portal pattern: .github/workflows/build-excalidraw.yml (go test + buildx linux/amd64, pushes ghcr latest+sha), excalidraw ns added to the Kyverno ghcr-credentials allowlist (package is PRIVATE), deployment now pins ghcr :latest with pullPolicy Always + pull secret, Keel force/match-tag/5m annotations seed the metadata (live values win via ignore_changes). DockerHub viktorbarzin/excalidraw-library:v4 stays frozen as the rollback image. Docs: ci-cd.md + .claude/CLAUDE.md image lists updated (also backfilled the missing k8s-portal rows in ci-cd.md). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 14:29:23 +00:00
Viktor Barzin	1cbc1e962b	excalidraw: native export menu + drawing rename Users couldn't see Excalidraw's built-in Save as / Export image options: the app's custom toolbar was drawn exactly on top of the native hamburger menu button, hiding it. Removed the overlay and integrated Back to Library / Save now / Rename into the native menu, so the native export formats (.excalidraw file, PNG, SVG, clipboard) are now reachable. Viktor asked for exports to work via the native Excalidraw feature and for drawings to be renameable by clicking their name. Rename: new PATCH /api/drawings/{id} endpoint (server-side name sanitization, 409 on conflict) + click-to-rename title pill in the editor (updates URL in place) + Rename button/modal in the dashboard. Existing GET/PUT/DELETE semantics unchanged for API compatibility (emo's upload pipeline). Added main_test.go (httptest) covering rename + existing handler behavior; dashboard rows now DOM-built (XSS-safe). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 14:29:10 +00:00
Viktor Barzin	d94f267c93	immich: upgrade v2.7.5 → v3.0.0 (postgres → vectorchord 0.4.3, frames → immich_v3 tag) Viktor asked to upgrade Immich to the just-released v3.0.0 (release notes, migration guide and release discussion #29439 reviewed — no config-breaking changes for this stack: we already use the split MACHINE_LEARNING_PRELOAD vars, don't set DB_VECTOR_EXTENSION, OAuth goes through Authentik over HTTPS, and the GPU node's CPU meets the new x86-64-v2 requirement). The Immich Postgres image moves to VectorChord 0.4.3 to match the upstream v3 reference stack (0.3.0 is still within v3's supported range '>=0.3 <2'; Immich upgrades the extension itself at startup). Both photo frames switch to ImmichFrame's immich_v3 compatibility tag because every versioned ImmichFrame release (≤ v1.0.33.0) crashes deserializing Immich v3 API responses; repin to a versioned tag once upstream ships stable v3 support. Deployment images are Keel-managed (KEEL_IGNORE_IMAGE, policy=patch), so this commit is the source-of-truth record; the live rollout happens via kubectl set image in the same session. Pre-upgrade pg_dumpall taken (job postgresql-backup-pre-v3). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 14:18:22 +00:00
Viktor Barzin	6f03ccd1aa	excalidraw: grant emo-browser SA port-forward for drawing uploads Viktor asked to fix emo's permission so his Claude can upload to the Excalidraw service. emo's recent sessions show the documented upload recipe (kubectl port-forward svc/draw + X-Authentik-Username header, from his ~/.claude/CLAUDE.md) failing with: pods/portforward forbidden for system:serviceaccount:chrome-service:emo-browser in namespace excalidraw because his default kubeconfig is the read-only emo-browser SA (its port-forward grant covers only chrome-service) and his old admin kubeconfig at /home/emo/code/config expired and was removed. Add a namespace-scoped Role (pods/portforward create) + RoleBinding for that SA in the excalidraw namespace, mirroring the 2026-06-28 chrome-service grant. Trade-off (any-user drawings via the trusted username header) documented in the file and accepted. Also record the grant in docs/architecture/chrome-service.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 11:08:28 +00:00
Viktor Barzin	88c86e2109	ci: Slack-notify failed pipeline runs only Viktor doesn't want a Slack message for every CI run — only failures. The infra apply pipeline posted a status line to #general on every push, and the renew-tls / postmortem-todos / registry-config-sync / pve-nfs-exports-sync crons posted on every scheduled run (~30+ routine messages a week). Now: the apply pipeline's success post is gone (notify-failure already covers failures), all cron notifies are status:[failure] with explicit FAILED texts, and drift-detection is silent when all stacks are clean (still posts drift findings and errors, and gains a hard-failure catch step it previously lacked). Kept: notify-nonadmin-push (org audit feed) and the actionable provision-user post. Per-app deploy template in ci-cd.md updated to match. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 07:27:43 +00:00
Viktor Barzin	a64d2ba2b9	upgrades: fix hourly gotenberg error + cap update notifications at weekly Viktor was getting upgrade-error Slack messages every hour and wants update notifications at most weekly. Root cause of the errors: Keel kept trying to roll gotenberg 8.25->8.25.1 in paperless-ngx but kyverno's require-trusted-registries denied it — gotenberg/* (and apache/*, which tika will hit next) were never allowlisted, and Keel's Slack notifier at info level re-posted the identical failure to #general on every hourly poll since Jun 28. Changes: allowlist gotenberg/* + apache/* so the patch applies cleanly; disable Keel's direct Slack notifier and replace failure visibility with a KeelUpdateFailing Loki-ruler alert (alert-on-change: one notification plus the daily digest, never an hourly drip); remove diun's Slack notifier whose default message @channel-pinged #image-updates for every new upstream tag every 6h (the n8n upgrade-agent webhook feed is untouched). The k8s upgrade report is already weekly (Mon 06:07 UTC). Paperless-ngx itself stays paused (keel policy=never, user-managed) while the ingest runs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 07:16:50 +00:00
Viktor Barzin	5d5d9752cb	guard: ignore + git-crypt kubeconfig files so they can't leak to the public mirror A GitGuardian audit of the infra repo showed the recent alerts were test fixtures (false positives), but surfaced a real historical leak: a cluster-admin kubeconfig was once committed as stacks/f1-stream/.../.config (now expired, reachable only via a GitHub PR ref). The .gitignore already had a `config` rule for kubeconfigs but missed the dotfile form `.config` — which is exactly how that file slipped onto the public mirror. Close the gap in two layers: - .gitignore: also ignore `.config`, `kubeconfig`, `.kubeconfig`, `admin.conf`, `.kube/` so they're never staged by accident. - .gitattributes: route `.config`, `kubeconfig`, `.kubeconfig`, `admin.conf` through git-crypt so a force-add or rename still lands as ciphertext (never plaintext) on the public GitHub mirror. No tracked files match these names today, so there is zero retroactive impact — purely forward-looking prevention. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 07:14:58 +00:00
Viktor Barzin	dab307f9f8	Merge remote-tracking branch 'origin/master'	2026-07-02 05:39:15 +00:00
Viktor Barzin	f1e81772d5	broker-sync: repoint image to ghcr (was frozen on pre-migration DockerHub) The nightly ibkr sync failed with 'No such command ibkr': every broker-sync CronJob still pulled viktorbarzin/broker-sync:latest from DockerHub, which nothing has pushed to since the ADR-0002 move to GHA->ghcr on 2026-06-13 — the jobs were silently running a frozen pre-ibkr build. The migration had allowlisted only the wealthfolio namespace for the private ghcr.io/viktorbarzin/wealthfolio-sync image, so broker-sync also lacked pull credentials. Repoint the image, add ghcr-credentials imagePullSecrets to all eight CronJobs, and allowlist the broker-sync namespace (wealthfolio stays — its own monthly sync pulls the same image). Related: code-9ko8. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 05:31:00 +00:00
Viktor Barzin	ac41e7c017	nvidia: run advertise-gpumem provisioner under bash (dash rejects pipefail) First apply of ADR-0016 failed: terraform local-exec defaults to /bin/sh, which on Ubuntu is dash — 'set -euo pipefail' exits 2 before running kubectl. Pin the interpreter to bash. Everything else in the gpumem apply succeeded. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-02 05:21:47 +00:00
Viktor Barzin	968b2b9c64	Merge remote-tracking branch 'origin/master' into wizard/gpu-vram-budget	2026-07-02 05:18:34 +00:00
Viktor Barzin	a12b09af04	broker-sync: pin data-mounting CronJobs to k8s-node4 (stop nightly RWO wedge) All broker-sync CronJobs share one RWO proxmox-lvm volume. With free scheduling the nightly 02:00-04:15 runs land on different nodes, forcing a detach/attach cycle whose QMP hotplug intermittently ghost-attaches on disk-heavy VMs — every job then sits in ContainerCreating for hours (happened 2026-06-30, 07-01 and again 07-02; fires PodsStuckContainerCreating and skips the day's trade syncs). Pinning all seven volume-mounting jobs to k8s-node4 (fewest CSI disks, 11) makes the volume attach once and stay put — no hotplug dance, no wedge. version_probe mounts nothing and stays unpinned. Durable fix for the recurrence tracked in beads code-9ko8. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 05:16:38 +00:00
Viktor Barzin	3c85af2dc2	fire-countdown dashboard: SQL guards + tax regime + honesty fixes From the flaw-hunt workflow (all verified): - Projected-FIRE-date panels (solo/household/family) now guard savings £/yr: 0 / empty / negative all render "Set savings £/yr" instead of a blank tile, a SQL error, or a nonsensical past date ("Jan 1849"). Verified across cases. - New "Tax regime" panel surfaces the per-country jurisdiction — 14/22 countries fall back to the neutral 'nomad' 1% assumption, which was previously invisible. - Intro no longer hard-codes "£139k pension" (contradicted the £328k tranche panel); pension value is now only shown data-bound in the tranche panel. - Intro adds caveats: Anca's spend is an estimate (pending live re-pull), and non-modelled countries use the nomad tax fallback. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-01 22:44:17 +00:00
Viktor Barzin	339f5d89b9	onlyoffice: decommission (stack destroyed, dir removed) The document server had been deliberately scaled to 0/0 for 184 days, but its ingress kept the uptime-kuma monitors alive, so 'onlyoffice down' showed up in every daily alert digest. Viktor approved tearing it down. terragrunt destroy ran clean (11 resources) before this commit; the kuma monitors auto-prune with the ingress. Also drops the onlyoffice/* image prefix from the kyverno trusted-registries allowlist, the service-catalog rows, and updates the nextcloud collabora comment. Document data (if any) remains on the PVE NFS share. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-01 22:35:22 +00:00
Viktor Barzin	3c476dab32	postiz+portal: remove broken alert sources (stale backup CronJob, bogus scrape annotations) Viktor is getting daily Slack alert noise; these two were the recurring generators. The postiz-postgres-backup CronJob still dumped from the old in-namespace postiz-postgresql service that was removed in the CNPG migration (2026-06-28) — it failed every night at 03:00 and re-fired BackupCronJobFailed each day. The postiz DB now lives on the shared CNPG cluster and is already covered by the dbaas per-db dumps, so the CronJob (and its NFS backup volume) is redundant and removed rather than repaired. portal-stt/portal-tts advertised prometheus.io scrape annotations that never worked: the deployed Speaches build 404s /metrics, and openai-edge-tts has no metrics at all (its annotation pointed at a JSON endpoint, which fails exposition parsing regardless). Both produced a permanently firing ScrapeTargetDown. Annotations removed until the apps actually serve metrics. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-01 22:35:21 +00:00
Viktor Barzin	5a312563c6	monitoring/wealth: dash the in-progress year on the hourly-rate panel The current, still-accruing calendar year read misleadingly high (e.g. 2026 at 5 months showed £149/h gross, above all of 2025) because the full-year bonus - paid every March - plus front-loaded quarterly RSU vests get divided by only the months worked so far. It settles lower as the year completes. Split each line into a solid series (complete years) and a dashed series (the latest, still-accruing year), so the provisional point is visually flagged. The split auto-detects the in-progress year (latest year with < 12 months of payslips), so it needs no per-year maintenance. Panel description now explains the caveat. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 12:45:51 +00:00
Viktor Barzin	28984dda9a	monitoring/wealth: add per-year effective hourly-rate panel (gross vs net) Viktor wanted to see, on the wealth dashboard, the hourly wage he earned each year - both gross and net - with year on the X axis. New timeseries (line) panel "Effective hourly rate - gross vs net": - hourly = annual pay / hours worked; hours = contractual 40h/week (2,080h per full year, confirmed from the Facebook/Meta UK offer letter: Mon-Fri 09:00-18:00 less a 1h lunch), prorated by the months actually worked so partial years (2019, 2020, 2026) read correctly. - Gross = gross_pay incl. notional RSU vest; Net = take-home. - timeFrom 10y so all years show under the dashboard's default 180d range. Source data: a duplicate March-2023 payslip (Paperless doc 347, a re-upload of doc 33) was removed separately, so 2023 is no longer double-counted; this also corrects the existing net-pay panel. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 12:28:46 +00:00
Viktor Barzin	82371d1ef8	dbaas/mysql: innodb_doublewrite=DETECT_ONLY to halve page-flush writes MySQL device-write investigation (code-oflt): after the nextcloud webcal throttle settled (the earlier 3.4-8.8 MB/s were post-restart transients), MySQL is ~1.74 MB/s at the InnoDB level — and HALF of that (~0.86 MB/s, ~55 pages/s) is the doublewrite buffer writing every flushed page twice. Redo is negligible (0.01 MB/s), no temp-table spilling. Set innodb_doublewrite=DETECT_ONLY (dynamic, no restart; persisted in the cnf): InnoDB stops writing full page CONTENT to the doublewrite buffer (~halves MySQL's page-flush writes on the IOPS-bound sdc) but keeps torn-page DETECTION metadata — a crash-torn page is flagged on recovery (restore from the daily mysqldump) rather than silently corrupt. Chosen over full OFF: same write saving, keeps detection, and OFF requires a shutdown ("cannot change to OFF if doublewrite is enabled"). Acceptable risk given the PERC BBU cache + UPS (in-flight writes complete on power loss) + daily per-db backups. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 08:47:09 +00:00
Viktor Barzin	fbae573664	state(dbaas): update encrypted state	2026-06-30 08:46:45 +00:00
Viktor Barzin	71501be408	nodes: journald -> volatile (RAM) to cut sdc write-IOPS Node "container churn" investigation (code-oflt): container logs (~30 KB/s) and overlayfs (~17 KB/s) are negligible; the node OS-disk churn is ext4 journal (jbd2) metadata writes driven mostly by journald's continuous appends. node4 + node5 had drifted to uncapped persistent journald (4 GB each, ~100 KB/s); master/node1-3 were correctly capped at 500M. Node + pod journals already ship to Loki (alloy loki.source.journal), so on-disk journald is pure write-IOPS overhead on the IOPS-bound sdc. Switch journald to Storage=volatile (RAM, RuntimeMaxUse=200M) fleet-wide: - cloud_init.yaml: drop-in 90-oflt-volatile.conf for new nodes (replaces the old persistent seds). - running nodes (master + node1-5): pushed the same drop-in via qm guest exec + journald restart + cleared /var/log/journal. Verified node5: OS-disk writers jbd2/sda1-8 931->46 KB/s, systemd-journal gone (~94% drop); ~4 GB freed each on node4/node5. Logs stay queryable in Loki. Trade-off: a hard crash loses the last unshipped journal. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 08:15:38 +00:00
Viktor Barzin	74819d4061	feat(nvidia): GPU VRAM budget + watchdog to stop T4 overallocation The single time-sliced Tesla T4 has no per-tenant memory isolation, so its ~9 GPU workloads can collectively overallocate VRAM. On 2026-06-02 immich-ml's onnxruntime arena grew to 10.7 GB and silently starved llama-swap, breaking recruiter-responder for ~5h. Viktor asked for memory protection so we don't overallocate GPU memory, and chose to do it at the scheduling level (no device-plugin swap) after weighing HAMi and MPS. Make the scheduler VRAM-aware and add runtime teeth, all repo-native, time-slicing untouched: - Advertise a node extended resource viktorbarzin.me/gpumem (~14000 MiB) via a reconcile null_resource (immediate, apply-time) + hourly re-assert CronJob. - Each always-on GPU tenant declares a gpumem budget (immich-ml 3000, llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500; sum 13300 <= advertised) so the scheduler refuses to co-schedule past the card (overflow -> Pending). - gpu-vram-watchdog Deployment recycles the biggest over-budget tenant ONLY when actual free VRAM < floor. Ships DRY_RUN=true (observe-then-enforce); flip to false after a few cycles look right. - Prometheus alerts GPUVRAMLow / GPUVRAMTelemetryDown / GPUVRAMWatchdogDown -- the 2026-06-02 post-mortem's never-built free-VRAM follow-up. - Docs: ADR-0016 (records why HAMi/MPS were rejected), CONTEXT.md GPU-sharing glossary; fix the stale "whole T4 / scale immich-ml to 0" llama-cpp comment. HITL GPU-node change: apply nvidia FIRST (advertise gpumem), verify the node shows the capacity, THEN the consumer stacks -- the cutover bounces GPU pods. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:57:40 +00:00
Viktor Barzin	1afe41880e	docs: MySQL buffer-pool/limit + nextcloud webcal throttle; VCT drift fixed Reflect the code-oflt MySQL write-reduction work (commit `82c9e69b` + the nextcloud webcal app-data throttle): - MySQL row: buffer pool 1->2Gi, mem limit 4->6Gi, and the nextcloud webcal calendar churn that was ~60% of MySQL's writes (now throttled in oc_calendarsubscriptions.refreshrate — app-data, can regress). - CNPG apply-gotcha note: the mysql_standalone VCT-annotation drift no longer needs -target dodging (now ignore_changes'd on the STS VCT). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:56:04 +00:00
Viktor Barzin	82c9e69b77	dbaas/mysql: 2Gi InnoDB buffer pool + 6Gi limit + ignore VCT drift Cut MySQL's write-IOPS footprint on the contended PVE sdc HDD (code-oflt). Standalone MySQL was the #1 sdc bandwidth writer (~2.8-3.5 MB/s). Live attribution found ~60% of its writes were nextcloud webcal calendar churn (throttled separately at the app layer); this addresses write amplification on the remainder: - innodb_buffer_pool_size 1Gi -> 2Gi: the pool was too small for the ~5.6Gi hot set (Innodb_buffer_pool_wait_free=1.78M = threads stalling for a free page -> constant flush-to-make-room write IOPS). - container memory limit 4Gi -> 6Gi (requests 3->4Gi): the pod was already at ~3.7Gi/4Gi (near OOM) with the 1Gi pool, so the 2Gi pool needs the headroom. One-time MySQL pod restart to apply. - ignore_changes on the StatefulSet volume_claim_template: the VCT is immutable post-creation and pvc-autoresizer rewrites its annotations on the live object, so TF's desired VCT could never apply and errored every broad dbaas apply. Ignoring it (autoresizer owns PVC sizing) removes the long-standing need to -target around it. Applied + verified live: buffer_pool=2.0GiB, limit=6Gi, pod healthy, 24 DBs reachable, restart clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:55:18 +00:00
Viktor Barzin	29bf275cef	state(dbaas): update encrypted state	2026-06-30 07:53:48 +00:00
Viktor Barzin	308a174ad6	docs(networking): record MetalLB .204 (frigate-rtsp go2rtc) allocation PR #17 moved frigate-rtsp to a dedicated MetalLB LoadBalancer IP (10.0.20.204) exposing RTSP 8554 + WebRTC 8555, but the networking doc still listed only four IPs in use / three dedicated. Add the .204 row to the allocation table, bump the counts (five in use, four dedicated, 5-IP layout), and add a LB-IP renumber-checklist entry for the out-of-band consumers (the go2rtc WebRTC candidate on the frigate config PVC and the HA-sofia rtsp_url_template). Note go2rtc cannot use a DNS name in ICE candidates, so the Service annotation is the single source of truth. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:42:27 +00:00
ebarzin	469cdd7507	frigate: expose go2rtc on a dedicated MetalLB LB IP (RTSP 8554 + WebRTC 8555) HA live video from the cluster Frigate hangs/fails because the only path to Frigate is the Traefik HTTP(S) ingress (frigate-lan -> 10.0.20.203), which cannot carry RTSP or WebRTC. The container already listens on 8554+8555 but only RTSP had a Service (NodePort), and WebRTC (8555) was never exposed. Convert frigate-rtsp to a LoadBalancer on a dedicated MetalLB IP (.204, ETP=Local, pod pinned to the GPU node) carrying RTSP 8554 + WebRTC 8555 (TCP+UDP), giving HA Sofia + LAN browsers a stable cross-VLAN endpoint for native HLS/WebRTC live (parity with the Hikvision NVR). Companion non-Terraform steps are in the PR body. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:15:22 +00:00
Viktor Barzin	9ea9cae073	rightsize: reconcile batch-2/3 stacks blocked by killed #427 (job-hunter, wealthfolio, f1-stream) Memory limits were committed (batch 2/3) but pipeline #427 was killed mid-apply and the local homelab tf apply hit a stale backend-init; this comment-only diff re-triggers a clean CI apply for the three stacks so live matches master (job-hunter 768Mi, wealthfolio 512Mi, f1-stream 384Mi). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 15:59:41 +00:00
Viktor Barzin	7cc9cde5b1	external-secrets: enable ESO Vault token cache to cut sdc write churn Add --enable-vault-token-cache to the ESO controller (a graduated, non-experimental flag in chart 2.6.0). Until now ESO authenticated to Vault with login -> lookup-self -> revoke-self on every secret fetch. Across 92 ExternalSecrets refreshing every 15m that measured ~0.22 logins/s + ~0.22 revoke-self/s on the active Vault member, and each cycle is a token create+revoke (plus its lease) written to the Raft log on all three members. Those fsync-heavy writes land on the contended PVE RAID1 7200rpm HDD (sdc) -- one of the write sources behind the recurring control-plane flaps (code-oflt write-reduction). The eso kubernetes-auth role already issues a 240h periodic, unlimited- use token, so the churn was pure waste: ESO discarded a perfectly good token after a single use. With token caching ESO mints one token and reuses/renews it, collapsing logins from ~13/min to a handful per token lifetime. Verified live: vault cache initialized, 112/113 ExternalSecrets Ready (the one failure, instagram-poster, is pre-existing data drift unrelated to auth), logins dropped to ~0 after warm-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 15:32:37 +00:00
Viktor Barzin	5e384ed762	state(external-secrets): update encrypted state	2026-06-29 15:32:37 +00:00
Viktor Barzin	bc626a2d89	rightsize: raise OOM-tight memory limits (batch 3/N — spike protection) shlink 512->704Mi, linkwarden 1Gi->1280Mi, chrome-service 2Gi->2624Mi, forgejo 4Gi->5Gi, f1-stream 256->384Mi. All were request==limit with 30d peak at 91-100% of the ceiling — a spike would OOM-kill them. Raising the limit (now Burstable, request<limit) gives real burst headroom. This is the genuine 'don't OOM on occasional spike' fix. Small add (~2.2Gi limits) vs the ~20Gi of fat removed in batches 1-2, so net overcommit keeps dropping. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 15:28:11 +00:00
Viktor Barzin	418d1efb4b	rightsize: trim over-provisioned memory (batch 2/N) claude-agent-service 12Gi->3Gi (peak 585Mi — the single biggest fat, ~9Gi of limit-overcommit removed), job-hunter 1280->768Mi (kept chromium headroom; 30d peak 118Mi), fire-planner 1024->320Mi, wealthfolio 1Gi->512Mi (kept history-growth headroom). Burstable, limits kept >= generous peak headroom, never below peak. ~10.7Gi of limit overcommit removed. paperless-ai intentionally LEFT at 4Gi (documented in-process RAG model load). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 15:27:17 +00:00
Viktor Barzin	a3f2c2947a	docs: refresh CNPG tuning note (archive_timeout=0, commit_delay, zstd) + apply gotcha Reflects the write-reduction params applied in `c3553731`, and documents the null_resource trigger-bump + targeted-apply gotcha so the next agent doesn't hit the inert-change / mysql-VCT-drift traps. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 15:17:38 +00:00
Viktor Barzin	ec04963bfe	state(dbaas): update encrypted state	2026-06-29 15:16:50 +00:00
Viktor Barzin	c3553731c7	dbaas: CNPG write-reduction — archive_timeout=0, commit_delay, wal_compression=zstd Part of code-oflt (cut sdc write IOPS before the SSD move; analysis #6922). - archive_timeout 300->0: CNPG forces archive_mode=on but .spec.backup is empty (no ObjectStore), so a 16MB WAL segment switch every 5min shipped NOWHERE = ~4.6 GB/day of pure-waste WAL on the contended sdc. archive_mode stays CNPG-on (reserved); 0 just stops the timed switch. Daily pg_dump cron unchanged. - commit_delay 0->2500us: group-commit coalesces concurrent fsyncs. SAFE for every DB incl financial -- data still fsynced before COMMIT acks, only <=2.5ms added latency under concurrency. - wal_compression pglz->zstd: ~30-50% smaller full-page images. All sighup-reloadable. Applied via targeted apply of module.dbaas.null_resource.pg_cluster (trigger bumped) to avoid the pre-existing mysql VCT drift that breaks broad dbaas applies. Refs: code-oflt. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 15:16:38 +00:00
Viktor Barzin	5d059786a1	rightsize: trim over-provisioned memory limits+requests (batch 1/N) claude-breakglass 4Gi->512Mi, stirling-pdf 1536->512Mi, insta2spotify 2Gi->256Mi, recruiter-responder 768->256Mi. These idle/utility services had memory LIMITS sitting 4-15x above their 30d peak, inflating cluster limit-overcommit to 142% across the 5 post-node6 nodes. Burstable (request<limit), limits capped at ~peak x1.5 (never below peak), so no OOM risk (verified zero OOMKills cluster-wide in 30d). Reduces phantom limit overcommit + frees scheduler requests. Follows the 3-reviewer adversarial review: raising limits on an already-overcommitted cluster worsens correlated node-OOM; the real fix is trimming the fat. Limits only lowered where peak is far below; tuned/DB/GPU limits untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 14:46:58 +00:00
Viktor Barzin	4473b469e3	lvm-pvc-snapshot: cut retention 7->3 days (reduce sdc thin-pool CoW IOPS + free ~1TB) Part of the sdc IOPS-reduction work (code-oflt). 462 daily thin snapshots (66 PVCs x 7d) drive ~10-34 w/s of thin-pool metadata (tmeta) CoW writes on the contended sdc spindle and pin ~2TB in the 70%-full pool. Halving to 3 days roughly halves both. Instant-restore window shrinks 7->3d; daily-backup still keeps 4 weeks of file-level PVC history, so DR coverage is unchanged. Deployed to the PVE host via scp (these host scripts are scp-deployed, not TF-managed). Doc updated in .claude/CLAUDE.md. Refs: code-oflt. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 12:59:16 +00:00
Viktor Barzin	256122ff5b	monitoring: make ClusterCannotTolerateNonGpuNodeLoss topology-agnostic The N-1 capacity alert was hardcoded to k8s-node[234]/[1234], predating node5/node6 (added 2026-05-26) and the 2026-06-29 removal of node6 — so it no longer reflected the real cluster and gave no trustworthy N-1 signal. Generalize node selection via metrics: GPU node by nvidia_com_gpu capacity, drained/cordoned by kube_node_spec_unschedulable, down by the Ready condition. Control-plane excluded by name (node!~"k8s-master.*") because this cluster's kube-state-metrics exposes neither kube_node_role nor node taints/labels (verified live). Also fixes a latent bug (multiplying by kube_node_spec_unschedulable==0 zeroed the result) and refreshes the remediation text (krr, not the removed Goldilocks). With node6 gone the rule now correctly evaluates LHS 31.0Gi > RHS 27.9Gi (fires) — the honest signal that removing node6 tightened requests-based N-1; trimming the inflated requests clears it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 12:34:01 +00:00
Viktor Barzin	6c3619c9c6	state(dbaas): update encrypted state	2026-06-29 12:26:21 +00:00
Viktor Barzin	682b982c78	state(dbaas): update encrypted state	2026-06-29 12:25:53 +00:00
Viktor Barzin	c0e0911afa	dbaas: bump pg_cluster trigger so the checkpoint/WAL params actually apply `a2c8f906` added checkpoint_timeout=15min + max/min_wal_size to the CNPG Cluster YAML, but the cluster is applied via null_resource.pg_cluster + local-exec kubectl apply, which only re-runs when its `triggers` change. The YAML edit didn't bump a trigger, so the change was inert and never applied (incl. via CI). Bump the pg_params trigger so the kubectl apply re-runs and CNPG hot-reloads the new params (reloadable, no restart). Landing it via a targeted apply (-target=null_resource.pg_cluster) to avoid 3 pre-existing unrelated drifts in this stack -- notably a mysql_standalone volumeClaimTemplate annotation diff the apiserver rejects as immutable, which is what fails broad dbaas applies (and silently blocked `a2c8f906`). Refs: code-oflt. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 12:25:37 +00:00
Viktor Barzin	bebe8fbd74	workflows: add read-only memory-overcommit + node-removal capacity analysis Reusable Workflow script that audits whether the cluster is memory-overcommitted and whether a single k8s worker can be removed to return RAM to the PVE host without sacrificing N-1 failover. Read-only throughout: gathers PVE host memory (qm config / free / KSM via SSH), k8s per-node capacity + cluster 30d peak working set, and per-workload right-sizing, then models N-1 two ways (physical actual-usage and scheduling-by-request) and adversarially verifies the conclusion with 3 skeptics. Sizes requests (scheduling reservation) and limits (OOM ceiling) as SEPARATE knobs — an earlier ad-hoc pass conflated them by sizing requests to 30d peak, which manufactured a false N-1 shortfall. Invoke via Workflow {scriptPath}, or by name when cwd is the infra repo. Requested by Viktor: identify memory overcommit and whether deployment requests can be trimmed to free PVE host RAM by removing a node, without sacrificing service reliability. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 12:06:17 +00:00
Viktor Barzin	a2c8f906ec	dbaas: stretch CNPG checkpoint timer 5->15min + raise WAL size (cut sdc write IOPS) Viktor asked to reduce CNPG checkpoint/WAL writes as part of the sdc IOPS-isolation work (code-oflt). The IOPS deep-dive found CNPG checkpoints fire 100% on the 5-min timer (checkpoints_timed >> checkpoints_req), each triggering a full-page-write burst + flush onto the contended 7200rpm sdc spindle -- a top write-IOPS source after etcd. Set checkpoint_timeout=15min + max_wal_size=4GB + min_wal_size=1GB so checkpoints fire ~1/3 as often (fewer FPW) and WAL segments are recycled rather than churned. All three are sighup-reloadable -> CNPG applies them without a restart or failover. checkpoint_completion_target stays 0.9 so each checkpoint's IO is still smeared across the interval. Bounded recovery-time tradeoff (more WAL to replay on crash), acceptable for the write relief. wal_compression left at pglz ('on') pending image zstd-support verification. Also refreshes the stale CNPG tuning note in .claude/CLAUDE.md (it listed shared_buffers=512MB / effective_cache_size=1536MB / 2Gi; live is 1024MB / 2560MB / 3Gi). Refs: code-oflt (etcd/sdc IO isolation). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 11:41:09 +00:00
Viktor Barzin	3398873a16	k8s-upgrade: move version-check cadence from daily to weekly (Sun check, Mon report) Viktor asked to move the upgrade checks to weekly. With the actionable-vs-held gate now quieting the routine 'held' churn (e.g. 1.36), a daily check + attempt buys little; weekly is enough. Accepted trade-off: k8s patch (incl. security) uptake now lags up to 7 days instead of <=1. - var.schedule: 0 23 * * * -> 0 23 * * 0 (detector: weekly Sunday 23:00 UTC) - var.report_schedule: 7 6 * * * -> 7 6 * * 1 (report: Monday 06:07 UTC, ~7h after the Sunday check, so nightly-report.py's ~25h staleness threshold stays valid AND still flags a missed weekly run; no STALE_SECONDS change needed) The report CronJob keeps its historical name k8s-upgrade-nightly-report (rename = churn). Cadence wording updated across main.tf comments, nightly-report.py docstring, and the runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 06:22:20 +00:00
Viktor Barzin	e43e64c666	kyverno: disable reports-controller to stop etcd ephemeralreport load Viktor flagged not wanting to wear the single non-RAID SSD with useless etcd writes if etcd moves there. Investigation found the avoidable load is kyverno reporting: the 2026-06-12 etcd-load-reduction disabled the report features but left the reports-controller running (default --enableReporting + --validatingAdmissionPolicyReports=true), so the 2026-06-21 kyverno upgrade left a one-time pile of ~10.5k cluster/namespaced ephemeralreports (~114MB in etcd) that nothing reaps (aggregation off). Listing that range starves etcd's fdatasync enough to flap the apiserver (observed live 2026-06-28). Disable the reports-controller outright (reportsController.enabled=false), completing the 2026-06-12 intent. Reports are not consumed (violations surface via Loki->Slack); admission enforcement (deny-* policies) and Keel mutation are independent of it. The ~10.5k stale reports already in etcd are cleared separately (throttled, out-of-band) since bulk-deleting them is itself etcd-heavy. Refs: code-oflt (etcd IO isolation), code-at4f (etcd starvation alerting). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 05:35:36 +00:00
Viktor Barzin	cf42042cba	monitoring: re-trigger apply to persist state after CI cancel-race No-op comment touch in loki.tf to force a clean `terragrunt apply monitoring`. The pfSense egress-monitoring apply (commit `7fe2d978`, CI pipeline #414) was cancelled by a newer push and SIGKILLed mid-helm-upgrade: the live resources applied (probes green, rules loaded) but the Terraform state write and the helm release finalize were lost, leaving the prometheus release stuck in pending-upgrade (manually unstuck). This commit re-applies the unchanged monitoring stack so state matches live, with zero resource changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 16:58:49 +00:00
Viktor Barzin	f92075b7c5	fire-planner: solve FIRE targets to age 100 (horizon 60→72) Viktor plans to live to 100, so the portfolio must last that long. The fire-targets CronJob was solving a 60-year horizon (≈ to age 88); set it to 72 (retire ~age 28 → age 100). Raises every case's FIRE number modestly (more years to fund). A one-off in-cluster job re-solves the existing rows at the new horizon. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 16:49:20 +00:00
Viktor Barzin	7fe2d9780e	monitoring: add pfSense WAN/egress alerting + probes On 2026-06-27 pfSense (Proxmox VMID 101) stopped passing internet egress for ~20 min while internal routing + Unbound stayed up; recovery needed a manual reboot and NOTHING alerted — there was no egress probe and the cloudflared replica metric stayed green. Add first-class egress monitoring so the next occurrence pages in ~2 min instead of being noticed by a human. - blackbox-exporter: new icmp_egress + dns_external probe modules (+ NET_RAW so ICMP can use raw sockets). - Three in-cluster probe jobs exercising the pod->node->pfSense-NAT path that failed: wan-gateway-icmp (192.168.1.1), internet-egress-icmp (9.9.9.9 + 1.1.1.1), internet-egress-dns (cloudflare.com via both resolvers). - Prometheus alerts (group "Egress / pfSense"): WANGatewayUnreachable, InternetEgressDown (both providers dead), ExternalDNSResolutionDown, EgressOnlyDivergence (reuses the existing t3-probe legs — the incident's exact "external down while internal up" signature), PfSenseVMDown. - Loki ruler: CloudflaredTunnelConnLoss — the canary that fired first; the cloudflared replica metric is blind to tunnel-connection loss. Threshold calibrated against live Loki (steady-state ~2/6h vs 37-85/5m in-incident). - Alertmanager inhibit: WAN/egress-down suppresses the downstream egress symptom alerts so one root alert pages, not a storm. - Runbook docs/runbooks/pfsense-egress.md + .claude/CLAUDE.md. All metric names + the cloudflared threshold verified against live Prometheus/Loki. Pure GitOps, no pfSense change. Firewall-side hardening (dpinger retargeting, failover gateway, pfSense syslog -> Loki) is deferred and documented in the runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 16:46:30 +00:00
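
Several of the commit messages above quote derived figures (~4.6 GB/day of WAL, 462 snapshots, a horizon reaching age 88 or 100). The sketch below re-derives them as a sanity check. It is not code from the repo: the variable names are illustrative, and it assumes every timed WAL switch costs one full 16 MB segment, as the c3553731c7 message describes.

```python
# Cross-check of figures quoted in the commit messages above.
# All names here are illustrative; none are taken from the infra repo.

SECONDS_PER_DAY = 24 * 60 * 60

# c3553731c7: archive_timeout=300 forced a WAL segment switch every 5 minutes.
# Assumption: each forced switch costs one full 16 MB segment.
wal_segment_mb = 16
archive_timeout_s = 300
switches_per_day = SECONDS_PER_DAY // archive_timeout_s
wal_waste_gb_per_day = wal_segment_mb * switches_per_day / 1000

# 4473b469e3: one thin snapshot per PVC per retained day.
pvc_count = 66
snapshots_at_7d = pvc_count * 7
snapshots_at_3d = pvc_count * 3
retained_fraction = snapshots_at_3d / snapshots_at_7d

# f92075b7c5: the horizon is counted from a retirement age of about 28.
retire_age = 28
end_age_old = retire_age + 60
end_age_new = retire_age + 72

print(f"WAL switches/day: {switches_per_day}, waste: {wal_waste_gb_per_day:.1f} GB/day")
print(f"snapshots: {snapshots_at_7d} -> {snapshots_at_3d} ({retained_fraction:.0%} retained)")
print(f"horizon end age: {end_age_old} -> {end_age_new}")
```

Under these assumptions the numbers agree with the messages. The 7->3 day retention cut keeps 3/7 of the snapshots, slightly fewer than half, which is consistent with the "roughly halves" wording.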

4693 commits