infra

Author	SHA1	Message	Date
Viktor Barzin	0472f67d49	t3code: devvm dispatch + auto-pair service (Go) Routes X-authentik-username -> per-user t3 instance; on no t3_session cookie, mints a pairing token (as the OS user) and exchanges it at /api/auth/bootstrap, injecting the session cookie. Listens :3780, reads /etc/t3-serve/dispatch.json. Constants from the Task-1 auth-contract spike. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	72aba7da32	t3code: reconcile per-user t3 instances from /etc/ttyd-user-map Sticky port allocation (3773+), enables t3-serve@<user>, emits /etc/t3-serve/dispatch.json for the dispatch service. systemd timer (OnBootSec+hourly) mirrors the apply-mbps-caps pattern. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	f8a63fdacd	t3code: per-user t3-serve@ systemd template (User=%i file isolation) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 19:24:30 +00:00
Viktor Barzin	2152430b70	docs(t3code): record discovered t3 web-auth contract	2026-06-02 19:24:30 +00:00
Viktor Barzin	5e4f83d4e7	wealth: consolidation chunk 1 — merge NW/contribution/growth, returns table, yearly combo 36 -> 19 panels (chunk 1 of 2), zero metric loss: - 3 NW/contribution/growth timeseries -> 1 "contribution vs market value (+growth)" - 11 returns/Δ stat cards (12mo x3 + Δ 1d/7d/30d/90d all&mkt) -> 1 "Returns over time windows" table (window × Δall/Δmkt/return%) - 2 yearly barcharts -> 1 combo (contributions/market-gain bars + return-% line, timeFrom=10y so full history always shows) All SQL validated live. Chunk 2 (net-pay $grain merge, projection->Trend panel, row reorg) to follow. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 22:27:09 +00:00
Viktor Barzin	a09b0b3612	docs(t3code): implementation plan for per-user auto-provisioning Task-by-task plan pairing with the design doc: Task 1 discovers the t3 web-auth contract (cookie name + bootstrap body), then systemd template, reconcile, devvm dispatch+auto-pair Go service, scoped sudoers, TF repoint. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 22:19:22 +00:00
Viktor Barzin	1a0647c7ed	docs(t3code): design for per-user auto-provisioning (Authentik login → instance + session) Approach 1: /etc/ttyd-user-map as source of truth; per-user t3-serve@.service template (User=%i enforces file permissions); devvm reconcile; devvm dispatch+auto-pair service (mints + injects the t3 session cookie on first authenticated visit, replacing the in-cluster nginx). Spec for review before writing the implementation plan. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 22:10:05 +00:00
Viktor Barzin	55ed50b932	docs(plans): wealth dashboard consolidation design Consolidate the wealth Grafana dashboard 36 -> ~17 panels with zero metric loss: merge the 3 NW/contribution/growth timeseries into 1, the 11 returns/Δ stat cards into 1 returns table, the 2 yearly barcharts into 1 combo, and the 3 net-pay-vs-market-gain panels into 1 (grain dropdown); reorganize into collapsed rows. Also rebuild the projection as a Trend panel (numeric years-from-today x-axis) so it renders regardless of the dashboard time range (fixes empty-by-default). Philosophy: merge duplicates, keep every metric. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 21:52:59 +00:00
Viktor Barzin	73cb0aab8b	t3code: per-user isolation via Authentik + nginx username dispatcher t3 is single-owner (no in-app multi-user), so each person runs their own `t3 serve` on the DevVM as their own OS user: wizard→:3773 (t3-serve.service), emo→:3774 (t3-serve-emo.service). An in-cluster nginx `t3-dispatch` maps the Authentik-injected X-authentik-username to the right instance; unmapped identities get 403 (no shared fallback). Flipped the ingress auth app→required (Authentik forward-auth) — the same-origin self-served UI works behind it (WS carries the Authentik cookie) and t3's own pairing/bearer stays the inner gate. Mirrors the terminal stack's per-user model. Verified: dispatcher routes vbarzin→:3773, emil.barzin→:3774, unmapped→403; t3.viktorbarzin.me now 302s to Authentik. Cross-origin native app / app.t3.codes intentionally unsupported here — deferred until the native app is published. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 21:38:06 +00:00
Viktor Barzin	9fb3e6e851	docs: correct cloudflared-502 post-mortem + fix stale .200 Traefik ref [ci skip] Real root cause of the 2026-06-01 full-site 502 was not a missed reference but an out-of-band fix that Terraform reverted: the 2026-05-30 Traefik .200->.203 migration repointed the Cloudflare tunnel to the Traefik service DNS via the CF Global API Key, but never landed that change in cloudflare.tf (left at .200). A terragrunt apply on 2026-06-01 reconciled live back to the stale .200, breaking all external ingress. Rewrite the post-mortem around the "codify out-of-band fixes or TF reverts them" lesson (a Terraform-Only-rule violation). Also fix docs/runbooks/kms-public-exposure.md, which still claimed Traefik served on 10.0.20.200:443 (now .203) — same migration fallout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-01 21:25:33 +00:00
Viktor Barzin	f807050eb5	cloudflared: fix tunnel origin .200 -> Traefik svc DNS (full-site 502 outage) [ci skip] The Cloudflare tunnel routed *.viktorbarzin.me and the apex to https://10.0.20.200:443, but Traefik moved off the shared MetalLB .200 onto its dedicated 10.0.20.203 on 2026-05-30 (commit `0c01adac`). Nothing serves HTTPS on .200:443 anymore, so cloudflared could not reach its origin (no route to host / i/o timeout) and Cloudflare returned 502 for every externally-proxied service. Internal/LAN access (split-horizon -> .203) was unaffected, which masked the outage. Repoint both ingress rules at the in-cluster Traefik Service DNS (https://traefik.traefik.svc.cluster.local:443) -- the design the docs already described but the code never implemented -- so the tunnel is decoupled from the Traefik LB IP and this cannot recur on a future move. Applied live via targeted apply on the tunnel config resource only; [ci skip] because live already matches and a full stack apply would churn unrelated pre-existing drift (Keel annotations, DKIM re-chunk). Post-mortem: docs/post-mortems/2026-06-01-cloudflared-stale-traefik-origin.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-01 21:22:05 +00:00
Viktor Barzin	30a644d3cd	docs(kms): document reboot-after-uninstall / 1603 handling + real-hardware status The bundled consumer Office removal leaves a pending reboot; a same-run VL install (or re-run before rebooting) fails with setup.exe 1603. Document the two guards (hard-reboot gate + reboot-aware 1603 message), the C2R-log capture, and the on-disk completion poll. Record that the uninstall path is now verified on a real M365 box (O365HomePremRetail removed) and the install needs a reboot first.	2026-06-01 21:22:05 +00:00
Viktor Barzin	a382683c0e	infra: fix containerd forgejo-registry redirect .200->.203 (+skip_verify) Traefik moved off shared .200 to its dedicated .203 on 2026-05-30, but the containerd hosts.toml redirect for forgejo.viktorbarzin.me still pointed at the now-dead .200:443 -> every FRESH forgejo pull failed (cached images kept running, so it stayed hidden until a new image tag was pulled). Retarget to .203 and add skip_verify (node dials Traefik by IP; cert is for forgejo.viktorbarzin.me) in both the new-node cloud-init and existing-node deploy scripts. Already rolled to all 7 nodes (rewrite + restart containerd, no drain). Doc fix in .claude/CLAUDE.md.	2026-06-01 21:22:05 +00:00
Viktor Barzin	82855848d1	plans: TopoLVM migration evaluation (Path 3 for LUN-cap relief) Decision-support doc, NOT a commitment. Evaluates whether replacing proxmox-csi with TopoLVM would lift the per-VM 29-PVC ceiling permanently and at what cost. Key trade-off documented: TopoLVM PVCs are pinned to the node where the LV lives (topology.topolvm.cybozu.com/node). proxmox-csi PVCs migrate between VMs when pods reschedule. The data-locality penalty matters most for single-replica stateful services (MySQL standalone, Nextcloud, Vaultwarden, mailserver, claude-memory, ~30 SQLite-backed apps); replicated services (CNPG PG cluster, Redis-v2, Vault Raft) absorb it. Three disk-layout options: A. Carve per-VM data disks from sdc — simple, no hardware, IO contention unchanged B. Hybrid SSD/HDD — SSD-constrained at 675 GiB free C. Add a dedicated NVMe — also closes beads code-oflt (IO contention), ~£200 hardware investment Effort estimate: 2.5-3 weeks of focused work for the full migration; covers TopoLVM install, lvmd config, per-VM disk provisioning, LUKS plumbing, 5 migration waves (regenerable → huge PVCs), backup-pipeline rewrite, deprecation. Recommended next step before committing: small pilot on k8s-node5/6 with one non-critical PVC to validate the operational pattern end-to-end. Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap, docs/runbooks/scale-k8s-cluster.md (Path 1+2 alternative), beads code-oflt (IO isolation).	2026-06-01 21:22:05 +00:00
Viktor Barzin	599d67db51	docs(kms): self-hosted ODT bootstrapper + anonymous client telemetry (kms-diag/Loki) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 21:22:05 +00:00
Viktor Barzin	f364399ede	wealth: add 30y net-worth projection row + align net-pay panel Implements the committed projections design (docs/plans/2026-05-28-wealth-projections-{design,plan}.md): a collapsed "Projections" row on the wealth dashboard with 5 template vars (rate_low/base/high, monthly_contribution=auto, horizon_years=30), a multi-scenario projection panel (Low/Base/High + trailing-3y historical line + a base-rate compounding-only line), 3 stat cards, and a text panel with one-click future time-range links. Projection is pure SQL over dav_corrected: compound + ordinary-annuity FV from today's net worth; auto contribution = trailing-12mo run-rate (COALESCE/NULLIF so $monthly_contribution=auto doesn't constant-fold 'auto'::numeric). Historical rate = trailing-3-full-year geometric mean of per-year Modified-Dietz returns (~10.4%); all-time was a nonsense 83% because the all-accounts-complete window is only ~4 months, and the true all-time geomean is skewed by 2021's +86%. Also aligns "Net pay vs market gain — per month" to consecutive month-end deltas (same fix as the other monthly panels). Verified all SQL live. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 19:50:41 +00:00
Viktor Barzin	32e1042ca8	t3code: expose `t3 serve` (DevVM) publicly at t3.viktorbarzin.me (app-tier) New stacks/t3code mirrors stacks/terminal: K8s Service + Endpoints → 10.0.10.10:3773 plus an ingress_factory route (dns_type=proxied, auth="app"). t3 ships its own owner-pairing + bearer-session auth, so Authentik forward-auth is intentionally omitted — it would break the cross-origin native mobile app and app.t3.codes (bearer-only, no Authentik cookie). CrowdSec + anti-AI (both default-on for app-tier) rate-limit the public surface; t3's pairing is the gate. TLS is auto-synced into the namespace by Kyverno's sync-tls-secret policy. Verified end-to-end: t3.viktorbarzin.me → CF → Traefik → devvm:3773 = 200. Trade-off (public RCE surface behind app-native auth, no Authentik SSO) accepted 2026-06-01 to keep the native app + app.t3.codes working. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 19:50:41 +00:00
Viktor Barzin	c5e4b1ea71	kms: add /diag anonymous telemetry collector behind Anubis carve-out The PowerShell activation scripts POST small JSON diagnostics to /diag so script execution errors are captured. The collector (python:3.12-alpine, ConfigMap-mounted) prints each event to stdout as a KMSDIAG line; the cluster's Loki scrapes pod stdout, making events searchable in Grafana (Loki only — no Slack, no Prometheus). Like /scripts, /diag needs a second ingress_factory carve-out with full_host="kms.viktorbarzin.me" so it bypasses the Anubis PoW challenge that PowerShell/curl can't solve. Without full_host the factory would derive kms-diag.viktorbarzin.me and the carve-out would never match. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 19:50:41 +00:00
Viktor Barzin	3fa9e2409c	runbook: K8s worker scaling for PVC capacity headroom Documents the 6-worker cluster shape (post 2026-05-26 scale-up after the proxmox-csi LUN-cap incident), the six binding constraints (plugin LUN cap at 29/VM, memory commit, sdc IO contention, GPU concentration on node1, PVE host memory, no Terraform management for K8s VMs), and the playbooks for adding/removing workers. Scale-up triggers: - max-node VA count ≥ 25 (~86% of 29 cap) for ≥7 days - cluster memory requests > 90% - LUN-cap incident - planned ≥3 net-new block PVCs when max VA already ≥ 22 Scale-down conditions: - max-node PVC count ≤ 20, memory < 70%/95% for ≥30 days Playbooks lean on scripts/provision-k8s-worker (clones template 2000, cloud-inits, auto-joins) for adds; kubectl cordon → drain → delete node → qm shutdown for removes. Cold-spare option documented. Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap, docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md, beads code-oflt (IO contention long-term fix).	2026-06-01 19:50:41 +00:00
Viktor Barzin	5c77482a8c	fire-planner: LLM_MODEL env var → qwen3vl-4b default (fits in current GPU headroom; immich-ml is holding ~10GB)	2026-06-01 19:50:41 +00:00
Viktor Barzin	fb1e47a20a	nextcloud: re-enable Keel auto-upgrades with occ-upgrade self-heal + live-tag floor Re-enrolls Nextcloud in Keel (opted out after the 2026-05-26 32.0.3->32.0.9 bump stuck the pod in maintenance mode ~22h). Two safeguards engineer around both failure modes: - F1 (interrupted occ upgrade -> 503): nextcloud-watchdog CronJob runs `occ upgrade` + clears maintenance mode when occ reports needsDbUpgrade=true; Job deadline bumped 120->600s so it isn't killed mid-migration. - F2 (helm re-renders a tag below the Keel-bumped live image -> downgrade CrashLoop): chart_values renders the live tag via a plural kubernetes_resources data source (empty-list-on-absence -> floor 32.0.9 on fresh install/DR), so a re-render never downgrades below live. Scope is patch -- Kyverno's shared inject-keel-annotations policy stamps it and its background-controller overrides a TF-set value, and patch == minor for Nextcloud in practice (32.0.x only; major 33 stays manual). Dropped the per-workload keel.sh/policy override resources to avoid perpetual drift; ns enrollment + Kyverno now own the keel annotations like other workloads. Also bumps the external-storage bootstrap Job create timeout 1m->12m to match its own 10m pod-wait, since Keel bumps now roll the pod mid-apply. Verified: Keel auto-upgraded 32.0.9->32.0.10 on apply, entrypoint occ upgrade completed clean (no watchdog needed), pod 2/2, HTTP 200, plan shows no drift.	2026-06-01 19:50:41 +00:00
Viktor Barzin	50d0f1affa	kyverno: strip orphaned keel.sh/match-tag fleet-wide (image-swap fix) The 2026-05-26 migration flipped the keel default force->patch and dropped match-tag from the inject-keel-annotations patch, but Kyverno's add-only mutate can't remove an annotation that's no longer listed -- 194 workloads kept a stale keel.sh/match-tag=true. Under it Keel cross-assigned images in multi-image pods: the blog's nginx<->nginx-exporter images were swapped and the site was down 2026-05-26 -> 06-01 (nginx received the exporter's -nginx.scrape-uri arg and CrashLoopBackOff'd); changedetection was silently swapped (app lost its /datastore PVC + env, ran ephemeral for days). - policy now sets keel.sh/match-tag=null (strips on admission, never re-added) - swept the annotation off all 194 existing workloads (kubectl, no pod restart) - AGENTS.md: documents the strip; post-mortem added. Blog + changedetection un-swapped via kubectl set image (TF-ignored images); both 2/2 and serving 200. Policy already applied via scripts/tg (Tier-1 PG state authoritative). [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 19:50:41 +00:00
Viktor Barzin	769ae7a6d3	traefik: bot-block-proxy buffer 256k + document the real HTTP/2 limit Follow-up to the 64k bump: raised bot-block-proxy large_client_header_buffers to 256k and corrected the rationale. Investigation found the binding limit for browsers is Traefik's HTTP/2 header cap (~64KB, Go maxHeaderListSize, not exposed by Traefik config) — oversized authentik_proxy_* cookie piles are rejected at the h2 layer upstream of bot-block regardless of these buffers. The real fix for >64KB piles is reducing authentik_proxy_* cookie accumulation (or clearing cookies); these buffers only prevent bot-block being a tighter bottleneck for sub-64KB piles + HTTP/1.1 clients. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:27 +00:00
Viktor Barzin	1c165ce5b4	docs(kms): document the consequence-gated edition switch (changepk + ODT) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:27 +00:00
Viktor Barzin	3d28870e25	nextcloud: fix backup retention to sort by name, not mtime The dated backup dirs are named YYYYMMDD_HHMMSS, but the cleanup used `ls -dt` (mtime). `rsync -a` stamps the backup dir with the SOURCE dir's mtime, so the freshest backup didn't sort as newest — the retention step deleted the new backup and kept a stale one. Sort lexically (chronological for these names) and keep the last. Also exclude html/ (the app code, reproducible from the now-pinned image; the real config lives at config/config.php, html/config is empty) so the backup is config+data+custom_apps only → ~4.3G (<5G target). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:26 +00:00
root	84ab4c998c	Woodpecker CI deploy [CI SKIP]	2026-06-01 15:15:26 +00:00
Viktor Barzin	ddd582a28c	backup: stop offsite-copying regenerable data; shrink nextcloud backup; pin nextcloud image The offsite Synology hit 97% — the Backup share grew +670G in a week, traced to the 2026-05-26 change that began mirroring large regenerable services offsite, plus an unbounded nextcloud.log bloating its backups to 87G. - nfs-mirror: re-exclude ollama, prometheus-backup, audiblez, ebook2audiobook (regenerable; live-only on sdc). Keep *-backup DB dumps (real safety copies). - offsite-sync Step 2: nfs-ssd leg is now immich-only; ollama/llamacpp on the SSD no longer ship offsite (re-pullable models). - daily-backup: skip nextcloud/nextcloud-data-proxmox (orphaned pre-encryption PV, still backed up weekly). - nextcloud: cap+rotate the log (log_rotate_size=10MB); the dedicated backup now excludes html/ (app code, from image), logs, and preview cache and keeps only the latest copy (pvc-data holds version history) → <5G (was 87G). - nextcloud: pin image to 32.0.9 in chart_values. A 2026-05-26 Keel bump moved the live pod to 32.0.9 (data migrated to 32.0.9.2) but TF still defaulted to 32.0.3; reconciling that drift this session rolled a 32.0.3 pod that CrashLooped on the downgrade. Pinning eliminates the drift. Docs: backup-dr.md + infra CLAUDE.md updated (add nfs-mirror, new exclusions). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:26 +00:00
Viktor Barzin	0dd4a31eff	docs(immich): cap server-side job concurrency to protect sdc + log recurrence A library-wide Duplicate Detection run on 2026-06-01 fanned the ML/thumbnail backfill out at thumbnailGeneration concurrency 8, saturating the shared sdc HDD and starving etcd -> kube-apiserver down ~30 min (5th IO-pressure incident on sdc). Capped server-side thumbnailGeneration/metadataExtraction/library to 2 in the Immich DB system-config; documented in the Immich row and recorded the recurrence + still-TODO IO-isolation fixes in the 2026-05-25 post-mortem (this also commits that previously-untracked post-mortem). [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 15:15:26 +00:00
Viktor Barzin	af4bfbe046	kms: revert files accidentally bundled into the docs commit The previous commit (81a7d804) swept in 23 unrelated working-tree files because a rebase --autostash had left them staged in the index — including 4 files with leftover git conflict markers (llama-cpp/main.tf, excalidraw/providers.tf, url + wealthfolio .terraform.lock.hcl) from a stale 2026-05-25 stash, which is invalid Terraform. Revert all 23 (terragrunt-generated backend/providers/lock + the llama-cpp markers) to their prior committed state; terragrunt regenerates the generated files on the next run. Net effect of the docs commit is now just the runbook doc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	bdb0cef242	docs(kms): document /keys.json carve-out + script auto-key selection Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	170a3bb052	traefik: bump bot-block-proxy large_client_header_buffers to 8x64k The ai-bot-block forward-auth copies the full request (incl. the accumulated authentik_proxy_<random> cookie pile) to bot-block-proxy. With 30+ Authentik Proxy Providers under viktorbarzin.me the combined Cookie header exceeds openresty's default 4x8k buffers, so the auth check returned 400 "Request Header Or Cookie Too Large" (surfaced as error-pages' "Too big request header" 431) and broke Woodpecker/Forgejo OAuth sign-in for affected browsers. Mirror the existing auth-proxy-config fix: 8x64k accepts the pile. Applied live via tg apply + bot-block-proxy rollout restart. [ci skip] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
Viktor Barzin	6f0bdf2993	kms: carve /keys.json out of Anubis for script auto-key-selection The activation scripts now fetch the published GVLK list from /keys.json to auto-select the right key for the detected edition. Like the .ps1 scripts, that endpoint must bypass Anubis (PowerShell/ConvertFrom-Json can't solve the PoW). Add /keys.json to the ingress_scripts carve-out path list. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
root	7a297deb24	Woodpecker CI deploy [CI SKIP]	2026-06-01 10:36:49 +00:00
Viktor Barzin	e63a812062	kms: dedicated vlmcs.viktorbarzin.me endpoint + Anubis /scripts carve-out Internal split-horizon resolves kms.viktorbarzin.me to Traefik (10.0.20.203), which has no :1688 listener, so LAN clients pointed at kms.viktorbarzin.me:1688 failed with 0xC004F074 "no KMS could be contacted". Add a dedicated A-only vlmcs.viktorbarzin.me (cloudflare_record.vlmcs -> 176.12.22.76 for the public WAN NAT; Technitium -> 10.0.20.202 internal, set via API) so it resolves to vlmcsd both ways. Also carve /scripts/* out of Anubis (module.ingress_scripts -> bare kms-web-page service) so `iwr | iex` downloads the real script instead of the PoW challenge HTML. Verified end-to-end on Win VM 300: reproduced 0xC004F074 on the old host, then slmgr + ospp + both PowerShell one-liners all -> Licensed via vlmcs (10.0.20.202). Docs: kms-public-exposure runbook + service-catalog entry. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-01 10:36:49 +00:00
root	de04ed099e	Woodpecker CI Update TLS Certificates Commit	2026-06-01 10:36:49 +00:00
Viktor Barzin	e5d9160a88	monitoring: KEEL/tier ignore_changes on 5 exporters [ci skip] goflow2, snmp-exporter, pve-exporter, idrac-redfish and the sysctl-inotify daemonset were missed by the `cdb7d9a8` KEEL_LIFECYCLE sweep. The monitoring ns is keel-enrolled (policy=patch) so Keel owns their image tags + injects keel.sh annotations; TF kept trying to revert both, plus a live-stamped tier label — which made `terragrunt plan -detailed-exitcode` return 2 every run and the drift-detection cron fail daily. Add the standard KEEL ignore_changes (image + keel.sh annotations) and ignore the tier label so these stop churning. Declarative-only: takes effect at next plan, no apply needed. [ci skip] so this does not trigger a monitoring apply. Remaining (separate) drift: the grafana ACL null_resource (triggers.always) + tls cert refresh. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 15:33:30 +00:00
Viktor Barzin	935fb07df7	hermes-agent: gate PVC on parked flag (clears PVCStuckPending) The data_proxmox PVC is WaitForFirstConsumer; with the Deployment parked at replicas=0 it had no consumer pod and sat Pending forever, falsely tripping PVCStuckPending (which halts kured reboots). Introduce local.hermes_parked to drive both replicas and the PVC count, so a parked service has no PVC at all. Empty/never-bound PVC removed; recreated automatically when un-parked. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 15:19:28 +00:00
Viktor Barzin	7b6a0e70af	hermes-agent: opt out of external monitor while parked hermes-agent is parked at replicas=0 (PVC perms bug, 2026-04-22). Its auto-created Uptime Kuma external monitor was down → ExternalAccessDivergence firing, which halts kured node reboots. Set external_monitor=false so a deliberately-down service stops tripping the divergence gate. Re-enable when the deployment is brought back up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 15:12:33 +00:00
Viktor Barzin	51313ee088	kured: fix sentinel-gate OOM — 256Mi limit + self-restart leak guard The k8s-master gate pod OOM-killed child kubectls 149x/7d (accelerating: 0/day → 15 → 134) while master sat in pending-reboot. Root cause: only the pending-reboot node's gate pod runs the kubectl-heavy hot path each cycle, and the immortal bash loop slowly leaks (kubectl forks + Check-4 process substitution) past the 64Mi cgroup limit. PID 1 bash survives each kill, so the pod never restarts — just silent oom_events. Fix: raise limit 64Mi→256Mi (headroom for ~30-50Mi kubectl forks) + add a MAX_ITER=72 self-exit (~6h) so kubelet restarts the pod fresh and the leak can never accumulate, regardless of how long a node stays pending-reboot. Docs: post-mortem + automated-upgrades.md gate note. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-31 14:49:04 +00:00
Viktor Barzin	0c64fc2948	travel-agent: switch from Slack webhook to bot token (chat.postMessage)	2026-05-30 22:44:11 +00:00
Viktor Barzin	46f63bb70e	infra: travel-agent stack (namespace + ExternalSecret + 2 CronJobs)	2026-05-30 18:24:13 +00:00
Viktor Barzin	e1ab23193d	redis: revert 3-node Sentinel HA to single standalone instance [ci skip] The redis-v2 Sentinel cluster split-brained: redis-v2-0 booted during a network partition, hit the init script's deterministic "pod-0 = bootstrap master" fallback, and became a SECOND master alongside the sentinel-elected redis-v2-2. HAProxy's `expect rstring role:master` matched both and round-robined client connections across the two diverging masters, so Immich enqueued BullMQ jobs on one while its workers blocked-popped on the other -> every queue wedged and new-upload thumbnails 404'd cluster-wide. Third Sentinel-class incident in ~6 weeks (after the 2026-04-19 PM quorum drift and 2026-04-22 flap cascade). Revert to a single standalone instance: replicas=1; drop Sentinel + HAProxy + init bootstrap configmap + both PDBs; redis container only (+ exporter). maxmemory-policy allkeys-lru -> volatile-lru so one shared instance serves both workload classes correctly: evict only TTL'd cache keys, never TTL-less Immich BullMQ / Celery job keys. redis-master service name/DNS unchanged -> no consumer edits; collapsed onto redis-v2-0's existing dataset (queued jobs preserved). Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source + docs only, hence [ci skip]. Monitoring: drop RedisReplicationLagHigh + RedisReplicasMissing (no replicas now; the latter would false-fire), RedisMemoryPressure 85%->80% volatile-lru backstop. Docs: rewrite databases.md Redis section (single-instance design + incident history); add post-mortem 2026-05-30-redis-split-brain.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 17:49:43 +00:00
Viktor Barzin	5bcb4525a4	traefik: uncap download duration (writeTimeout 60s->0), upload window 3600s [ci skip] Large Immich video downloads and uploads failed at a hard ~60s wall. The websecure entrypoint set respondingTimeouts.{read,write}Timeout=60s; unlike nginx proxy_*_timeout (per-read idle), Traefik respondingTimeouts are hard caps on total request/response duration, so every transfer slower than 60s was cut mid-stream. Reproduced: a 6 MB/s throttled 650MB download died at 386MB / 62s with an HTTP/2 stream reset. - writeTimeout=0 (Traefik's default, which Immich's reverse-proxy guidance assumes): unlimited download size/duration. - readTimeout=3600s: passes multi-GB uploads while keeping a slow-loris backstop (Immich has no resumable upload, so the window must exceed real upload times). Verified: the same 650MB download now completes fully (650MB / 102s, exit 0). IPv6 path needs no change - the pfSense bridge HAProxy 1h timeouts are inactivity-based, not total caps. Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source + docs only, hence [ci skip]. Docs: networking.md (Entrypoint Transport Timeouts + troubleshooting), .claude/CLAUDE.md networking note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 17:46:59 +00:00
Viktor Barzin	89561c7779	technitium: complete Traefik .200->.203 migration for the .lan zone [ci skip] Today's Traefik dedicated-IP migration (.200 -> .203, ETP=Local) updated the viktorbarzin.me zone but missed the viktorbarzin.lan zone + two stale .200 literals — breaking every *.viktorbarzin.lan ingress host (internal exporters + ~15 HA-Sofia sensors via idrac-redfish/nvidia/snmp) and tripping the apex-drift probe. Found via /cluster-health (23 alerts -> 7). - apex-probe EXPECTED .200 -> .203 (apex IS .203; probe asserted the wrong value -> false ViktorBarzinApexDrift "critical"). - split-horizon externalToInternalTranslation .200 -> .203 (sofia-lan hairpin-NAT target). - ingress-dns-sync CronJob now also pins ingress.viktorbarzin.lan A to the LIVE Traefik LB IP (queried from svc/traefik) every run, so a future Traefik IP move can't silently break the .lan zone again. Added services get/list to its ClusterRole. Applied via targeted apply (4 resources, 0 destroyed) + manual CronJob triggers; verified apex correct=1 and the .lan anchor self-pins to .203. [ci skip] because a full technitium apply would also pick up unrelated pre-existing deployment drift (DNS pod restart risk) — left untouched. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 16:54:09 +00:00
Viktor Barzin	a222c024fd	docs: correct tripit DNS classification to proxied [ci skip] tripit's ingress is dns_type="proxied" (Cloudflare), not non-proxied. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 15:00:49 +00:00
Viktor Barzin	b78378eda9	docs: catalog tripit service (service-catalog + databases) [ci skip] Add tripit (self-hosted TripIt-clone travel-itinerary PWA) to the service catalog Optional tier and Non-Proxied DNS list, and to the CNPG consumer + PostgreSQL rotation lists in the databases doc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 14:59:01 +00:00
Viktor Barzin	c2b820dc55	postiz: adopt drifted resources into TF state; exclude stuck Helm release The 2026-05-24 apply was interrupted with the Helm release stuck in pending-install, leaving only 2 of ~12 resources in TF state (any apply errored "already exists"). Adopted the live resources back via import {} sweep (namespace, tls-secret, uploads PVC, ESO ExternalSecret, both ingresses, temporal Service, nfs backup PV+PVC); plan now reaches zero. Reconciled code to live reality (zero runtime change to running postiz): - Removed kubernetes_deployment.temporal + kubernetes_job.temporal_search_attr_cleanup: the temporal Deployment is gone from the cluster (only the Service survives). Scheduled posts remain unavailable until temporal is restored; immediate posting works. - Removed helm_release.postiz from TF entirely: importing it would force a helm upgrade (provider can't match merged values to config) and the release is stuck pending-install. Left Helm-managed outside TF. - Removed keel.sh/enrolled=true from the namespace (postiz was opted out of Keel on 2026-05-29; this would have re-enrolled it on apply). - Backup CronJob now dumps only the `postiz` DB (temporal/temporal_visibility DBs don't exist) and no longer depends_on the removed helm_release. Applied: 9 imported, 1 added (backup CronJob), 6 changed (benign), 0 destroyed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 14:36:07 +00:00
Viktor Barzin	01351e4ce2	tripit: deploy stack + DB provisioning + ongoing mail-ingest [ci skip] - stacks/tripit: namespace, ESO (vault-kv + vault-database), Deployment (alembic init + app), Service, NFS document PVC, ingress (Authentik forward-auth) + /api/calendar carve-out (auth=none, HMAC-token gated), and 3 worker CronJobs. ingest-mail is live: real IMAP (me@, read-only BODY.PEEK, recent-30) + local LLM (qwen3vl-4b on llama-swap), idempotent (skips seen message_ids), owner me@viktorbarzin.me. - stacks/dbaas: create CNPG role+db `tripit`. - stacks/vault: pg-tripit static role (7d rotation) + allowed_roles entry. Deployed at tripit.viktorbarzin.me. [ci skip]: stacks were applied out-of-band via scripts/tg this session; a CI re-apply would also apply unrelated pre-existing dbaas/vault drift (MySQL StatefulSet, vault OIDC). Refs: code-bb9g, code-muqi Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 10:23:11 +00:00
Viktor Barzin	e9046e5a26	traefik+pfsense: real IPv6 client IPs via HAProxy PROXY-v2 bridge Replace the pfSense socat IPv6 forwarder (which masked every IPv6 client as 10.0.20.1) with a standalone HAProxy bridge using send-proxy-v2, so real IPv6 client IPs reach Traefik/CrowdSec. Traefik now trusts PROXY-v2 only from 10.0.20.1 on the web/websecure entrypoints; real IPv4 clients (ETP=Local, own source IP) are unaffected. Mail-over-IPv6 routed through the mail NodePorts (send-proxy-v2) too. Bridge is TCP/h2 only (no QUIC over IPv6). Persistence on pfSense: rc.d/ipv6proxy + ipv6_proxy.sh (config.xml shellcmd), keeping the nginx-off-[::] patch. Also fixes stale networking.md: Traefik was still documented on the shared .200; it moved to dedicated .203/ETP=Local on 2026-05-30. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 09:51:23 +00:00
Viktor Barzin	16c9aafafa	docs: Traefik dedicated-IP + ETP=Local cutover SUCCEEDED (attempt 2) Records the successful cutover and the key fix that made it safe: decouple cloudflared from the LB IP first (point its tunnel ingress at the in-cluster Traefik Service), so moving Traefik 10.0.20.200 -> 10.0.20.203 no longer breaks proxied apps or Vault's ingress. Updates infra CLAUDE.md Networking notes with the new Traefik LB IP / ETP=Local / cloudflared->ClusterIP state. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-30 08:12:57 +00:00
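
Commit 3d28870e25 above describes a retention bug: backup directories are named YYYYMMDD_HHMMSS, but cleanup sorted by mtime, and `rsync -a` stamps the backup directory with the source's mtime, so the newest backup was deleted. A minimal Python sketch of the name-based ordering follows. It is an illustration only, not the repository's actual cleanup script; the names `STAMP`, `stale_backups`, and `prune` are hypothetical, and it assumes every backup directory matches the timestamp pattern exactly.

```python
import re
import shutil
from pathlib import Path

# Assumption: backup dirs are named YYYYMMDD_HHMMSS, as the commit message states.
STAMP = re.compile(r"^\d{8}_\d{6}$")


def stale_backups(names, keep=1):
    """Return the backup names to delete, keeping the `keep` newest.

    Ordering is lexical on the name, which is chronological for
    zero-padded YYYYMMDD_HHMMSS stamps. Directory mtime is never consulted.
    Names that do not match the pattern are ignored (never deleted).
    """
    dated = sorted(n for n in names if STAMP.match(n))
    if keep <= 0:
        return dated
    return dated[:-keep]


def prune(root, keep=1, dry_run=True):
    """Delete stale backup dirs under `root`; only report them when dry_run."""
    root = Path(root)
    names = [p.name for p in root.iterdir() if p.is_dir()]
    doomed = stale_backups(names, keep)
    if not dry_run:
        for name in doomed:
            shutil.rmtree(root / name)
    return doomed
```

Calling `prune("/path/to/backups", keep=1)` with the default `dry_run=True` only lists what would be removed, which is a cautious default for a destructive step; the path here is a placeholder.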


3895 commits