Commit graph

3891 commits

Author SHA1 Message Date
Viktor Barzin
5e4f83d4e7 wealth: consolidation chunk 1 — merge NW/contribution/growth, returns table, yearly combo
36 -> 19 panels (chunk 1 of 2), zero metric loss:
- 3 NW/contribution/growth timeseries -> 1 "contribution vs market value (+growth)"
- 11 returns/Δ stat cards (12mo x3 + Δ 1d/7d/30d/90d all&mkt) -> 1 "Returns over
  time windows" table (window × Δall/Δmkt/return%)
- 2 yearly barcharts -> 1 combo (contributions/market-gain bars + return-% line,
  timeFrom=10y so full history always shows)

All SQL validated live. Chunk 2 (net-pay $grain merge, projection->Trend panel,
row reorg) to follow.

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 22:27:09 +00:00
Viktor Barzin
a09b0b3612 docs(t3code): implementation plan for per-user auto-provisioning
Task-by-task plan pairing with the design doc: Task 1 discovers the t3
web-auth contract (cookie name + bootstrap body), then systemd template,
reconcile, devvm dispatch+auto-pair Go service, scoped sudoers, TF repoint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 22:19:22 +00:00
Viktor Barzin
1a0647c7ed docs(t3code): design for per-user auto-provisioning (Authentik login → instance + session)
Approach 1: /etc/ttyd-user-map as source of truth; per-user t3-serve@.service
template (User=%i enforces file permissions); devvm reconcile; devvm
dispatch+auto-pair service (mints + injects the t3 session cookie on first
authenticated visit, replacing the in-cluster nginx). Spec for review before
writing the implementation plan.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 22:10:05 +00:00
Viktor Barzin
55ed50b932 docs(plans): wealth dashboard consolidation design
Consolidate the wealth Grafana dashboard 36 -> ~17 panels with zero metric
loss: merge the 3 NW/contribution/growth timeseries into 1, the 11 returns/Δ
stat cards into 1 returns table, the 2 yearly barcharts into 1 combo, and the
3 net-pay-vs-market-gain panels into 1 (grain dropdown); reorganize into
collapsed rows. Also rebuild the projection as a Trend panel (numeric
years-from-today x-axis) so it renders regardless of the dashboard time range
(fixes empty-by-default). Philosophy: merge duplicates, keep every metric.

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 21:52:59 +00:00
Viktor Barzin
73cb0aab8b t3code: per-user isolation via Authentik + nginx username dispatcher
t3 is single-owner (no in-app multi-user), so each person runs their own
`t3 serve` on the DevVM as their own OS user: wizard→:3773 (t3-serve.service),
emo→:3774 (t3-serve-emo.service). An in-cluster nginx `t3-dispatch` maps the
Authentik-injected X-authentik-username to the right instance; unmapped
identities get 403 (no shared fallback). Flipped the ingress auth app→required
(Authentik forward-auth) — the same-origin self-served UI works behind it (WS
carries the Authentik cookie) and t3's own pairing/bearer stays the inner gate.
Mirrors the terminal stack's per-user model.

Verified: dispatcher routes vbarzin→:3773, emil.barzin→:3774, unmapped→403;
t3.viktorbarzin.me now 302s to Authentik. Cross-origin native app / app.t3.codes
intentionally unsupported here — deferred until the native app is published.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 21:38:06 +00:00
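The dispatch rule described in 73cb0aab8b can be sketched as a plain lookup. This is an illustration in Python, not the nginx `t3-dispatch` config the commit ships; the two routes are the ones the commit reports as verified, while the function name and the "None means answer 403" convention are assumptions made for the example.

```python
from typing import Optional

# Routes reported as verified in the commit message; everything else is unmapped.
USER_PORTS = {
    "vbarzin": 3773,
    "emil.barzin": 3774,
}


def upstream_port(authentik_username: Optional[str]) -> Optional[int]:
    """Return the per-user t3 port, or None when the caller should answer 403.

    There is deliberately no shared fallback: a missing or unknown
    X-authentik-username value never reaches any instance.
    """
    if not authentik_username:
        return None
    return USER_PORTS.get(authentik_username)
```

Returning None instead of a default port is the point of the design: an identity that is authenticated by Authentik but absent from the map is refused rather than routed to someone else's instance.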
Viktor Barzin
9fb3e6e851 docs: correct cloudflared-502 post-mortem + fix stale .200 Traefik ref [ci skip]
Real root cause of the 2026-06-01 full-site 502 was not a missed
reference but an out-of-band fix that Terraform reverted: the 2026-05-30
Traefik .200->.203 migration repointed the Cloudflare tunnel to the
Traefik service DNS via the CF Global API Key, but never landed that
change in cloudflare.tf (left at .200). A terragrunt apply on 2026-06-01
reconciled live back to the stale .200, breaking all external ingress.
Rewrite the post-mortem around the "codify out-of-band fixes or TF
reverts them" lesson (a Terraform-Only-rule violation).

Also fix docs/runbooks/kms-public-exposure.md, which still claimed
Traefik served on 10.0.20.200:443 (now .203) — same migration fallout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 21:25:33 +00:00
Viktor Barzin
f807050eb5 cloudflared: fix tunnel origin .200 -> Traefik svc DNS (full-site 502 outage) [ci skip]
The Cloudflare tunnel routed *.viktorbarzin.me and the apex to
https://10.0.20.200:443, but Traefik moved off the shared MetalLB .200
onto its dedicated 10.0.20.203 on 2026-05-30 (commit 0c01adac). Nothing
serves HTTPS on .200:443 anymore, so cloudflared could not reach its
origin (no route to host / i/o timeout) and Cloudflare returned 502 for
every externally-proxied service. Internal/LAN access (split-horizon ->
.203) was unaffected, which masked the outage.

Repoint both ingress rules at the in-cluster Traefik Service DNS
(https://traefik.traefik.svc.cluster.local:443) -- the design the docs
already described but the code never implemented -- so the tunnel is
decoupled from the Traefik LB IP and this cannot recur on a future move.

Applied live via targeted apply on the tunnel config resource only;
[ci skip] because live already matches and a full stack apply would
churn unrelated pre-existing drift (Keel annotations, DKIM re-chunk).

Post-mortem: docs/post-mortems/2026-06-01-cloudflared-stale-traefik-origin.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 21:22:05 +00:00
Viktor Barzin
30a644d3cd docs(kms): document reboot-after-uninstall / 1603 handling + real-hardware status
The bundled consumer Office removal leaves a pending reboot; a same-run VL
install (or re-run before rebooting) fails with setup.exe 1603. Document the two
guards (hard-reboot gate + reboot-aware 1603 message), the C2R-log capture, and
the on-disk completion poll. Record that the uninstall path is now verified on a
real M365 box (O365HomePremRetail removed) and the install needs a reboot first.
2026-06-01 21:22:05 +00:00
Viktor Barzin
a382683c0e infra: fix containerd forgejo-registry redirect .200->.203 (+skip_verify)
Traefik moved off shared .200 to its dedicated .203 on 2026-05-30, but the
containerd hosts.toml redirect for forgejo.viktorbarzin.me still pointed at the
now-dead .200:443 -> every FRESH forgejo pull failed (cached images kept
running, so it stayed hidden until a new image tag was pulled). Retarget to
.203 and add skip_verify (node dials Traefik by IP; cert is for
forgejo.viktorbarzin.me) in both the new-node cloud-init and existing-node
deploy scripts. Already rolled to all 7 nodes (rewrite + restart containerd,
no drain). Doc fix in .claude/CLAUDE.md.
2026-06-01 21:22:05 +00:00
Viktor Barzin
82855848d1 plans: TopoLVM migration evaluation (Path 3 for LUN-cap relief)
Decision-support doc, NOT a commitment. Evaluates whether replacing
proxmox-csi with TopoLVM would lift the per-VM 29-PVC ceiling
permanently and at what cost.

Key trade-off documented: TopoLVM PVCs are pinned to the node where
the LV lives (topology.topolvm.cybozu.com/node). proxmox-csi PVCs
migrate between VMs when pods reschedule. The data-locality penalty
matters most for single-replica stateful services (MySQL standalone,
Nextcloud, Vaultwarden, mailserver, claude-memory, ~30 SQLite-backed
apps); replicated services (CNPG PG cluster, Redis-v2, Vault Raft)
absorb it.

Three disk-layout options:
  A. Carve per-VM data disks from sdc — simple, no hardware,
     IO contention unchanged
  B. Hybrid SSD/HDD — SSD-constrained at 675 GiB free
  C. Add a dedicated NVMe — also closes beads code-oflt (IO
     contention), ~£200 hardware investment

Effort estimate: 2.5-3 weeks of focused work for the full migration;
covers TopoLVM install, lvmd config, per-VM disk provisioning,
LUKS plumbing, 5 migration waves (regenerable → huge PVCs),
backup-pipeline rewrite, deprecation.

Recommended next step before committing: small pilot on
k8s-node5/6 with one non-critical PVC to validate the operational
pattern end-to-end.

Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap,
docs/runbooks/scale-k8s-cluster.md (Path 1+2 alternative),
beads code-oflt (IO isolation).
2026-06-01 21:22:05 +00:00
Viktor Barzin
599d67db51 docs(kms): self-hosted ODT bootstrapper + anonymous client telemetry (kms-diag/Loki)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 21:22:05 +00:00
Viktor Barzin
f364399ede wealth: add 30y net-worth projection row + align net-pay panel
Implements the committed projections design (docs/plans/2026-05-28-wealth-
projections-{design,plan}.md): a collapsed "Projections" row on the wealth
dashboard with 5 template vars (rate_low/base/high, monthly_contribution=auto,
horizon_years=30), a multi-scenario projection panel (Low/Base/High + trailing-
3y historical line + a base-rate compounding-only line), 3 stat cards, and a
text panel with one-click future time-range links.

Projection is pure SQL over dav_corrected: compound + ordinary-annuity FV from
today's net worth; auto contribution = trailing-12mo run-rate (COALESCE/NULLIF
so $monthly_contribution=auto doesn't constant-fold 'auto'::numeric). Historical
rate = trailing-3-full-year geometric mean of per-year Modified-Dietz returns
(~10.4%) — all-time was a nonsense 83% because the all-accounts-complete window
is only ~4 months, and the true all-time geomean is skewed by 2021's +86%.

Also aligns "Net pay vs market gain — per month" to consecutive month-end
deltas (same fix as the other monthly panels). Verified all SQL live.

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 19:50:41 +00:00
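The projection math named in f364399ede (compound growth plus an ordinary-annuity future value, and a geometric mean of per-year returns) can be sketched in Python. The commit implements this in SQL over dav_corrected; the function names, the monthly compounding convention, and the conversion of an annual rate to a monthly one are assumptions made here for illustration and may not match the panel's query.

```python
import math
from typing import Sequence


def geometric_mean_return(yearly_returns: Sequence[float]) -> float:
    """Geometric mean of per-year returns given as fractions (0.10 == 10%)."""
    if not yearly_returns:
        raise ValueError("need at least one yearly return")
    growth = math.prod(1.0 + r for r in yearly_returns)
    return growth ** (1.0 / len(yearly_returns)) - 1.0


def projected_net_worth(net_worth: float, annual_rate: float,
                        monthly_contribution: float, years: int) -> float:
    """Compound today's net worth and add an ordinary annuity of contributions.

    Assumption: contributions land at the end of each month (ordinary
    annuity) and the annual rate is converted to an equivalent monthly rate.
    """
    months = 12 * years
    monthly_rate = (1.0 + annual_rate) ** (1.0 / 12.0) - 1.0
    growth = (1.0 + monthly_rate) ** months
    if monthly_rate == 0.0:
        annuity = monthly_contribution * months  # no growth: plain sum
    else:
        annuity = monthly_contribution * (growth - 1.0) / monthly_rate
    return net_worth * growth + annuity
```

A geometric mean is used for the historical rate because yearly returns compound: one outlier year, such as the +86% the commit mentions, pulls an average over a short window far from what a longer series would give.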
Viktor Barzin
32e1042ca8 t3code: expose t3 serve (DevVM) publicly at t3.viktorbarzin.me (app-tier)
New stacks/t3code mirrors stacks/terminal: K8s Service + Endpoints →
10.0.10.10:3773 plus an ingress_factory route (dns_type=proxied,
auth="app"). t3 ships its own owner-pairing + bearer-session auth, so
Authentik forward-auth is intentionally omitted — it would break the
cross-origin native mobile app and app.t3.codes (bearer-only, no
Authentik cookie). CrowdSec + anti-AI (both default-on for app-tier)
rate-limit the public surface; t3's pairing is the gate. TLS is
auto-synced into the namespace by Kyverno's sync-tls-secret policy.

Verified end-to-end: t3.viktorbarzin.me → CF → Traefik → devvm:3773 = 200.
Trade-off (public RCE surface behind app-native auth, no Authentik SSO)
accepted 2026-06-01 to keep the native app + app.t3.codes working.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 19:50:41 +00:00
Viktor Barzin
c5e4b1ea71 kms: add /diag anonymous telemetry collector behind Anubis carve-out
The PowerShell activation scripts POST small JSON diagnostics to
/diag so script execution errors are captured. The collector
(python:3.12-alpine, ConfigMap-mounted) prints each event to stdout
as a KMSDIAG line; the cluster's Loki scrapes pod stdout, making
events searchable in Grafana (Loki only — no Slack, no Prometheus).

Like /scripts, /diag needs a second ingress_factory carve-out with
full_host="kms.viktorbarzin.me" so it bypasses the Anubis PoW
challenge that PowerShell/curl can't solve. Without full_host the
factory would derive kms-diag.viktorbarzin.me and the carve-out
would never match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 19:50:41 +00:00
Viktor Barzin
3fa9e2409c runbook: K8s worker scaling for PVC capacity headroom
Documents the 6-worker cluster shape (post 2026-05-26 scale-up after
the proxmox-csi LUN-cap incident), the six binding constraints (plugin
LUN cap at 29/VM, memory commit, sdc IO contention, GPU concentration
on node1, PVE host memory, no Terraform management for K8s VMs), and
the playbooks for adding/removing workers.

Scale-up triggers:
  - max-node VA count ≥ 25 (~86% of 29 cap) for ≥7 days
  - cluster memory requests > 90%
  - LUN-cap incident
  - planned ≥3 net-new block PVCs when max VA already ≥ 22
Scale-down conditions:
  - max-node PVC count ≤ 20, memory < 70%/95% for ≥30 days

Playbooks lean on scripts/provision-k8s-worker (clones template 2000,
cloud-inits, auto-joins) for adds; kubectl cordon → drain → delete
node → qm shutdown for removes. Cold-spare option documented.

Related: docs/architecture/storage.md § Per-VM SCSI-LUN cap,
docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md,
beads code-oflt (IO contention long-term fix).
2026-06-01 19:50:41 +00:00
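The scale-up triggers listed in 3fa9e2409c can be written as a small check. This is a sketch only: the thresholds come from the commit message, but the `ClusterSnapshot` type, its field names, and the reason strings are hypothetical, and the runbook itself may evaluate these conditions by hand.

```python
from dataclasses import dataclass
from typing import List

LUN_CAP_PER_VM = 29  # per-VM plugin LUN cap quoted in the commit


@dataclass
class ClusterSnapshot:
    max_node_va_count: int       # highest "VA count" on any node, as the runbook calls it
    days_at_or_above_25: int     # consecutive days that count has been >= 25
    memory_requests_pct: float   # cluster memory requests, in percent
    lun_cap_incident: bool       # a LUN-cap incident has occurred
    planned_new_block_pvcs: int  # net-new block PVCs planned


def scale_up_reasons(s: ClusterSnapshot) -> List[str]:
    """Return the scale-up triggers that currently fire (empty list: none)."""
    reasons = []
    if s.max_node_va_count >= 25 and s.days_at_or_above_25 >= 7:
        reasons.append("sustained VA count")
    if s.memory_requests_pct > 90:
        reasons.append("memory requests")
    if s.lun_cap_incident:
        reasons.append("LUN-cap incident")
    if s.planned_new_block_pvcs >= 3 and s.max_node_va_count >= 22:
        reasons.append("planned PVCs")
    return reasons
```

For reference, 25 of 29 is about 86% of the cap, which matches the figure quoted in the trigger list.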
Viktor Barzin
5c77482a8c fire-planner: LLM_MODEL env var → qwen3vl-4b default (fits in current GPU headroom; immich-ml is holding ~10GB)
2026-06-01 19:50:41 +00:00
Viktor Barzin
fb1e47a20a nextcloud: re-enable Keel auto-upgrades with occ-upgrade self-heal + live-tag floor
Re-enrolls Nextcloud in Keel (opted out after the 2026-05-26 32.0.3->32.0.9
bump stuck the pod in maintenance mode ~22h). Two safeguards engineer around
both failure modes:

- F1 (interrupted occ upgrade -> 503): nextcloud-watchdog CronJob runs
  `occ upgrade` + clears maintenance mode when occ reports needsDbUpgrade=true;
  Job deadline bumped 120->600s so it isn't killed mid-migration.
- F2 (helm re-renders a tag below the Keel-bumped live image -> downgrade
  CrashLoop): chart_values renders the live tag via a plural
  kubernetes_resources data source (empty-list-on-absence -> floor 32.0.9 on
  fresh install/DR), so a re-render never downgrades below live.

Scope is patch -- Kyverno's shared inject-keel-annotations policy stamps it and
its background-controller overrides a TF-set value, and patch == minor for
Nextcloud in practice (32.0.x only; major 33 stays manual). Dropped the
per-workload keel.sh/policy override resources to avoid perpetual drift; ns
enrollment + Kyverno now own the keel annotations like other workloads.

Also bumps the external-storage bootstrap Job create timeout 1m->12m to match
its own 10m pod-wait, since Keel bumps now roll the pod mid-apply.

Verified: Keel auto-upgraded 32.0.9->32.0.10 on apply, entrypoint occ upgrade
completed clean (no watchdog needed), pod 2/2, HTTP 200, plan shows no drift.
2026-06-01 19:50:41 +00:00
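The F2 safeguard in fb1e47a20a (render the live tag, fall back to a floor when nothing is deployed) reduces to the following rule. The commit does this in Terraform with a plural data source; this Python version is only an illustration, and the function and parameter names are assumptions.

```python
from typing import Sequence

FLOOR_TAG = "32.0.9"  # floor quoted in the commit for fresh install / DR


def rendered_tag(live_tags: Sequence[str], floor: str = FLOOR_TAG) -> str:
    """Pick the image tag a re-render should use.

    live_tags models a plural lookup: it is empty when the workload does not
    exist yet, otherwise it holds the tag currently running.
    """
    if not live_tags:
        return floor  # nothing deployed: start from the floor
    return live_tags[0]  # never render below what is already live
```

Because the lookup returns an empty list instead of failing when the workload is absent, the same expression works on a fresh install and on a cluster where Keel has already bumped the image.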
Viktor Barzin
50d0f1affa kyverno: strip orphaned keel.sh/match-tag fleet-wide (image-swap fix)
The 2026-05-26 migration flipped the keel default force->patch and dropped
match-tag from the inject-keel-annotations patch, but Kyverno's add-only
mutate can't remove an annotation that's no longer listed -- 194 workloads
kept a stale keel.sh/match-tag=true. Under it Keel cross-assigned images in
multi-image pods: the blog's nginx<->nginx-exporter images were swapped and
the site was down 2026-05-26 -> 06-01 (nginx received the exporter's
-nginx.scrape-uri arg and CrashLoopBackOff'd); changedetection was silently
swapped (app lost its /datastore PVC + env, ran ephemeral for days).

- policy now sets keel.sh/match-tag=null (strips on admission, never re-added)
- swept the annotation off all 194 existing workloads (kubectl, no pod restart)
- AGENTS.md: documents the strip; post-mortem added

blog + changedetection un-swapped via kubectl set image (TF-ignored images);
both 2/2 and serving 200. Policy already applied via scripts/tg (Tier-1 PG
state authoritative). [ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 19:50:41 +00:00
Viktor Barzin
769ae7a6d3 traefik: bot-block-proxy buffer 256k + document the real HTTP/2 limit
Follow-up to the 64k bump: raised bot-block-proxy large_client_header_buffers
to 256k and corrected the rationale. Investigation found the *binding* limit
for browsers is Traefik's HTTP/2 header cap (~64KB, Go maxHeaderListSize, not
exposed by Traefik config) — oversized authentik_proxy_* cookie piles are
rejected at the h2 layer upstream of bot-block regardless of these buffers.
The real fix for >64KB piles is reducing authentik_proxy_* cookie accumulation
(or clearing cookies); these buffers only prevent bot-block being a tighter
bottleneck for sub-64KB piles + HTTP/1.1 clients.

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 15:15:27 +00:00
Viktor Barzin
1c165ce5b4 docs(kms): document the consequence-gated edition switch (changepk + ODT)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 15:15:27 +00:00
Viktor Barzin
3d28870e25 nextcloud: fix backup retention to sort by name, not mtime
The dated backup dirs are named YYYYMMDD_HHMMSS, but the cleanup used
`ls -dt` (mtime). `rsync -a` stamps the backup dir with the SOURCE dir's
mtime, so the freshest backup didn't sort as newest — the retention step
deleted the new backup and kept a stale one. Sort lexically (chronological
for these names) and keep the last.

Also exclude html/ (the app code, reproducible from the now-pinned image;
the real config lives at config/config.php, html/config is empty) so the
backup is config+data+custom_apps only → ~4.3G (<5G target).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 15:15:26 +00:00
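The retention fix in 3d28870e25 relies on YYYYMMDD_HHMMSS names sorting chronologically when sorted as plain strings. A minimal Python sketch of that selection follows; the actual cleanup is a shell step, and the function name, the regular expression, and the `keep` parameter are assumptions made for the example.

```python
import re
from typing import Iterable, List, Tuple

# Dated backup directories are named YYYYMMDD_HHMMSS.
BACKUP_NAME = re.compile(r"^\d{8}_\d{6}$")


def split_retention(dir_names: Iterable[str], keep: int = 1) -> Tuple[List[str], List[str]]:
    """Return (kept, deleted) backup names, choosing by name rather than mtime.

    Fixed-width, zero-padded timestamps sort lexically in chronological
    order, so the newest backups are simply the last entries after sorting.
    """
    dated = sorted(n for n in dir_names if BACKUP_NAME.match(n))
    if keep <= 0:
        return [], dated
    return dated[-keep:], dated[:-keep]
```

Sorting by name sidesteps the original bug: the directory's mtime is copied from the source by `rsync -a`, so it says nothing about when the backup was taken, while the name does.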
root
84ab4c998c Woodpecker CI deploy [CI SKIP]
2026-06-01 15:15:26 +00:00
Viktor Barzin
ddd582a28c backup: stop offsite-copying regenerable data; shrink nextcloud backup; pin nextcloud image
The offsite Synology hit 97% — the Backup share grew +670G in a week, traced
to the 2026-05-26 change that began mirroring large regenerable services
offsite, plus an unbounded nextcloud.log bloating its backups to 87G.

- nfs-mirror: re-exclude ollama, prometheus-backup, audiblez, ebook2audiobook
  (regenerable; live-only on sdc). Keep *-backup DB dumps (real safety copies).
- offsite-sync Step 2: nfs-ssd leg is now immich-only; ollama/llamacpp on the
  SSD no longer ship offsite (re-pullable models).
- daily-backup: skip nextcloud/nextcloud-data-proxmox (orphaned pre-encryption
  PV, still backed up weekly).
- nextcloud: cap+rotate the log (log_rotate_size=10MB); the dedicated backup
  now excludes html/ (app code, from image), logs, and preview cache and keeps
  only the latest copy (pvc-data holds version history) → <5G (was 87G).
- nextcloud: pin image to 32.0.9 in chart_values. A 2026-05-26 Keel bump moved
  the live pod to 32.0.9 (data migrated to 32.0.9.2) but TF still defaulted to
  32.0.3; reconciling that drift this session rolled a 32.0.3 pod that
  CrashLooped on the downgrade. Pinning eliminates the drift.

Docs: backup-dr.md + infra CLAUDE.md updated (add nfs-mirror, new exclusions).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 15:15:26 +00:00
Viktor Barzin
0dd4a31eff docs(immich): cap server-side job concurrency to protect sdc + log recurrence
A library-wide Duplicate Detection run on 2026-06-01 fanned the ML/thumbnail
backfill out at thumbnailGeneration concurrency 8, saturating the shared sdc
HDD and starving etcd -> kube-apiserver down ~30 min (5th IO-pressure incident
on sdc). Capped server-side thumbnailGeneration/metadataExtraction/library to 2
in the Immich DB system-config; documented in the Immich row and recorded the
recurrence + still-TODO IO-isolation fixes in the 2026-05-25 post-mortem (this
also commits that previously-untracked post-mortem).

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 15:15:26 +00:00
Viktor Barzin
af4bfbe046 kms: revert files accidentally bundled into the docs commit
The previous commit (81a7d804) swept in 23 unrelated working-tree files because
a rebase --autostash had left them staged in the index — including 4 files with
leftover git conflict markers (llama-cpp/main.tf, excalidraw/providers.tf,
url + wealthfolio .terraform.lock.hcl) from a stale 2026-05-25 stash, which is
invalid Terraform. Revert all 23 (terragrunt-generated backend/providers/lock
+ the llama-cpp markers) to their prior committed state; terragrunt regenerates
the generated files on the next run. Net effect of the docs commit is now just
the runbook doc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:36:49 +00:00
Viktor Barzin
bdb0cef242 docs(kms): document /keys.json carve-out + script auto-key selection
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:36:49 +00:00
Viktor Barzin
170a3bb052 traefik: bump bot-block-proxy large_client_header_buffers to 8x64k
The ai-bot-block forward-auth copies the full request (incl. the
accumulated authentik_proxy_<random> cookie pile) to bot-block-proxy.
With 30+ Authentik Proxy Providers under viktorbarzin.me the combined
Cookie header exceeds openresty's default 4x8k buffers, so the auth
check returned 400 "Request Header Or Cookie Too Large" (surfaced as
error-pages' "Too big request header" 431) and broke Woodpecker/Forgejo
OAuth sign-in for affected browsers.

Mirror the existing auth-proxy-config fix: 8x64k accepts the pile.
Applied live via tg apply + bot-block-proxy rollout restart.

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:36:49 +00:00
Viktor Barzin
6f0bdf2993 kms: carve /keys.json out of Anubis for script auto-key-selection
The activation scripts now fetch the published GVLK list from /keys.json to
auto-select the right key for the detected edition. Like the .ps1 scripts,
that endpoint must bypass Anubis (PowerShell/ConvertFrom-Json can't solve the
PoW). Add /keys.json to the ingress_scripts carve-out path list.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:36:49 +00:00
root
7a297deb24 Woodpecker CI deploy [CI SKIP]
2026-06-01 10:36:49 +00:00
Viktor Barzin
e63a812062 kms: dedicated vlmcs.viktorbarzin.me endpoint + Anubis /scripts carve-out
Internal split-horizon resolves kms.viktorbarzin.me to Traefik (10.0.20.203),
which has no :1688 listener — so LAN clients pointed at kms.viktorbarzin.me:1688
failed with 0xC004F074 "no KMS could be contacted". Add a dedicated A-only
vlmcs.viktorbarzin.me (cloudflare_record.vlmcs -> 176.12.22.76 for the public
WAN NAT; Technitium -> 10.0.20.202 internal, set via API) so it resolves to
vlmcsd both ways. Also carve /scripts/* out of Anubis (module.ingress_scripts
-> bare kms-web-page service) so `iwr | iex` downloads the real script instead
of the PoW challenge HTML.

Verified end-to-end on Win VM 300: reproduced 0xC004F074 on the old host, then
slmgr + ospp + both PowerShell one-liners all -> Licensed via vlmcs (10.0.20.202).

Docs: kms-public-exposure runbook + service-catalog entry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-01 10:36:49 +00:00
root
de04ed099e Woodpecker CI Update TLS Certificates Commit
2026-06-01 10:36:49 +00:00
Viktor Barzin
e5d9160a88 monitoring: KEEL/tier ignore_changes on 5 exporters [ci skip]
goflow2, snmp-exporter, pve-exporter, idrac-redfish and the sysctl-inotify
daemonset were missed by the cdb7d9a8 KEEL_LIFECYCLE sweep. The monitoring ns
is keel-enrolled (policy=patch) so Keel owns their image tags + injects keel.sh
annotations; TF kept trying to revert both, plus a live-stamped tier label —
which made `terragrunt plan -detailed-exitcode` return 2 every run and the
drift-detection cron fail daily. Add the standard KEEL ignore_changes (image +
keel.sh annotations) and ignore the tier label so these stop churning.

Declarative-only: takes effect at next plan, no apply needed. [ci skip] so this
does not trigger a monitoring apply. Remaining (separate) drift: the grafana
ACL null_resource (triggers.always) + tls cert refresh.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 15:33:30 +00:00
Viktor Barzin
935fb07df7 hermes-agent: gate PVC on parked flag (clears PVCStuckPending)
The data_proxmox PVC is WaitForFirstConsumer; with the Deployment parked at
replicas=0 it had no consumer pod and sat Pending forever, falsely tripping
PVCStuckPending (which halts kured reboots). Introduce local.hermes_parked to
drive both replicas and the PVC count, so a parked service has no PVC at all.
Empty/never-bound PVC removed; recreated automatically when un-parked.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 15:19:28 +00:00
Viktor Barzin
7b6a0e70af hermes-agent: opt out of external monitor while parked
hermes-agent is parked at replicas=0 (PVC perms bug, 2026-04-22). Its
auto-created Uptime Kuma external monitor was down → ExternalAccessDivergence
firing, which halts kured node reboots. Set external_monitor=false so a
deliberately-down service stops tripping the divergence gate. Re-enable when
the deployment is brought back up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 15:12:33 +00:00
Viktor Barzin
51313ee088 kured: fix sentinel-gate OOM — 256Mi limit + self-restart leak guard
The k8s-master gate pod OOM-killed child kubectls 149x/7d (accelerating:
0/day → 15 → 134) while master sat in pending-reboot. Root cause: only the
pending-reboot node's gate pod runs the kubectl-heavy hot path each cycle,
and the immortal bash loop slowly leaks (kubectl forks + Check-4 process
substitution) past the 64Mi cgroup limit. PID 1 bash survives each kill, so
the pod never restarts — just silent oom_events.

Fix: raise limit 64Mi→256Mi (headroom for ~30-50Mi kubectl forks) + add a
MAX_ITER=72 self-exit (~6h) so kubelet restarts the pod fresh and the leak
can never accumulate, regardless of how long a node stays pending-reboot.

Docs: post-mortem + automated-upgrades.md gate note.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-31 14:49:04 +00:00
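The self-restart guard from 51313ee088 is a bounded loop: run the gate checks a fixed number of times, then exit so the pod is restarted with a fresh process. The real gate is a bash loop; this Python sketch only shows the shape. MAX_ITER=72 is from the commit, while the 300-second interval is inferred from "72 iterations, about 6h" and is an assumption, as are the function and parameter names.

```python
import time
from typing import Callable

MAX_ITER = 72  # from the commit: self-exit after about 6h


def run_gate(check: Callable[[], None], interval_s: float = 300.0,
             max_iter: int = MAX_ITER,
             sleep: Callable[[float], None] = time.sleep) -> int:
    """Run the check a bounded number of times, then return so the caller exits.

    Exiting on purpose lets the kubelet restart the container, which drops
    whatever memory the long-lived loop has accumulated.
    """
    for _ in range(max_iter):
        check()
        sleep(interval_s)
    return max_iter
```

The higher memory limit handles the normal working set; the bounded loop handles the slow leak, which no fixed limit can absorb if the process lives long enough.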
Viktor Barzin
0c64fc2948 travel-agent: switch from Slack webhook to bot token (chat.postMessage)
2026-05-30 22:44:11 +00:00
Viktor Barzin
46f63bb70e infra: travel-agent stack (namespace + ExternalSecret + 2 CronJobs)
2026-05-30 18:24:13 +00:00
Viktor Barzin
e1ab23193d redis: revert 3-node Sentinel HA to single standalone instance [ci skip]
The redis-v2 Sentinel cluster split-brained: redis-v2-0 booted during a network
partition, hit the init script's deterministic "pod-0 = bootstrap master"
fallback, and became a SECOND master alongside the sentinel-elected redis-v2-2.
HAProxy's `expect rstring role:master` matched both and round-robined client
connections across the two diverging masters, so Immich enqueued BullMQ jobs on
one while its workers did blocking pops on the other -> every queue wedged and
new-upload thumbnails 404'd cluster-wide. Third Sentinel-class incident in ~6
weeks (after the 2026-04-19 PM quorum drift and 2026-04-22 flap cascade).

Revert to a single standalone instance: replicas=1; drop Sentinel + HAProxy +
init bootstrap configmap + both PDBs; redis container only (+ exporter).
maxmemory-policy allkeys-lru -> volatile-lru so one shared instance serves both
workload classes correctly: evict only TTL'd cache keys, never TTL-less Immich
BullMQ / Celery job keys. redis-master service name/DNS unchanged -> no consumer
edits; collapsed onto redis-v2-0's existing dataset (queued jobs preserved).
Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source +
docs only, hence [ci skip].

Monitoring: drop RedisReplicationLagHigh + RedisReplicasMissing (no replicas
now; the latter would false-fire), RedisMemoryPressure 85%->80% volatile-lru backstop.

Docs: rewrite databases.md Redis section (single-instance design + incident
history); add post-mortem 2026-05-30-redis-split-brain.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 17:49:43 +00:00
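The maxmemory-policy change in e1ab23193d comes down to which keys are eligible for eviction. The toy model below shows only that eligibility rule; it does not reproduce Redis's sampled LRU, and the key names and the function are hypothetical.

```python
from typing import Dict, List, Optional


def eviction_candidates(ttls: Dict[str, Optional[int]], policy: str) -> List[str]:
    """Return the keys a policy is allowed to evict.

    ttls maps key name to remaining TTL in seconds, or None for keys that
    have no expiry set.
    """
    if policy == "allkeys-lru":
        return sorted(ttls)  # every key is fair game
    if policy == "volatile-lru":
        return sorted(k for k, ttl in ttls.items() if ttl is not None)
    raise ValueError(f"unsupported policy: {policy}")


# Hypothetical keys: one TTL'd cache entry, one TTL-less job record.
EXAMPLE_KEYS = {"cache:page:1": 300, "bull:thumbnails:42": None}
```

Under allkeys-lru the job record can be evicted under memory pressure; under volatile-lru only the cache entry can, which is why a single shared instance can serve both workload classes.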
Viktor Barzin
5bcb4525a4 traefik: uncap download duration (writeTimeout 60s->0), upload window 3600s [ci skip]
Large Immich video downloads and uploads failed at a hard ~60s wall. The
websecure entrypoint set respondingTimeouts.{read,write}Timeout=60s; unlike
nginx proxy_*_timeout (per-read idle), Traefik respondingTimeouts are hard caps
on total request/response duration, so every transfer slower than 60s was cut
mid-stream. Reproduced: a 6 MB/s throttled 650MB download died at 386MB / 62s
with an HTTP/2 stream reset.

- writeTimeout=0 (Traefik's default, which Immich's reverse-proxy guidance
  assumes): unlimited download size/duration.
- readTimeout=3600s: passes multi-GB uploads while keeping a slow-loris backstop
  (Immich has no resumable upload, so the window must exceed real upload times).

Verified: the same 650MB download now completes fully (650MB / 102s, exit 0).
IPv6 path needs no change - the pfSense bridge HAProxy 1h timeouts are
inactivity-based, not total caps. Applied via tg (Tier 1 / PG-authoritative
state); this commit syncs source + docs only, hence [ci skip].

Docs: networking.md (Entrypoint Transport Timeouts + troubleshooting),
.claude/CLAUDE.md networking note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 17:46:59 +00:00
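The effect of a hard cap on total response duration, as described in 5bcb4525a4, can be checked with simple arithmetic. This is an idealized sketch assuming a constant transfer rate; the function name is hypothetical, and the result only approximates the observed 386MB / 62s because the real throttle and cutoff were not exact.

```python
def delivered_mb(size_mb: float, rate_mb_s: float, write_timeout_s: float) -> float:
    """Megabytes delivered before a hard cap on total duration cuts the stream.

    write_timeout_s == 0 means no cap, matching the writeTimeout=0 setting
    in the commit.
    """
    duration_s = size_mb / rate_mb_s
    if write_timeout_s > 0 and duration_s > write_timeout_s:
        return rate_mb_s * write_timeout_s  # cut mid-stream at the cap
    return size_mb  # transfer finishes
```

At 6 MB/s a 650MB download needs roughly 108 seconds, so a 60-second cap stops it at about 360MB, while an idle-based timeout would never fire on a stream that keeps moving.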
Viktor Barzin
89561c7779 technitium: complete Traefik .200->.203 migration for the .lan zone [ci skip]
Today's Traefik dedicated-IP migration (.200 -> .203, ETP=Local) updated
the viktorbarzin.me zone but missed the viktorbarzin.lan zone + two stale
.200 literals — breaking every *.viktorbarzin.lan ingress host (internal
exporters + ~15 HA-Sofia sensors via idrac-redfish/nvidia/snmp) and
tripping the apex-drift probe. Found via /cluster-health (23 alerts -> 7).

- apex-probe EXPECTED .200 -> .203 (apex IS .203; probe asserted the wrong
  value -> false ViktorBarzinApexDrift "critical").
- split-horizon externalToInternalTranslation .200 -> .203 (sofia-lan
  hairpin-NAT target).
- ingress-dns-sync CronJob now also pins ingress.viktorbarzin.lan A to the
  LIVE Traefik LB IP (queried from svc/traefik) every run, so a future
  Traefik IP move can't silently break the .lan zone again. Added
  services get/list to its ClusterRole.

Applied via targeted apply (4 resources, 0 destroyed) + manual CronJob
triggers; verified apex correct=1 and the .lan anchor self-pins to .203.
[ci skip] because a full technitium apply would also pick up unrelated
pre-existing deployment drift (DNS pod restart risk) — left untouched.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-30 16:54:09 +00:00
Viktor Barzin
a222c024fd docs: correct tripit DNS classification to proxied [ci skip]
tripit's ingress is dns_type="proxied" (Cloudflare), not non-proxied.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-30 15:00:49 +00:00
Viktor Barzin
b78378eda9 docs: catalog tripit service (service-catalog + databases) [ci skip]
Add tripit (self-hosted TripIt-clone travel-itinerary PWA) to the
service catalog Optional tier and Non-Proxied DNS list, and to the
CNPG consumer + PostgreSQL rotation lists in the databases doc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-30 14:59:01 +00:00
Viktor Barzin
c2b820dc55 postiz: adopt drifted resources into TF state; exclude stuck Helm release
The 2026-05-24 apply was interrupted with the Helm release stuck in
pending-install, leaving only 2 of ~12 resources in TF state (any apply
errored "already exists"). Adopted the live resources back via import {}
sweep (namespace, tls-secret, uploads PVC, ESO ExternalSecret, both
ingresses, temporal Service, nfs backup PV+PVC) — plan now reaches zero.

Reconciled code to live reality (zero runtime change to running postiz):
- Removed kubernetes_deployment.temporal + kubernetes_job.temporal_search_
  attr_cleanup: the temporal Deployment is gone from the cluster (only the
  Service survives). Scheduled posts remain unavailable until temporal is
  restored; immediate posting works.
- Removed helm_release.postiz from TF entirely: importing it would force a
  helm upgrade (provider can't match merged values to config) and the
  release is stuck pending-install. Left Helm-managed outside TF.
- Removed keel.sh/enrolled=true from the namespace (postiz was opted out of
  Keel on 2026-05-29; this would have re-enrolled it on apply).
- Backup CronJob now dumps only the `postiz` DB (temporal/temporal_visibility
  DBs don't exist) and no longer depends_on the removed helm_release.

Applied: 9 imported, 1 added (backup CronJob), 6 changed (benign), 0 destroyed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-30 14:36:07 +00:00
Viktor Barzin
01351e4ce2 tripit: deploy stack + DB provisioning + ongoing mail-ingest [ci skip]
- stacks/tripit: namespace, ESO (vault-kv + vault-database), Deployment
  (alembic init + app), Service, NFS document PVC, ingress (Authentik
  forward-auth) + /api/calendar carve-out (auth=none, HMAC-token gated),
  and 3 worker CronJobs. ingest-mail is live: real IMAP (me@, read-only
  BODY.PEEK, recent-30) + local LLM (qwen3vl-4b on llama-swap), idempotent
  (skips seen message_ids), owner me@viktorbarzin.me.
- stacks/dbaas: create CNPG role+db `tripit`.
- stacks/vault: pg-tripit static role (7d rotation) + allowed_roles entry.

Deployed at tripit.viktorbarzin.me. [ci skip]: stacks were applied
out-of-band via scripts/tg this session; a CI re-apply would also apply
unrelated pre-existing dbaas/vault drift (MySQL StatefulSet, vault OIDC).

Refs: code-bb9g, code-muqi

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-30 10:23:11 +00:00
Viktor Barzin
e9046e5a26 traefik+pfsense: real IPv6 client IPs via HAProxy PROXY-v2 bridge
Replace the pfSense socat IPv6 forwarder (which masked every IPv6 client
as 10.0.20.1) with a standalone HAProxy bridge using send-proxy-v2, so
real IPv6 client IPs reach Traefik/CrowdSec. Traefik now trusts PROXY-v2
only from 10.0.20.1 on the web/websecure entrypoints; real IPv4 clients
(ETP=Local, own source IP) are unaffected. Mail-over-IPv6 routed through
the mail NodePorts (send-proxy-v2) too. Bridge is TCP/h2 only (no QUIC
over IPv6). Persistence on pfSense: rc.d/ipv6proxy + ipv6_proxy.sh
(config.xml shellcmd), keeping the nginx-off-[::] patch.

Also fixes stale networking.md: Traefik was still documented on the
shared .200; it moved to dedicated .203/ETP=Local on 2026-05-30.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-30 09:51:23 +00:00
Viktor Barzin
16c9aafafa docs: Traefik dedicated-IP + ETP=Local cutover SUCCEEDED (attempt 2)
Records the successful cutover and the key fix that made it safe: decouple
cloudflared from the LB IP first (point its tunnel ingress at the in-cluster
Traefik Service), so moving Traefik 10.0.20.200 -> 10.0.20.203 no longer
breaks proxied apps or Vault's ingress. Updates infra CLAUDE.md Networking
notes with the new Traefik LB IP / ETP=Local / cloudflared->ClusterIP state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-30 08:12:57 +00:00
Viktor Barzin
0c01adac95 traefik: dedicate LB IP 10.0.20.203 + externalTrafficPolicy=Local
Gives direct (non-proxied) apps real client IPs for CrowdSec (were SNAT'd to
the node IP under ETP=Cluster) and working QUIC. Companion change (NOT in TF —
remote cloudflared tunnel config, done via CF API): tunnel ingress repointed
from https://10.0.20.200:443 to https://traefik.traefik.svc.cluster.local:443
so proxied apps are decoupled from the LB IP. pfSense 443 NAT -> traefik_lb
alias (.203). See docs/plans/2026-05-30-traefik-dedicated-ip-etp-local-*.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-30 08:09:37 +00:00
Viktor Barzin
d6a61f00ad state(vault): update encrypted state
2026-05-30 07:59:28 +00:00
Viktor Barzin
aceee34889 state(dbaas): update encrypted state
2026-05-30 07:55:42 +00:00
Viktor Barzin
1473a94f29 docs/plans: Traefik dedicated-IP cutover attempt 1 post-mortem (rolled back)
Attempt rolled back to .200 baseline. Root blocker: cloudflared is a
token/dashboard-managed tunnel whose ingress targets the Traefik LB IP
(10.0.20.200), so moving Traefik to .203 took down all proxied apps. Retry
must also repoint the tunnel ingress (Cloudflare API). Also documents the
vault-ingress circular dep, SIGPIPE->stuck PG state-lock gotcha, and the
ETP=Local hairpin caveat.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-30 01:27:29 +00:00